CN118838701A

CN118838701A - Data processing method, apparatus, device, storage medium and computer program product

Info

Publication number: CN118838701A
Application number: CN202411312759.3A
Authority: CN
Inventors: 谢超华
Original assignee: Shenzhen Aotian Technology Co ltd
Current assignee: Shenzhen Aotian Technology Co ltd
Priority date: 2024-09-20
Filing date: 2024-09-20
Publication date: 2024-10-25

Abstract

The present application relates to the field of big data technology, and discloses a data processing method, device, equipment, storage medium and computer program product, including: dividing the pre-processed data stored in the database according to the data type and the expected access mode to obtain each divided data set; determining the processing logic corresponding to each divided data set; decomposing the processing logic into multiple micro-processing tasks according to the processing type, and sorting the micro-processing tasks according to the scheduling algorithm to obtain the optimal processing sequence; batch processing and stream processing of the data in each divided data set according to the optimal processing sequence and processing logic to obtain the target data and analysis results. The present application divides the data and determines the corresponding processing strategy. After obtaining the processing sequence according to the processing strategy, distributed computing and real-time stream processing are used to improve the data processing efficiency and data quality, meeting the real-time data analysis needs of the target terminal.

Description

Data processing method, apparatus, device, storage medium and computer program product

Technical Field

The present application relates to the field of big data technologies, and in particular, to a data processing method, apparatus, device, storage medium, and computer program product.

Background

At present, along with the exponential increase of data volume, the traditional data processing technology faces to low efficiency of mass data, so that data analysis results cannot be generated in time, business decisions are influenced, and the requirements of real-time processing and quick response are difficult to meet.

Disclosure of Invention

The application mainly aims to provide a data processing method, a device, equipment, a storage medium and a computer program product, and aims to solve the technical problems that as the data volume grows exponentially, the traditional data processing technology faces to low efficiency of mass data, the data analysis result cannot be generated in time, the business decision is affected, and the requirements of real-time processing and quick response are difficult to meet.

In order to achieve the above object, the present application provides a data processing method, which includes:

Dividing the preprocessing data stored in the database according to the data type and the expected access mode to obtain each divided data set;

determining processing logic corresponding to each divided data set;

Decomposing the processing logic into a plurality of micro-processing tasks according to the processing type, and sequencing the micro-processing tasks according to a scheduling algorithm to obtain an optimal processing sequence;

and carrying out batch processing and stream processing on the data in each divided data set according to the optimal processing sequence and the processing logic to obtain target data and analysis results.

Optionally, the step of decomposing the processing logic into a plurality of micro-processing tasks according to a processing type, and sequencing the micro-processing tasks according to a preset scheduling algorithm to obtain an optimal processing sequence includes:

identifying the processing logic to obtain key steps and dependency relationships;

Decomposing the processing logic into a plurality of micro-processing tasks according to the attribute of the key step, wherein the attribute is at least one of independence, complexity and execution time;

and determining a scheduling algorithm according to the dependency relationship, and sequencing the micro-processing tasks based on the scheduling algorithm to obtain an optimal processing sequence.

Optionally, the step of determining processing logic corresponding to each of the divided data sets includes:

Extracting keywords of each divided data set;

Classifying the keywords based on the service field, the data type and the processing requirement, and mapping the keywords to corresponding processing logic templates or flows according to classification results to obtain a mapping relation table;

And determining processing logic corresponding to each divided data set according to the keywords and the mapping relation table.

Optionally, before the step of dividing the preprocessed data stored in the storage module according to the data type and the expected access mode to obtain each divided data set, the method further includes:

Periodically collecting data from a data source based on a data routing mechanism, and preprocessing the collected data to obtain preprocessed data;

generating a hierarchical storage strategy according to the data type and the access requirement of the preprocessed data, and storing the preprocessed data into a database based on the hierarchical storage strategy.

Optionally, after the step of performing batch processing and stream processing on the data in each of the divided data sets according to the optimal processing sequence and the processing logic to obtain the target data and the analysis result, the method further includes:

comparing the analysis result with a preset data quality threshold;

generating a data quality management strategy when the integrity and consistency of the target data are smaller than the preset data quality threshold;

and cleaning and controlling the quality of the processed data based on the data quality management strategy to obtain display data.

Optionally, after the step of cleaning and quality controlling the processing data based on the data quality management policy to obtain the display data, the method further includes:

Acquiring the service requirement of a target terminal;

establishing a demand index table according to the service demand, the display data and the analysis result;

And when the requirement sent by the terminal is detected, sending the display data and the analysis result to the target terminal according to the requirement index table.

In addition, to achieve the above object, the present application also provides a data processing apparatus, including:

the data dividing module is used for dividing the preprocessed data stored in the storage module according to the data type and the expected access mode to obtain divided data sets;

the logic determining module is used for determining processing logic corresponding to each divided data set;

the sequence generation module is used for decomposing the processing logic into a plurality of processing tasks according to the processing types, and sequencing the processing tasks according to a preset scheduling algorithm to obtain an optimal processing sequence;

And the data analysis module is used for carrying out batch processing and stream processing on the data in each divided data set according to the optimal processing sequence and the processing logic to obtain a data report and an analysis result.

In addition, in order to achieve the above object, the present application also provides a data processing apparatus including: a memory, a processor and a data processing program stored on the memory and executable on the processor, which when executed by the processor implements the data processing method as described above.

In addition, in order to achieve the above object, the present application also provides a storage medium having stored thereon a data processing program which, when executed by a processor, implements the data processing method as described above.

Furthermore, to achieve the above object, the present application provides a computer program product comprising a data processing program which, when executed by a processor, implements a data processing method as described above.

The application discloses a method for dividing preprocessing data stored in a database according to data types and expected access modes to obtain divided data sets; determining processing logic corresponding to each divided data set; decomposing the processing logic into a plurality of micro-processing tasks according to the processing type, and sequencing the micro-processing tasks according to a scheduling algorithm to obtain an optimal processing sequence; and carrying out batch processing and stream processing on the data in each divided data set according to the optimal processing sequence and the processing logic to obtain target data and analysis results. According to the application, the data is divided, the corresponding processing strategy is determined, and after the processing sequence is obtained according to the processing strategy, the data processing efficiency and the data quality are improved by adopting the decentralized calculation and the real-time stream processing, so that the real-time data analysis requirement of the target terminal is met.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.

In order to more clearly illustrate the embodiments of the application or the technical solutions of the prior art, the drawings which are used in the description of the embodiments or the prior art will be briefly described, and it will be obvious to a person skilled in the art that other drawings can be obtained from these drawings without inventive effort.

FIG. 1 is a flow chart of a first embodiment of a data processing method according to the present application;

FIG. 2 is a flow chart of a second embodiment of the data processing method of the present application;

FIG. 3 is an exploded view of an e-commerce data analysis processing task according to a second embodiment of the present application;

FIG. 4 is a step dependency graph of e-commerce data analysis in accordance with a second embodiment of the present application;

FIG. 5 is a flowchart of a third embodiment of a data processing method according to the present application;

FIG. 6 is a flowchart of a fourth embodiment of a data processing method according to the present application;

FIG. 7 is a schematic block diagram of a data processing apparatus according to an embodiment of the present application;

FIG. 8 is a schematic diagram of a hardware operating environment related to a data processing method according to an embodiment of the present application.

The achievement of the objects, functional features and advantages of the present application will be further described with reference to the accompanying drawings, in conjunction with the embodiments.

Detailed Description

It should be understood that the specific embodiments described herein are merely illustrative of the technical solution of the present application and are not intended to limit the present application.

For a better understanding of the technical solution of the present application, the following detailed description will be given with reference to the drawings and the specific embodiments.

The main solutions of the embodiments of the present application are: dividing the preprocessing data stored in the database according to the data type and the expected access mode to obtain each divided data set; determining processing logic corresponding to each of the partitioned data sets; decomposing the processing logic into a plurality of micro-processing tasks according to the processing type, and sequencing the micro-processing tasks according to a scheduling algorithm to obtain an optimal processing sequence; and carrying out batch processing and stream processing on the data in each divided data set according to the optimal processing sequence and the processing logic to obtain target data and analysis results.

Because the current big data technology is widely applied in various industries, related technologies and products are endless from data acquisition, storage, processing and analysis. However, with the exponential increase of data volume, the conventional data processing technology and architecture face massive data, so that it is difficult to meet the requirements of real-time processing and quick response, the conventional data processing system needs to consume a large amount of hardware investment and maintenance cost, and the data processing efficiency is low, so that the data analysis result cannot be generated in time, the business decision is affected, and the rapid-change market requirement is difficult to adapt.

Therefore, the application provides a novel data processing platform, and realizes the real-time acquisition, processing and analysis of data by constructing an expandable and easy-to-manage data processing system, thereby improving the data processing efficiency and data quality and meeting the real-time data analysis requirements of enterprises.

The execution body of the present embodiment may be a computer service device, such as a computer, having functions of data processing, network communication, and program running, or an electronic device capable of implementing the above functions. The present embodiment and the following embodiments will be described below by taking a data processing platform as an example.

Based on this, an embodiment of the present application provides a data processing method, and referring to fig. 1, fig. 1 is a schematic flow chart of a first embodiment of the data processing method of the present application.

In this embodiment, the data processing method includes:

Step S10, dividing the preprocessing data stored in the database according to the data type and the expected access mode to obtain each divided data set.

It should be noted that the data type refers to a feature or property of each data item or field in the database, including a size, a length, a default value, a constraint condition, and the like of the data. The expected access pattern refers to an expected access manner of the target user or application program to the data stored in the database, for example, the frequency of data being queried, the complexity of query (such as whether multi-table connection, complex calculation, etc. is involved), the time distribution of data access (such as peak period, valley period, etc.), etc. all affect the access pattern.

Additionally, it should be noted that data partitioning is a process of partitioning a large amount of data into a plurality of smaller, easily managed portions according to certain criteria (such as data type, access frequency, geographic location, etc.), which may improve query efficiency, enhance data security, and facilitate data migration and maintenance. Dividing the data sets refers to each independent data set obtained after data division, and the data can be processed according to different processing methods.

It will be appreciated that the strategy for formulating data partitioning based on the analysis of data type and expected access patterns may be based on target range partitioning, such as partitioning data within a date range into different data sets; it may also be a hash-based partitioning, such as mapping data into different data sets by a hash function; it may also be based on a division of functions, such as storing data of related functions together.

Step S20, determining processing logic corresponding to each of the divided data sets.

It should be noted that the processing logic is the sum of a series of steps, rules and algorithms for processing data, which can achieve a specific business objective or data requirement.

Specifically, when determining the processing logic corresponding to each divided data set, it is necessary to know the data type and the structure of the data contained in each data set, and according to the basis of defining the processing target of each data set when dividing the data set, define which query operations need to be supported by the data set, such as simple data retrieval, complex data aggregation, real-time analysis, and consider whether the data set needs to support the operations of inserting, updating and deleting data, the frequency and complexity of these operations, and so on.

Further, in order to ensure that each data set is processed in a manner most suitable for its characteristics, step S20 may include: extracting keywords of each divided data set; classifying the keywords based on the service field, the data type and the processing requirement, and mapping the keywords to corresponding processing logic templates or flows according to classification results to obtain a mapping relation table; and determining processing logic corresponding to each divided data set according to the keywords and the mapping relation table.

And step S30, decomposing the processing logic into a plurality of micro-processing tasks according to the processing type, and sequencing the micro-processing tasks according to a scheduling algorithm to obtain an optimal processing sequence.

It should be noted that the micro-processing task is decomposed by the overall processing logic into multiple smaller, manageable processing steps during data processing in order to improve efficiency and optimize resource usage, such that each step is easier to process, test, and optimize individually. In addition, there may be dependencies between the micro-processing tasks, i.e., execution of certain tasks may require waiting for other tasks to complete their output. An optimal processing sequence refers to a task execution order that is obtained by a scheduling algorithm and that can maximize system performance or minimize some cost under given conditions.

It should be understood that during the decomposition process, each micro-processing task is ensured to be logically independent as much as possible, so as to reduce the dependency relationship among the tasks, improve the possibility of parallel processing, and simultaneously consider the position of the data on the physical storage, put together the tasks operating on the same data or adjacent data as much as possible, so as to reduce the cost of data movement.

It can be understood that, according to the dependency relationship between tasks and the availability of system resources, the micro-processing tasks may be ordered by a first-come first-served algorithm or may be ordered by a critical path method, and, of course, in order to reduce the average waiting time, the task with the shortest predicted execution time may be preferentially selected by using the shortest job priority algorithm to obtain the optimal processing sequence. The present embodiment is not limited thereto.

And step S40, carrying out batch processing and stream processing on the data in each divided data set according to the optimal processing sequence and the processing logic to obtain target data and analysis results.

It should be noted that batch processing is a data processing mode suitable for a large amount of data, and the computing resources are more effectively utilized by integrating a large amount of data and performing processing at once in a concentrated period of time. Stream processing is a data processing mode that processes data immediately after it arrives, without waiting for the entire data set to be complete and then processing, and can be used to process large amounts of data in real time and to analyze and calculate the data in real time.

Specifically, when batch processing and stream processing are performed on the data in each divided data set according to the optimal processing sequence and the processing logic, a batch processing mode is adopted for the data part which does not need to respond in real time, and defined processing logic is performed on the divided data sets one by one or in parallel according to the optimal processing sequence, wherein the processing logic comprises preprocessing steps such as data cleaning, conversion and the like, so that final target data is obtained, and the subsequent statistical analysis or machine learning model application is facilitated. And for the data part needing real-time processing, adopting a stream processing mode, and obtaining a real-time analysis result by configuring a stream processing system to receive the real-time data stream from the data source and processing and analyzing the data stream in real time according to processing logic.

The following is illustrative for ease of understanding, but is not meant to limit the application. In one example, in order to analyze purchasing behavior of a user, sales trend of goods, etc., a large-scale electronic commerce platform needs to generate massive transaction data every day for processing. The data is divided into a plurality of data sets, each data set comprises transaction records within a certain time range, duplicate transaction records are removed firstly, wrong transaction information such as wrong commodity price, time stamp and the like is corrected, missing necessary information such as user ID, commodity ID and the like is filled, and meanwhile, historical transaction data are subjected to aggregation calculation such as calculation of total sales of each commodity, purchase times of each user and the like. And classifying and counting sales trends of different time periods (such as months, quarters and years), and calculating sales of the current time period, updating a popular commodity ranking list and the like in real time.

In the present embodiment, it is disclosed that preprocessing data stored in a database is divided according to a data type and an expected access pattern, and each divided data set is obtained; determining processing logic corresponding to each of the partitioned data sets; decomposing the processing logic into a plurality of micro-processing tasks according to the processing type, and sequencing the micro-processing tasks according to a scheduling algorithm to obtain an optimal processing sequence; and carrying out batch processing and stream processing on the data in each divided data set according to the optimal processing sequence and the processing logic to obtain target data and analysis results. The data is divided, corresponding processing strategies are determined, and after a processing sequence is obtained according to the processing strategies, the data processing efficiency and the data quality are improved by adopting decentralized calculation and real-time stream processing, so that the real-time data analysis requirement of a target terminal is met.

Referring to fig. 2, fig. 2 is a flow chart of a second embodiment of the data processing method according to the present application, and based on the first embodiment, the second embodiment of the data processing method according to the present application is provided.

In a second embodiment, the step S30 includes:

Step S301, identifying the processing logic, and obtaining a key step and a dependency relationship.

It should be noted that, the key step refers to a step that has a significant impact on the final result in the decomposed processing logic, and generally relates to improvement of data quality, application of a core analysis model, support of key business decisions, and the like. Dependencies refer to interdependencies and constraints that exist between different processing steps or components, which determine the order of data processing, the direction of data flow, and the needs of resource allocation, including order dependence, data dependence, resource dependence, and the like.

And step S302, decomposing the processing logic into a plurality of micro-processing tasks according to the attribute of the key step, wherein the attribute is at least one of independence, complexity and execution time.

It should be noted that a micro-processing task is a series of smaller, more specific, more manageable and executable task elements obtained by decomposing complex processing logic during data processing and analysis.

Specifically, attribute evaluation is performed on each key step, and the evaluation step includes: determining whether the step can be performed independently without relying on the output of other steps, evaluating the computational complexity, logic complexity, and data processing complexity of the step and predicting or measuring the execution time of the step under given resource conditions. According to the evaluation result, each key step is further decomposed into a plurality of micro-processing tasks, and the principles of independence maintenance, complexity balance, execution time and the like are required to be followed during decomposition.

The following is illustrative for ease of understanding, but is not meant to limit the application. In an example, referring to fig. 3, fig. 3 is an exploded view of an e-commerce data analysis processing task according to a second embodiment of the present application. Consider an e-commerce data analysis flow in which key steps include data collection, data cleansing, feature extraction, model training, and result reporting. The key steps are decomposed following the principles of maintaining independence, balancing complexity and execution time, and the results are as follows:

1. data collection

Micro task A1: acquiring order data (high independence, medium complexity, medium execution time) from an e-commerce platform API;

Micro task A2: obtaining user information from a database (high independence, medium complexity, medium execution time);

2. Data cleansing

Micro task B1: removing duplicate items (medium independence, medium complexity, medium execution time) in order data;

Micro task B2: processing missing values and outliers in order data (medium independence, high complexity, long execution time);

micro task B3: cleaning user information data (medium independence, medium complexity, medium execution time);

3. Feature extraction

Micro task C1: extracting purchase frequency features (medium independence, medium complexity, short execution time) from order data;

Micro task C2: extracting age and gender characteristics (medium independence, medium complexity, short execution time) from the user information;

micro task C3: combining the order and the user information, calculating cross features (low independence, high complexity, medium execution time);

4. Model training

Micro task D1: data partitioning (high independence, low complexity, short execution time);

Micro task D2: selecting and initializing a model (medium independence, medium complexity, short execution time);

micro task D3: model training (low independence, high complexity, long execution time);

5. Results reporting

Micro task E1: generating model evaluation reports (low independence, medium complexity, short execution time);

Micro task E2: preparing a visual chart (low independence, medium complexity, short execution time);

Micro task E3: final reports (low independence, low complexity, medium execution time) are written.

Step S303, determining a scheduling algorithm according to the dependency relationship, and sequencing the micro-processing tasks based on the scheduling algorithm to obtain an optimal processing sequence.

It should be understood that the dependency determination scheduling algorithm may be represented by using a tree graph or a linked list according to the dependency determination scheduling algorithm, and of course, a directed acyclic graph may be provided to determine the scheduling algorithm for more clearly representing the dependency.

The following is illustrative for ease of understanding, but is not meant to limit the application. In an example, referring to fig. 4, fig. 4 is a dependency diagram of an e-commerce data analysis step according to a second embodiment of the present application. According to analysis, the dependency relationship between the current tasks is mainly based on the flow direction of data and the logic sequence of processing, and the dependency relationship is obtained based on the logic sequence as follows:

The data collection task is independent and the data cleansing task depends on the results of the data collection task. Specifically, B1 and B2 depend on the results of A1, and B3 depends on the results of A2. The feature extraction task depends on the results of the data cleansing task. C1 depends on the results of B1 and B2, C2 depends on the results of B3, and C3 depends on the results of C1 and C2. The model training task depends on the results of the feature extraction task. D1 may be performed after any of C1, C2, C3 is completed, but D2 and D3 depend on the results of all feature extraction tasks. The result reporting task depends on the results of model training, in particular D3.

Through the above dependency relationship, a DAG (DIRECTED ACYCLIC GRAPH ) can be constructed, and the execution order of tasks is determined using a topology ordering algorithm, with the following results:

1. a1 (acquiring order data from E-commerce platform API)

2. A2 (obtaining user information from a database)

3. B1, B3 (which may be performed in parallel, since B1 depends on the results of A1 and B3 depends on the results of A2)

4. B2 (performed after B1 because B2 also depends on the results of A1 and may require B1 processed data)

5. C1, C2 (which may be performed in parallel, since C1 depends on the results of B1 and B2, and C2 depends on the results of B3)

6. C3 (performed after C1 and C2)

7. D1 (which may be performed after any of C1, C2, C3 is completed)

8. D2 (executed after completion of all of C1, C2, C3)

9. D3 (performed after D2)

10. E1, E2 (which can be performed in parallel since they all depend on the outcome of D3)

11. E3 (performed after E1 and E2)

The optimal processing sequence obtained according to the execution sequence is as follows: a1, A2 > B1, B3 > B2 > C1, C2 > C3, D1 > D2 > D3 > E1, E2 > E3.

In this embodiment, the optimal processing sequence is obtained by decomposing the complex processing logic into a plurality of micro-processing tasks and sequencing and optimizing the tasks according to a preset scheduling algorithm. The efficiency and the accuracy of data processing are improved, and the flexibility and the expandability of the system are also enhanced.

Referring to fig. 5, fig. 5 is a flowchart illustrating a third embodiment of the data processing method according to the present application, and based on the first embodiment, the third embodiment of the data processing method according to the present application is provided.

In a third embodiment, the step S20 includes:

step S201, extracting keywords of each divided dataset.

It should be understood that extracting the keywords of the divided data sets may be counting the number of times each word appears in the current data set, and taking the high-frequency word in the data set as the keyword of the current data set.

It will be appreciated that extracting keywords for each partitioned dataset may also be by calculating a word-locating weight for each dataset using TF-IDF (Term Frequency-inverse document Frequency) and identifying keywords for each dataset based on the weights and probabilities.

Step S202, classifying the keywords based on the service field, the data type and the processing requirement, and mapping the keywords to corresponding processing logic templates or flows according to the classification result to obtain a mapping relation table.

It should be appreciated that prior to classification, a deep understanding of the business fields involved, including its expertise, industry terminology, business processes, etc. is required. Keywords are classified into different categories according to the field of business, data type and processing requirements, and the classification process may be rule-based (e.g., using a predefined keyword list or regular expression) or statistical-based (e.g., using a clustering algorithm).

The following is illustrative for ease of understanding, but is not meant to limit the application. In one example, table 1 is a mapping table comprising keyword categories, =instance keywords, and processing logic templates. The business fields involved include "financial services", "medical health", and "electronic commerce". Keywords can be classified into different categories according to the business domain, data type, and processing requirements, for example:

1. Financial services:

loan application (keywords: loan, interest rate, credit score);

portfolios (keywords: stocks, funds, bonds, benefits);

risk management (keywords: risk, assessment, hedging).

2. Medical health:

Disease diagnosis (keywords: symptoms, diagnosis, treatment);

Drug research (keywords: drugs, clinical trials, efficacy);

Health management (keywords: nutrition, exercise, prevention).

3. E-commerce:

product search (keywords: price, offers, inventory);

user ratings (keywords: rating, scoring, satisfaction);

Shopping cart management (keywords: shopping cart, settlement, payment).

Mapping the keywords of each category onto corresponding processing logic templates or flows according to a predefined set of processing steps or algorithms and automatic triggering according to the category of the keywords, and finally arranging the mapping relation into a table form.

Table 1_mapping table

Step S203, determining processing logic corresponding to each divided data set according to the keyword and the mapping relation table.

It will be appreciated that once a data set is categorized into a particular keyword category, the processing logic template or flow corresponding to that category may be looked up according to a mapping relationship table. Processing logic may include a series of predefined operational steps such as data cleaning, feature extraction, model training, and result analysis.

It should be appreciated that after determining the processing logic, verification is typically required to ensure its validity and accuracy, feedback may be provided by performing a test run on the small-scale data set, and if the processing logic is found to be problematic or inefficient, adjustments and optimizations may be required based on the verification results, such as modifying certain steps in the processing logic, reassigning categories of the data set, or updating the mapping tables, etc.

In this embodiment, it is disclosed to extract keywords of each divided dataset, classify the keywords based on the service field, the data type and the processing requirement, map the keywords to corresponding processing logic templates or flows according to the classification result, obtain a mapping relation table, determine processing logic corresponding to each divided dataset according to the keywords and the mapping relation table, and by customizing corresponding processing logic according to the keywords of each dataset, ensure that each dataset obtains a processing mode most suitable for its characteristics, thereby improving accuracy and relativity of the processing result.

Referring to fig. 6, fig. 6 is a flowchart of a fourth embodiment of the data processing method according to the present application, and based on the first embodiment, the fourth embodiment of the data processing method according to the present application is provided.

In a fourth embodiment, before the step S10, the method further includes:

and step S01, periodically collecting data from a data source based on a data routing mechanism, and preprocessing the collected data to obtain preprocessed data.

It should be noted that the data routing mechanism is a technology or policy used in a data system to manage and control data flows, and determines how data is transferred from a data source to a target system. A data source refers to a home location or system that provides data and may be a database, a file, a sensor network, a Web service, or any other entity capable of generating or storing data. Preprocessing data refers to data obtained by performing a series of preliminary processing operations on the originally acquired data, wherein the processing operations may include data cleaning (such as noise removal and missing value filling), data conversion (such as format conversion and unit unification), data compression, data aggregation and the like.

It should be appreciated that the data routing mechanism specifies the frequency of data collection, collection paths, data screening conditions, etc., and the system reads or retrieves data from the specified data sources (e.g., databases, files, sensors, etc.), the collection process involving the steps of interfacing with the data sources, query execution, data reception, etc.

It will be appreciated that the process of data collection may be repeated at regular intervals, which may be fixed (e.g., hourly, daily) or triggered based on specific events.

And step S02, generating a hierarchical storage strategy according to the data type and the access requirement of the preprocessed data, and storing the preprocessed data into a database based on the hierarchical storage strategy.

It should be noted that a hierarchical storage policy is a data storage management method that allocates data to different storage layers or storage media according to access frequency, importance, size, or other factors.

Specifically, before storing, it is necessary to know the type of the preprocessed data, including its size, structure, complexity and possible access pattern, and then evaluate the access requirements of the data, including access frequency, response time requirements, data recovery time targets, recovery point targets, etc., and then based on the analysis results of the data type and the access requirements, develop a hierarchical storage policy, which should determine which data should be stored on a high-performance storage layer (such as frequently accessed hot spot data), and which data may be stored on a lower-cost storage layer (such as historical data or backup data). And finally, storing the preprocessed data into a database according to the requirements of the hierarchical storage strategy.

In a fourth embodiment, after the step S40, the method further includes:

step S501, comparing the analysis result with a preset data quality threshold.

It should be noted that, the preset data quality threshold is a standard value or an acceptable range of a series of data quality indexes set according to business requirements, industry standards or internal regulations of organizations before data processing and analysis, and includes that the proportion of missing values must not exceed a certain percentage, a threshold set in an abnormal value detection algorithm, an allowable error range in data consistency check, and the like.

It should be appreciated that comparing the analysis result with a preset data quality threshold may be by comparing each indicator of the target data to determine whether it meets the preset threshold, and if so, the indicator is considered acceptable in terms of data quality; if not, a quality problem is considered to exist. And comparing the analysis result with a preset data quality threshold value, namely selecting a specific data index according to service requirements for comparison to form a data quality evaluation report.

Step S502, when the integrity and consistency of the target data are smaller than the preset data quality threshold, generating a data quality management policy.

It should be noted that, data integrity refers to accuracy and reliability of data, and ensures that data remains in an untampered, untouched, and undamaged state during input, processing, transmission, and storage. Data consistency refers to the state in which the same or associated data should remain consistent across different data sources or data systems. This includes logical, physical, and temporal consistency of the data. Data quality management policies in order to improve and ensure that data quality meets business requirements, a combination of plans, measures and flows are formulated, which generally include key elements such as data source control, data processing process optimization, data storage and access management, data quality monitoring and assessment, and problem tracking and solving.

Specifically, when the integrity and consistency of the target data are smaller than the preset data quality threshold, the root cause of the data quality problem needs to be analyzed first, including the problem of the data source, the problem in the data processing process or the problem in the data storage and access process, and the like. When the data quality management strategy is generated, specific and quantifiable data quality management targets are set according to the problem identification and analysis results, such as reducing the proportion of missing data to X, reducing the quantity of inconsistent data to Y and the like, and meanwhile, the priority order for solving the data quality problem is determined according to the service requirements and the resource limitation.

And step S503, cleaning and quality controlling the processing data based on the data quality management strategy to obtain display data.

It should be understood that data cleansing is a step of correcting or deleting errors, inconsistencies, outliers, etc. in data during data processing. Data quality control is a series of inspection and verification activities that are performed after data cleansing. Presentation data is data obtained for presentation to a target user between data processing procedures.

Further, in order to make the data management and distribution process more orderly and efficient, the step S503 further includes:

Acquiring the service requirement of a target terminal;

It should be appreciated that the demand index table is a table or database table for managing and mapping relationships between business demands and corresponding data resources, processing logic, and presentation modes, and can help a system or application to quickly respond to specific demands from different terminals or users, so as to ensure that data can be accurately and efficiently transmitted and presented.

Specifically, when a demand index table is established, various manners such as market research, user interviews, questionnaires, historical data analysis and the like are required, the business demands of target end users or related stakeholders are collected, and the demand index table should be capable of clearly mapping the relationship between each business demand and corresponding display data and analysis results, wherein the index table may include a plurality of fields such as a demand ID, a demand description, a data type, a data source, an analysis dimension, a display manner and the like. The method comprises the steps of detecting a demand request sent by a target terminal in real time through an API interface, a message queue, polling and the like, matching the received demand with a demand index table, finding out corresponding data and analysis results, and sending corresponding display data and analysis results to the target terminal in a proper mode (such as JSON, XML, charts and the like) according to the matching results.

In the embodiment, based on the data acquisition module and the data storage module, the data are acquired from different data sources in real time, the processed data are stored and transferred into the database by adopting a distributed storage technology, and meanwhile, the data management and distribution process is more orderly and efficient by establishing the demand index table. When the terminal sends the demand, the system can rapidly respond and send corresponding display data and analysis results, and high availability and reliability of the data are ensured.

It should be noted that the foregoing examples are only for understanding the present application, and are not to be construed as limiting the data processing method of the present application, and that many forms of simple transformation based on the technical concept are within the scope of the present application.

The present application also provides a data processing apparatus, referring to fig. 7, the data processing apparatus includes:

The data dividing module 10 is configured to divide the preprocessed data stored in the storage module according to the data type and the expected access mode, and obtain each divided data set;

A logic determination module 20, configured to determine processing logic corresponding to each of the divided data sets;

the sequence generating module 30 is configured to decompose the processing logic into a plurality of processing tasks according to a processing type, and order the processing tasks according to a preset scheduling algorithm to obtain an optimal processing sequence;

And the data analysis module 40 is used for carrying out batch processing and stream processing on the data in each divided data set according to the optimal processing sequence and the processing logic to obtain a data report and an analysis result.

The data processing device provided by the application can solve the technical problems that the data analysis result cannot be generated in time due to low efficiency of the traditional data processing technology facing mass data along with exponential increase of data quantity, business decision is influenced, and the requirements of real-time processing and quick response are difficult to meet by adopting the data processing method in the embodiment. Compared with the prior art, the beneficial effects of the data processing device provided by the application are the same as those of the data processing method provided by the embodiment, and other technical features in the data processing device are the same as those disclosed by the method of the embodiment, so that the description is omitted herein.

The present application provides a data processing apparatus, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the data processing method of the first embodiment.

With reference now to FIG. 8, a pictorial representation of a data processing apparatus suitable for use in implementing embodiments of the present application is shown. The data processing apparatus in the embodiment of the present application may include, but is not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (Personal DIGITAL ASSISTANT: personal digital assistants), PADs (Portable Application Description: tablet computers), PMPs (Portable MEDIA PLAYER: portable multimedia players), vehicle-mounted terminals (e.g., car navigation terminals), and the like, and fixed terminals such as digital TVs, desktop computers, and the like. The data processing device shown in fig. 8 is only an example and should not be construed as limiting the functionality and scope of use of the embodiments of the application.

As shown in fig. 8, the data processing apparatus may include a processing device 1001 (e.g., a central processing unit, a graphics processor, etc.), which may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 1002 or a program loaded from a storage device 1003 into a random access Memory (RAM: random Access Memory) 1004. In the RAM1004, various programs and data required for the operation of the data processing apparatus are also stored. The processing device 1001, the ROM1002, and the RAM1004 are connected to each other by a bus 1005. An input/output (I/O) interface 1006 is also connected to the bus. In general, the following systems may be connected to the I/O interface 1006: input devices 1007 including, for example, a touch screen, touchpad, keyboard, mouse, image sensor, microphone, accelerometer, gyroscope, and the like; an output device 1008 including, for example, a Liquid crystal display (LCD: liquid CRYSTAL DISPLAY), a speaker, a vibrator, and the like; storage device 1003 including, for example, a magnetic tape, a hard disk, and the like; and communication means 1009. The communication means 1009 may allow the data processing device to communicate wirelessly or by wire with other devices to exchange data. While a data processing apparatus having various systems is illustrated in the figures, it is to be understood that not all illustrated systems are required to be implemented or provided. More or fewer systems may alternatively be implemented or provided.

In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through a communication device, or installed from the storage device 1003, or installed from the ROM 1002. The above-described functions defined in the method of the disclosed embodiment of the application are performed when the computer program is executed by the processing device 1001.

The data processing device provided by the application adopts the data processing method in the embodiment, and can solve the technical problems that the data analysis result cannot be generated in time due to low efficiency of the traditional data processing technology facing mass data along with exponential increase of data quantity, business decision is affected, and the requirements of real-time processing and quick response are difficult to meet. Compared with the prior art, the beneficial effects of the data processing device provided by the application are the same as those of the data processing method provided by the above embodiment, and other technical features of the data processing device are the same as those disclosed in the method of the previous embodiment, and are not described in detail herein.

It is to be understood that portions of the present disclosure may be implemented in hardware, software, firmware, or a combination thereof. In the description of the above embodiments, particular features, structures, materials, or characteristics may be combined in any suitable manner in any one or more embodiments or examples.

The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

The present application provides a computer-readable storage medium having computer-readable program instructions (i.e., a computer program) stored thereon for performing the data processing method in the above-described embodiments.

The computer readable storage medium provided by the present application may be, for example, a U disk, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access Memory (RAM: random Access Memory), a Read-Only Memory (ROM: read Only Memory), an erasable programmable Read-Only Memory (EPROM: erasable Programmable Read Only Memory or flash Memory), an optical fiber, a portable compact disc Read-Only Memory (CD-ROM: CD-Read Only Memory), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this embodiment, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, or device. Program code embodied on a computer readable storage medium may be transmitted using any appropriate medium, including but not limited to: wire, fiber optic cable, RF (Radio Frequency), and the like, or any suitable combination of the foregoing.

The above-mentioned computer readable storage medium may be contained in a data processing apparatus; or may exist alone without being assembled into the data processing device.

The computer-readable storage medium carries one or more programs that, when executed by a data processing apparatus, cause the data processing apparatus to: dividing the preprocessing data stored in the database according to the data type and the expected access mode to obtain each divided data set; determining processing logic corresponding to each divided data set; decomposing the processing logic into a plurality of micro-processing tasks according to the processing type, and sequencing the micro-processing tasks according to a scheduling algorithm to obtain an optimal processing sequence; and carrying out batch processing and stream processing on the data in each divided data set according to the optimal processing sequence and the processing logic to obtain target data and analysis results.

Computer program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, smalltalk, C ++ and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of remote computers, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN: local Area Network) or a wide area network (WAN: wide Area Network), or may be connected to an external computer (for example, through the Internet using an Internet service provider).

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The modules involved in the embodiments of the present application may be implemented in software or in hardware. Wherein the name of the module does not constitute a limitation of the unit itself in some cases.

The readable storage medium provided by the application is a computer readable storage medium, and the computer readable storage medium stores computer readable program instructions (namely computer programs) for executing the data processing method, so that the technical problems that as the data volume grows exponentially, the traditional data processing technology is low in mass data efficiency, the data analysis result cannot be generated timely, the business decision is influenced, and the requirements of real-time processing and quick response are difficult to meet can be solved. Compared with the prior art, the beneficial effects of the computer readable storage medium provided by the application are the same as those of the data processing method provided by the above embodiment, and are not described herein.

The application also provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of a data processing method as described above.

The computer program product provided by the application can solve the technical problems that as the data volume grows exponentially, the traditional data processing technology faces to low efficiency of mass data, the data analysis result cannot be generated in time, the business decision is affected, and the requirements of real-time processing and quick response are difficult to meet. Compared with the prior art, the beneficial effects of the computer program product provided by the application are the same as those of the data processing method provided by the above embodiment, and are not described herein.

The foregoing description is only a partial embodiment of the present application, and is not intended to limit the scope of the present application, and all the equivalent structural changes made by the description and the accompanying drawings under the technical concept of the present application, or the direct/indirect application in other related technical fields are included in the scope of the present application.

Claims

1. A data processing method, characterized in that the data processing method comprises:

The preprocessed data stored in the database is divided according to data types and expected access patterns to obtain divided data sets;

Determining processing logic corresponding to each of the divided data sets;

Decomposing the processing logic into a plurality of micro-processing tasks according to the processing type, and sorting the micro-processing tasks according to the scheduling algorithm to obtain an optimal processing sequence;

The data in each divided data set is batch processed and stream processed according to the optimal processing sequence and the processing logic to obtain target data and analysis results.

2. The data processing method according to claim 1, wherein the step of decomposing the processing logic into a plurality of micro-processing tasks according to the processing type, and sorting the micro-processing tasks according to a preset scheduling algorithm to obtain an optimal processing sequence comprises:

Identify the processing logic to obtain key steps and dependencies;

Decomposing the processing logic into a plurality of micro-processing tasks according to attributes of the key steps, wherein the attributes are at least one of independence, complexity, and execution time;

A scheduling algorithm is determined according to the dependency relationship, and the microprocessing tasks are sorted based on the scheduling algorithm to obtain an optimal processing sequence.

3. The data processing method according to claim 1, wherein the step of determining the processing logic corresponding to each of the divided data sets comprises:

Extracting keywords from each of the divided data sets;

Classify the keywords based on business fields, data types and processing requirements, and map the keywords to corresponding processing logic templates or processes according to the classification results to obtain a mapping relationship table;

The processing logic corresponding to each divided data set is determined according to the keyword and the mapping relationship table.

4. The data processing method according to claim 1, characterized in that before the step of dividing the preprocessed data stored in the storage module according to the data type and the expected access mode to obtain each divided data set, it also includes:

Periodically collect data from a data source based on a data routing mechanism, and pre-process the collected data to obtain pre-processed data;

A tiered storage strategy is generated according to the data type and access requirements of the pre-processed data, and the pre-processed data is stored in a database based on the tiered storage strategy.

5. The data processing method according to claim 1, characterized in that after the step of performing batch processing and stream processing on the data in each divided data set according to the optimal processing sequence and the processing logic to obtain target data and analysis results, it further comprises:

comparing the analysis result with a preset data quality threshold;

When the integrity and consistency of the target data are less than the preset data quality threshold, generating a data quality management strategy;

The processed data is cleaned and quality controlled based on the data quality management strategy to obtain display data.

6. The data processing method according to claim 5, characterized in that after the step of cleaning and quality controlling the processed data based on the data quality management strategy to obtain the display data, it also includes:

Obtain the business needs of the target terminal;

Establishing a demand index table according to the business demand, the display data and the analysis result;

When the demand sent by the terminal is detected, the display data and the analysis result are sent to the target terminal according to the demand index table.

7. A data processing device, characterized in that the device comprises:

A data partitioning module is used to partition the pre-processed data stored in the storage module according to data types and expected access patterns to obtain partitioned data sets;

A logic determination module, used to determine the processing logic corresponding to each of the divided data sets;

A sequence generation module, used to decompose the processing logic into multiple processing tasks according to the processing type, and sort them according to a preset scheduling algorithm to obtain an optimal processing sequence;

The data analysis module is used to perform batch processing and stream processing on the data in each divided data set according to the optimal processing sequence and the processing logic to obtain data reports and analysis results.

8. A data processing device, characterized in that the data processing device comprises: a memory, a processor, and a data processing program stored in the memory and executable on the processor, wherein when the data processing program is executed by the processor, the data processing method according to any one of claims 1 to 6 is implemented.

9. A storage medium, characterized in that a data processing program is stored on the storage medium, and when the data processing program is executed by a processor, the data processing method according to any one of claims 1 to 6 is implemented.

10. A computer program product, characterized in that the computer program product comprises a data processing program, and when the data processing program is executed by a processor, the data processing method according to any one of claims 1 to 6 is implemented.