CN111339073A - Real-time data processing method and device, electronic equipment and readable storage medium - Google Patents
Real-time data processing method and device, electronic equipment and readable storage medium Download PDFInfo
- Publication number
- CN111339073A CN111339073A CN202010114107.4A CN202010114107A CN111339073A CN 111339073 A CN111339073 A CN 111339073A CN 202010114107 A CN202010114107 A CN 202010114107A CN 111339073 A CN111339073 A CN 111339073A
- Authority
- CN
- China
- Prior art keywords
- data
- real
- time
- task
- user
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2462—Approximate or statistical queries
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2471—Distributed queries
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/283—Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Probability & Statistics with Applications (AREA)
- Fuzzy Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Computational Linguistics (AREA)
- Quality & Reliability (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The application provides a real-time data processing method and device, electronic equipment and a readable storage medium, and relates to the technical field of data processing. The method comprises the following steps: acquiring real-time data; determining target data matched with user data statistical requirements from the real-time data through a real-time consumption task, and consuming the target data into a data warehouse through the real-time consumption task, wherein the real-time consumption task acquires the user data statistical requirements from a pre-generated task configuration file, and the task configuration file comprises relevant parameters representing the user data statistical requirements. In the scheme, the real-time consumption task directly acquires the user data statistical requirement from the task configuration file, so that under the condition that the user data statistical requirement is changed, only the task configuration file needs to be changed, a new real-time consumption task does not need to be generated every time, the time for consuming data can be effectively reduced, and the timeliness of the data is ensured.
Description
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a real-time data processing method and apparatus, an electronic device, and a readable storage medium.
Background
With the development of science and technology, the wide application of big data technology makes it become the key supporting technology that leads the technological progress of numerous industries and promotes the benefit to increase. According to the timeliness of data processing, big data processing systems can be divided into batch big data processing and streaming big data processing. Among them, streaming big data is also called real-time big data.
At present, widely applied stream type big data processing systems include Storm, Flink and the like, and real-time data are loaded into a high-performance memory database one by one for query through stream processing.
In order to meet different requirements of users on data and consume different data into a data warehouse in real time, developers need to develop a real-time task to consume corresponding data into the data warehouse according to different user requirements, but under the condition of large data requirements, the developers need to develop a plurality of real-time tasks, the development of the real-time tasks needs more time, and therefore the data cannot be consumed into the data warehouse in time, namely the timeliness of the data cannot be ensured.
Disclosure of Invention
An object of the embodiments of the present application is to provide a real-time data processing method, an apparatus, an electronic device, and a readable storage medium, so as to solve the problem in the prior art that different consumption tasks need to be developed to consume data in a data warehouse according to different user requirements, which results in untimely data consumption and failure to ensure timeliness of the data.
In a first aspect, an embodiment of the present application provides a real-time data processing method, where the method includes: acquiring real-time data; determining target data matched with user data statistical requirements from the real-time data through a real-time consumption task, and consuming the target data into a data warehouse through the real-time consumption task, wherein the real-time consumption task acquires the user data statistical requirements from a pre-generated task configuration file, and the task configuration file comprises relevant parameters representing the user data statistical requirements.
In the implementation process, the target data matched with the user data statistical requirement is determined from the real-time data through the real-time consumption task, and the target data is consumed in the data warehouse.
Optionally, before the acquiring the real-time data, the method further includes:
acquiring relevant parameters representing the statistical requirements of the user data;
acquiring a pre-generated task configuration template;
and generating the task configuration file based on the related parameters and the task configuration template.
In the implementation process, the task configuration file is generated based on the relevant parameters and the task configuration module, so that under the condition that the user data statistics requirement is changed, only a new task configuration file needs to be generated, a real-time consumption task does not need to be re-developed, and further the resource consumption can be reduced.
Optionally, the consuming, by the real-time consumption task, target data in the real-time data, which is matched with the statistical requirement of the user data, into a data warehouse includes:
consuming the target data to a temporary storage layer of the data warehouse through a real-time consumption task;
performing data cleaning on the target data of the temporary storage layer according to a preset time interval to obtain cleaned data, and storing the cleaned data to a detail layer of the data warehouse;
and responding to the user data statistical demand, summarizing the data stored in the detail layer based on at least one preset dimension to obtain summarized data, and storing the summarized data in a summarizing layer of the data warehouse.
In the implementation process, different data are stored in different layers of the data warehouse, so that the data can be processed in a layered mode conveniently.
Optionally, the performing data cleaning on the target data of the temporary storage layer according to a preset time interval to obtain cleaned data includes:
and performing data cleaning on the target data of the temporary storage layer according to a preset time interval through a pre-generated offline task to obtain cleaned data.
In the implementation process, under the condition of less data quantity, the resource consumption can be reduced by cleaning the data through the off-line task.
Optionally, the user data statistics requirement includes a statistics requirement of an independent visitor UV, the data statistics requirement of the response user is summarized based on at least one preset dimension on the data stored in the detail layer, and the summarized data is obtained, including:
and responding to the statistical requirement aiming at the UV, summarizing the data stored in the detail layer based on the UV, and obtaining summarized UV data.
In the implementation process, the UV data is stored through the wide table, so that the query of the UV data can be facilitated.
Optionally, the summarized UV data is stored using a wide table, and the method further includes:
and updating corresponding indexes in the wide table according to the preset time interval, wherein the corresponding indexes are corresponding parameters used for counting the UV time of the target object.
In the implementation process, the data are updated according to the preset time interval, and the incremental data in the preset time interval can be updated in time so as to collect and count the data in time.
Optionally, the responding to the data statistics requirement of the user, performing summary processing on the data stored in the detail layer based on at least one preset dimension, and obtaining the summarized data includes:
responding to the data statistics requirement of a user through a Spark distributed query engine, and summarizing the data stored in the detail layer based on at least one preset dimension to obtain summarized data.
In the implementation process, the Spark distributed query engine is a fast and general computing engine specially designed for large-scale data processing, and the Spark distributed query engine is used for summarizing data, so that the data processing efficiency can be effectively improved.
In a second aspect, an embodiment of the present application provides a real-time data processing apparatus, where the apparatus includes:
the data acquisition module is used for acquiring real-time data;
the data consumption module is used for determining target data matched with user data statistical requirements from the real-time data through a real-time consumption task and consuming the target data into a data warehouse through the real-time consumption task, wherein the real-time consumption task acquires the user data statistical requirements from a pre-generated task configuration file, and the task configuration file comprises relevant parameters representing the user data statistical requirements.
Optionally, the apparatus further comprises:
the task configuration file generation module is used for acquiring relevant parameters representing the statistical requirements of the user data; acquiring a pre-generated task configuration template; and generating the task configuration file based on the related parameters and the task configuration template.
Optionally, the data consuming module is configured to:
consuming the target data to a temporary storage layer of the data warehouse through a real-time consumption task;
performing data cleaning on the target data of the temporary storage layer according to a preset time interval to obtain cleaned data, and storing the cleaned data to a detail layer of the data warehouse;
and responding to the user data statistical demand, summarizing the data stored in the detail layer based on at least one preset dimension to obtain summarized data, and storing the summarized data in a summarizing layer of the data warehouse.
Optionally, the data consumption module is configured to perform data cleaning on the target data of the temporary storage layer according to a preset time interval through a pre-generated offline task, so as to obtain cleaned data.
Optionally, the user data statistical requirements include statistical requirements of UV of independent visitors, and the data consumption module is configured to respond to the statistical requirements for the UV, perform summarization processing on the data stored in the detail layer based on the UV, and obtain summarized UV data.
Optionally, the summarized UV data is stored by using a wide table, and the apparatus further includes:
and the data updating module is used for updating corresponding indexes in the wide table according to the preset time interval, wherein the corresponding indexes are corresponding parameters used for counting the UV time of the target object.
Optionally, the data consumption module is configured to respond to a data statistics requirement of a user through a Spark distributed query engine, and perform summarization processing on data stored in the detail layer based on at least one preset dimension to obtain summarized data.
In a third aspect, an embodiment of the present application provides an electronic device, including a processor and a memory, where the memory stores computer-readable instructions, and when the computer-readable instructions are executed by the processor, the steps in the method as provided in the first aspect are executed.
In a fourth aspect, embodiments of the present application provide a readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, performs the steps in the method as provided in the first aspect.
Additional features and advantages of the present application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the embodiments of the present application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and that those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.
Fig. 1 is a schematic structural diagram of an electronic device for performing a real-time data processing method according to an embodiment of the present application;
fig. 2 is a flowchart of a real-time data processing method according to an embodiment of the present application;
fig. 3 is a block diagram of a real-time data processing apparatus according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
The embodiment of the application provides a real-time data processing method, the method determines target data matched with user data statistical requirements from real-time data through a real-time consumption task, and consumes the target data into a data warehouse, and the real-time consumption task directly obtains the user data statistical requirements from a task configuration file, so that under the condition that the user data statistical requirements are changed, only the task configuration file needs to be changed, a new real-time consumption task does not need to be generated every time, the time for consuming data can be effectively reduced, and the timeliness of the data is ensured.
Referring to fig. 1, fig. 1 is a schematic structural diagram of an electronic device for executing a real-time data processing method according to an embodiment of the present application, where the electronic device may include: at least one processor 110, such as a CPU, at least one communication interface 120, at least one memory 130, and at least one communication bus 140. Wherein the communication bus 140 is used for realizing direct connection communication of these components. The communication interface 120 of the device in the embodiment of the present application is used for performing signaling or data communication with other node devices. The memory 130 may be a high-speed RAM memory or a non-volatile memory (e.g., at least one disk memory). Memory 130 may optionally be at least one memory device located remotely from the aforementioned processor. The memory 130 stores computer readable instructions, and when the computer readable instructions are executed by the processor 110, the electronic device executes the method process shown in fig. 2, for example, the memory 130 may be configured with a data warehouse for storing real-time data and/or target data, and the processor 110 is configured to consume the target data into the data warehouse through a real-time consumption task.
It will be appreciated that the configuration shown in fig. 1 is merely illustrative and that the electronic device may also include more or fewer components than shown in fig. 1 or have a different configuration than shown in fig. 1. The components shown in fig. 1 may be implemented in hardware, software, or a combination thereof.
Referring to fig. 2, fig. 2 is a flowchart of a real-time data processing method according to an embodiment of the present application, where the method includes the following steps:
step S110: and acquiring real-time data.
Real-time data may be understood to mean data that is current in time, for example, in an embodiment where the user order data is obtained, the current time is eight am, and that real-time data is order data that is eight am.
It is understood that the real-time data may be obtained from various data sources, such as data stored in the Mysql database, or from a service system, taking the service system as an e-commerce website as an example, a user may perform a service operation, such as a commodity purchase operation, on the e-commerce website through a user terminal, and the e-commerce website sends the obtained real-time data generated when the user purchases a commodity to the electronic device.
Alternatively, the real-time data may be understood as real-time recorded data of an operation object, for example, the e-commerce website may also send real-time delivery amount of the commodity to the electronic device as real-time data, for example, in a commodity purchasing operation, the generated real-time data includes: commodity name, buyer information, seller information, purchase quantity, purchase time, receiving place, receiving mode, express mode, payment time, browsing time and the like.
Step S120: determining target data matched with user data statistical requirements from the real-time data through a real-time consumption task, and consuming the target data into a data warehouse through the real-time consumption task, wherein the real-time consumption task acquires the user data statistical requirements from a pre-generated task configuration file, and the task configuration file comprises relevant parameters representing the user data statistical requirements.
The real-time consumption task can be understood as a pre-written program which is used for executing tasks of data extraction and data consumption, and the real-time consumption task can be written in advance by a developer, can run in real time, and can also be started to run when the consumption task is needed.
The task configuration file comprises relevant parameters for representing the user data statistical requirements, and the relevant parameters can be understood as that the user can write the relevant parameters of the user data statistical requirements into the task configuration file. Because the real-time data comprises more data, a user may only need to acquire partial data for statistics, and relevant parameters of user data statistical requirements may include data that the user needs to perform statistics, calculation rules of data participation, and the like, the real-time consumption task may read the user data statistical requirements from the task configuration file in real time, and then acquire corresponding target data from the real-time data based on the user data statistical requirements, where the target data is data participating in statistics.
For example, the user may directly submit the calculation rule or the data required to be counted to the electronic device, the electronic device may convert the data into corresponding parameters and write the corresponding parameters into the task configuration file after obtaining the data, and the real-time consumption task may know target data matching with the statistical requirements of the user data after obtaining the relevant parameters from the task configuration file.
After the target data are obtained, the target data are consumed into the data warehouse in real time through the real-time consumption task to be stored, and a user can inquire the needed target data from the data warehouse.
In the implementation process, the target data matched with the user data statistical requirement is determined from the real-time data through the real-time consumption task, and the target data is consumed in the data warehouse.
As an implementation manner, in order to facilitate obtaining the relevant parameters representing the statistical requirements of the user data, a task configuration template may be generated in advance, and the task configuration file may be generated by the task configuration module, that is, the relevant parameters representing the statistical requirements of the user data may be obtained first, then the pre-generated task configuration template may be obtained, and then the task configuration file may be generated based on the relevant parameters and the task configuration template.
The task configuration template refers to other data except for the relevant parameters for generating the task configuration file, that is, when the task configuration file is generated, only the relevant parameters representing the statistical requirements of the user data need to be input, the task configuration template does not need to be changed, and the relevant parameters and the task configuration template can generate the task configuration file. Therefore, under the condition that the statistical requirement of the user data is changed, only relevant parameters need to be changed, then a new task configuration file is generated, so that the real-time consumption task can read the relevant parameters from the new task configuration file, then target data is obtained from the real-time data based on the relevant parameters, and then the target data is consumed in the data warehouse.
In the implementation process, the task configuration file is generated based on the relevant parameters and the task configuration module, so that under the condition that the user data statistics requirement is changed, only a new task configuration file needs to be generated, a real-time consumption task does not need to be re-developed, and further the resource consumption can be reduced.
The real-time data can comprise business data and user behavior data, the business data can be binlog log data, the binlog is a binary log maintained by a Mysql database and is mainly used for recording sql statements for updating the Mysql data or potentially updating the Mysql data, data increment synchronization can be achieved by using a working mechanism of the binlog, for example, the binlog log data of the Mysql database is read and analyzed, changed data is analyzed, the data is incremental data, the incremental data can be sent to a message queue for caching, and the message queue is a kafka queue.
The data warehouse may refer to a Distributed File System (HDFS), may store target data in the HDFS in a Hive form, where the Hive is a data warehouse tool based on a Hadoop (Distributed System infrastructure), may be used to perform data extraction, transformation, and loading, and is a mechanism that may store, query, and analyze large-scale data stored in the Hadoop, may map a structured data File into a database table, provides a simple sql query function, and may convert an sql statement into a MapReduce task for execution.
The data warehouse in the embodiment of the application is a real-time warehouse, which is generally divided into a multilayer structure from a model level, for example, the data warehouse comprises a temporary storage layer, a detail layer, a summary layer and an application layer, wherein the temporary storage layer is generally used for storing original data, the detail layer is used for defining a fact and dimension table according to a theme and storing fact data with the finest granularity, the summary layer is used for summarizing the data according to different business requirements, and the application layer can be used for responding to the operation of a user and outputting the data after corresponding processing.
Therefore, in the process of consuming the target data to the data warehouse, the target data can be consumed to a temporary storage layer of the data warehouse through a real-time consumption task, then the target data of the temporary storage layer is subjected to data cleaning according to a preset time interval to obtain cleaned data, the cleaned data is stored to a detail layer of the data warehouse, then the data stored in the detail layer is subjected to summarizing processing based on at least one dimension in response to the user data statistical requirements to obtain summarized data, and the summarized data is stored in a summarizing layer of the data warehouse.
It can be understood that the real-time consumption task may screen out target data from the real-time data, then consume the target data to a temporary storage layer of the data warehouse for temporary storage, and then may perform data cleaning on the target data according to a preset time interval, where the cleaning process includes data deletion, data format conversion, loading processing, and the like, and then store the cleaned data to the detail layer.
Wherein the at least one preset dimension may include, but is not limited to: commodity name, flow, order, time, delivery volume, user etc. the user can be according to actual demand, and the nimble predetermined dimension that sets up of user, so can gather data according to predetermined dimension.
As a real-time mode, the data can be submitted to a Spark distributed query engine, and the Spark distributed query engine is used for summarizing the data stored in the detail layer based on at least one preset dimension to obtain summarized data.
The Spark distributed query engine can perform operations such as real-time analysis, aggregation query, offline calculation and the like on data, is a quick and universal computing engine specially designed for large-scale data processing, is suitable for scenes that various personnel need various different distributed platforms, supports different computations through a unified framework, can simply and inexpensively integrate various processing flows, and can effectively improve the data processing efficiency.
In addition, in order to reduce resource consumption, as a real-time mode, data cleaning may be performed on the target data of the temporary storage layer at preset time intervals through a pre-generated offline task, so as to obtain cleaned data.
It can be understood that, under the condition of a small amount of data, if the data is cleaned in real time, resource consumption may be large, so an offline task may be configured in advance, and the target data is cleaned regularly according to a certain time interval by the offline task, for example, the preset time interval may be set to 15 minutes, that is, the target data is cleaned every 15 minutes, so that the target data can be cleaned after every certain time interval, and real-time resources may be effectively saved.
In addition, in the process of data cleaning, incremental data and historical data can be merged and rewritten into partitions, that is, the data is stored in different directories, such as an hour partition and a day partition, wherein the hour partition refers to summarizing the data in the directory every other hour, and the day partition refers to summarizing the data in the directory every other day, so that the data can be stored in different partitions according to actual requirements, and the data can be summarized according to different time intervals.
In the process of data summarization, corresponding data can be summarized according to user data statistical requirements, for example, the user data statistical requirements include statistical requirements of independent visitors (UV), so that the statistical requirements for the UV can be responded, data stored in detail layers are summarized based on the UV, and summarized UV data are obtained.
For example, the detail layer may store a plurality of order data of the commodities, and the order data includes the visiting user amount (i.e. UV) of each commodity, so that the visiting user amount of each commodity can be summarized for each commodity, and the total visiting user amount of each commodity can be obtained.
In order to facilitate the query of the UV data, the summarized UV data may be stored by using a wide table, and then, corresponding indexes in the wide table are updated according to a preset time interval, where the corresponding indexes are corresponding parameters used for counting the UV of the target object.
It can be understood that the wide table may include a plurality of commodities, each of the commodities corresponds to the amount of the visiting user, the target object of the commodity may be understood as a commodity, that is, the UV amount counted for each commodity, and the corresponding index may include whether the commodity has been delivered, the UV, the delivery amount, the delivery time, and the like of the commodity, after the increment data within the preset time interval is obtained, whether the indexes are updated may be determined based on the increment data, and if the indexes are updated, the corresponding index in the wide table is also updated, for example, the delivery amount is updated from 100 to 200.
However, since the same user may visit the same product many times, in the visiting user amount counting, in order to avoid repeated counting for the same user, the wide table is a wide table of one user dimension, that is, the wide table includes user information, so that when the same user visits the same product again, it can be known whether the user is the same user through the user information recorded in the wide table, if the user is the same user, the UV amount is not updated, and if the user is not the same user, the UV amount is increased by one.
In addition, in the updating process, one data table can be generated temporarily for the incremental data, and because Hive does not support direct updating in the original table, the newly generated data table can be merged with the original wide table, so that data updating of the wide table can be realized.
In the implementation process, the data are updated according to the preset time interval, and the incremental data in the preset time interval can be updated in time so as to collect and count the data in time.
Referring to fig. 3, fig. 3 is a block diagram of a real-time data processing apparatus 200 according to an embodiment of the present disclosure, where the apparatus 200 may be a module, a program segment, or code on an electronic device. It should be understood that the apparatus 200 corresponds to the above-mentioned embodiment of the method of fig. 2, and can perform various steps related to the embodiment of the method of fig. 2, and the specific functions of the apparatus 200 can be referred to the above description, and the detailed description is appropriately omitted here to avoid redundancy.
Optionally, the apparatus 200 comprises:
a data obtaining module 210, configured to obtain real-time data;
the data consumption module 220 is configured to determine, by a real-time consumption task, target data matched with a user data statistical requirement from the real-time data, and consume, by the real-time consumption task, the target data in a data warehouse, where the real-time consumption task obtains the user data statistical requirement from a pre-generated task configuration file, and the task configuration file includes a relevant parameter representing the user data statistical requirement.
Optionally, the apparatus 200 further comprises:
the task configuration file generation module is used for acquiring relevant parameters representing the statistical requirements of the user data; acquiring a pre-generated task configuration template; and generating the task configuration file based on the related parameters and the task configuration template.
Optionally, the data consuming module 220 is configured to:
consuming the target data to a temporary storage layer of the data warehouse through a real-time consumption task;
performing data cleaning on the target data of the temporary storage layer according to a preset time interval to obtain cleaned data, and storing the cleaned data to a detail layer of the data warehouse;
and responding to the user data statistical demand, summarizing the data stored in the detail layer based on at least one preset dimension to obtain summarized data, and storing the summarized data in a summarizing layer of the data warehouse.
Optionally, the data consuming module 220 is configured to perform data cleaning on the target data of the temporary storage layer according to a preset time interval through a pre-generated offline task, so as to obtain cleaned data.
Optionally, the statistical requirement of the user data includes a statistical requirement of UV of an independent visitor, and the data consumption module 220 is configured to perform summary processing on the data stored in the detail layer based on UV in response to the statistical requirement of the UV, and obtain summarized UV data.
Optionally, the summarized UV data is stored by using a wide table, and the apparatus 200 further includes:
and the data updating module is used for updating corresponding indexes in the wide table according to the preset time interval, wherein the corresponding indexes are corresponding parameters used for counting the UV time of the target object.
Optionally, the data consumption module 220 is configured to respond to a data statistics requirement of a user through a Spark distributed query engine, and perform summarization processing on data stored in the detail layer based on at least one preset dimension to obtain summarized data.
The embodiment of the present application provides a readable storage medium, and when being executed by a processor, the computer program performs the method process performed by the electronic device in the method embodiment shown in fig. 2.
The present embodiments disclose a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the methods provided by the above-described method embodiments, for example, comprising: acquiring real-time data; determining target data matched with user data statistical requirements from the real-time data through a real-time consumption task, and consuming the target data into a data warehouse through the real-time consumption task, wherein the real-time consumption task acquires the user data statistical requirements from a pre-generated task configuration file, and the task configuration file comprises relevant parameters representing the user data statistical requirements.
In summary, the embodiments of the present application provide a real-time data processing method, an apparatus, an electronic device, and a readable storage medium, where a real-time consumption task determines target data matching a user data statistical requirement from real-time data, and consumes the target data in a data warehouse, and the real-time consumption task directly obtains the user data statistical requirement from a task configuration file, so that when the user data statistical requirement changes, only the task configuration file needs to be changed, and a new real-time consumption task does not need to be generated each time, which can effectively reduce the time for consuming data, and ensure the timeliness of data.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
In addition, units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
Furthermore, the functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.
Claims (10)
1. A method of real-time data processing, the method comprising:
acquiring real-time data;
determining target data matched with user data statistical requirements from the real-time data through a real-time consumption task, and consuming the target data into a data warehouse through the real-time consumption task, wherein the real-time consumption task acquires the user data statistical requirements from a pre-generated task configuration file, and the task configuration file comprises relevant parameters representing the user data statistical requirements.
2. The method of claim 1, wherein prior to obtaining the real-time data, further comprising:
acquiring relevant parameters representing the statistical requirements of the user data;
acquiring a pre-generated task configuration template;
and generating the task configuration file based on the related parameters and the task configuration template.
3. The method of claim 1, wherein the consuming target data in the real-time data that matches the statistical requirements of the user data into a data warehouse by a real-time consumption task comprises:
consuming the target data to a temporary storage layer of the data warehouse through a real-time consumption task;
performing data cleaning on the target data of the temporary storage layer according to a preset time interval to obtain cleaned data, and storing the cleaned data to a detail layer of the data warehouse;
and responding to the user data statistical demand, summarizing the data stored in the detail layer based on at least one preset dimension to obtain summarized data, and storing the summarized data in a summarizing layer of the data warehouse.
4. The method according to claim 3, wherein the performing data cleaning on the target data of the temporary storage layer at preset time intervals to obtain cleaned data comprises:
and performing data cleaning on the target data of the temporary storage layer according to a preset time interval through a pre-generated offline task to obtain cleaned data.
5. The method of claim 3, wherein the statistical requirements of user data include statistical requirements of UV of independent visitors, and the summarizing the data stored in the detail layer based on at least one preset dimension in response to the statistical requirements of user data to obtain summarized data comprises:
and responding to the statistical requirement aiming at the UV, summarizing the data stored in the detail layer based on the UV, and obtaining summarized UV data.
6. The method of claim 5, wherein the aggregated UV data is stored using a broad table, the method further comprising:
and updating corresponding indexes in the wide table according to the preset time interval, wherein the corresponding indexes are corresponding parameters used for counting the UV time of the target object.
7. The method according to claim 3, wherein the step of summarizing the data stored in the detail layer based on at least one preset dimension in response to the statistical data demand of the user to obtain summarized data comprises:
responding to the data statistics requirement of a user through a Spark distributed query engine, and summarizing the data stored in the detail layer based on at least one preset dimension to obtain summarized data.
8. A real-time data processing apparatus, characterized in that the apparatus comprises:
the data acquisition module is used for acquiring real-time data;
the data consumption module is used for determining target data matched with user data statistical requirements from the real-time data through a real-time consumption task and consuming the target data into a data warehouse through the real-time consumption task, wherein the real-time consumption task acquires the user data statistical requirements from a pre-generated task configuration file, and the task configuration file comprises relevant parameters representing the user data statistical requirements.
9. An electronic device comprising a processor and a memory, the memory storing computer readable instructions that, when executed by the processor, perform the method of any of claims 1-7.
10. A readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-7.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010114107.4A CN111339073A (en) | 2020-02-24 | 2020-02-24 | Real-time data processing method and device, electronic equipment and readable storage medium |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010114107.4A CN111339073A (en) | 2020-02-24 | 2020-02-24 | Real-time data processing method and device, electronic equipment and readable storage medium |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN111339073A true CN111339073A (en) | 2020-06-26 |
Family
ID=71183723
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202010114107.4A Pending CN111339073A (en) | 2020-02-24 | 2020-02-24 | Real-time data processing method and device, electronic equipment and readable storage medium |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN111339073A (en) |
Cited By (16)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN111784353A (en) * | 2020-07-02 | 2020-10-16 | 北京白龙马云行科技有限公司 | Real-time feature calculation method, order risk prediction method, device and order system |
| CN111966636A (en) * | 2020-08-20 | 2020-11-20 | 中国农业银行股份有限公司 | Data file fusion method and device |
| CN112000636A (en) * | 2020-08-31 | 2020-11-27 | 民生科技有限责任公司 | Statistical analysis method of user behavior based on Flink streaming processing |
| CN112100147A (en) * | 2020-07-27 | 2020-12-18 | 杭州玳数科技有限公司 | Method and system for realizing real-time acquisition from Bilog to HIVE based on Flink |
| CN112187407A (en) * | 2020-09-25 | 2021-01-05 | 中国移动通信集团黑龙江有限公司 | Real-time signaling message processing method, device, equipment and computer storage medium |
| CN112199351A (en) * | 2020-09-30 | 2021-01-08 | 澳优乳业(中国)有限公司 | Mobile sales data storage method and system, electronic equipment and storage medium |
| CN112667610A (en) * | 2020-12-23 | 2021-04-16 | 树根互联技术有限公司 | Real-time processing method and device for internet of things data and computer equipment |
| CN112785446A (en) * | 2021-01-26 | 2021-05-11 | 中国人寿保险股份有限公司上海数据中心 | Premium data self-correction real-time display method, system and storage medium |
| CN113110926A (en) * | 2021-04-19 | 2021-07-13 | 上海华兴数字科技有限公司 | Dynamic adjustment method, device, medium and electronic equipment for data consumption thread |
| CN113377829A (en) * | 2021-05-14 | 2021-09-10 | 中国民生银行股份有限公司 | Big data statistical method and device |
| CN113468246A (en) * | 2021-07-20 | 2021-10-01 | 上海齐屹信息科技有限公司 | Intelligent data counting and subscribing system and method based on OLTP |
| CN113806363A (en) * | 2021-08-24 | 2021-12-17 | 北京偶数科技有限公司 | Data processing method, device and storage medium |
| CN113886383A (en) * | 2021-08-26 | 2022-01-04 | 拉卡拉支付股份有限公司 | Data processing method, apparatus, electronic device, storage medium and program product |
| CN114510708A (en) * | 2021-12-28 | 2022-05-17 | 奇安信科技集团股份有限公司 | Real-time data warehouse construction, abnormal detection method, device, equipment and product |
| CN115033556A (en) * | 2022-06-30 | 2022-09-09 | 唯品会(珠海)电子商务有限公司 | Data processing method, data processing device, storage medium and computer equipment |
| CN116069600A (en) * | 2021-11-03 | 2023-05-05 | 中国移动通信集团设计院有限公司 | Method, device, equipment and storage medium for analyzing data in real time |
Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN108427711A (en) * | 2018-01-31 | 2018-08-21 | 北京三快在线科技有限公司 | Real-time data warehouse, real-time data processing method, electronic equipment and storage medium |
| CN108492150A (en) * | 2018-04-11 | 2018-09-04 | 口碑(上海)信息技术有限公司 | The determination method and system of entity temperature |
| CN109284334A (en) * | 2018-09-05 | 2019-01-29 | 拉扎斯网络科技(上海)有限公司 | Real-time database synchronization method and device, electronic equipment and storage medium |
| CN109597842A (en) * | 2018-12-14 | 2019-04-09 | 深圳前海微众银行股份有限公司 | Data real-time computing technique, device, equipment and computer readable storage medium |
| CN109753531A (en) * | 2018-12-26 | 2019-05-14 | 深圳市麦谷科技有限公司 | A kind of big data statistical method, system, computer equipment and storage medium |
| CN110019397A (en) * | 2017-12-06 | 2019-07-16 | 北京京东尚科信息技术有限公司 | For carrying out the method and device of data processing |
| CN110197331A (en) * | 2019-05-24 | 2019-09-03 | 深圳前海微众银行股份有限公司 | Business data processing method, device, equipment and computer readable storage medium |
-
2020
- 2020-02-24 CN CN202010114107.4A patent/CN111339073A/en active Pending
Patent Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110019397A (en) * | 2017-12-06 | 2019-07-16 | 北京京东尚科信息技术有限公司 | For carrying out the method and device of data processing |
| CN108427711A (en) * | 2018-01-31 | 2018-08-21 | 北京三快在线科技有限公司 | Real-time data warehouse, real-time data processing method, electronic equipment and storage medium |
| CN108492150A (en) * | 2018-04-11 | 2018-09-04 | 口碑(上海)信息技术有限公司 | The determination method and system of entity temperature |
| CN109284334A (en) * | 2018-09-05 | 2019-01-29 | 拉扎斯网络科技(上海)有限公司 | Real-time database synchronization method and device, electronic equipment and storage medium |
| CN109597842A (en) * | 2018-12-14 | 2019-04-09 | 深圳前海微众银行股份有限公司 | Data real-time computing technique, device, equipment and computer readable storage medium |
| CN109753531A (en) * | 2018-12-26 | 2019-05-14 | 深圳市麦谷科技有限公司 | A kind of big data statistical method, system, computer equipment and storage medium |
| CN110197331A (en) * | 2019-05-24 | 2019-09-03 | 深圳前海微众银行股份有限公司 | Business data processing method, device, equipment and computer readable storage medium |
Cited By (20)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN111784353A (en) * | 2020-07-02 | 2020-10-16 | 北京白龙马云行科技有限公司 | Real-time feature calculation method, order risk prediction method, device and order system |
| CN111784353B (en) * | 2020-07-02 | 2024-01-30 | 北京白龙马云行科技有限公司 | Real-time feature calculation method, order risk prediction device and order system |
| CN112100147A (en) * | 2020-07-27 | 2020-12-18 | 杭州玳数科技有限公司 | Method and system for realizing real-time acquisition from Bilog to HIVE based on Flink |
| CN112100147B (en) * | 2020-07-27 | 2024-06-07 | 杭州玳数科技有限公司 | Method and system for realizing real-time acquisition from Binlog to HIVE based on Flink |
| CN111966636A (en) * | 2020-08-20 | 2020-11-20 | 中国农业银行股份有限公司 | Data file fusion method and device |
| CN112000636A (en) * | 2020-08-31 | 2020-11-27 | 民生科技有限责任公司 | Statistical analysis method of user behavior based on Flink streaming processing |
| CN112187407A (en) * | 2020-09-25 | 2021-01-05 | 中国移动通信集团黑龙江有限公司 | Real-time signaling message processing method, device, equipment and computer storage medium |
| CN112199351A (en) * | 2020-09-30 | 2021-01-08 | 澳优乳业(中国)有限公司 | Mobile sales data storage method and system, electronic equipment and storage medium |
| CN112667610A (en) * | 2020-12-23 | 2021-04-16 | 树根互联技术有限公司 | Real-time processing method and device for internet of things data and computer equipment |
| CN112785446A (en) * | 2021-01-26 | 2021-05-11 | 中国人寿保险股份有限公司上海数据中心 | Premium data self-correction real-time display method, system and storage medium |
| CN113110926A (en) * | 2021-04-19 | 2021-07-13 | 上海华兴数字科技有限公司 | Dynamic adjustment method, device, medium and electronic equipment for data consumption thread |
| CN113110926B (en) * | 2021-04-19 | 2024-12-27 | 上海华兴数字科技有限公司 | Method, device, medium and electronic device for dynamically adjusting data consumption thread |
| CN113377829A (en) * | 2021-05-14 | 2021-09-10 | 中国民生银行股份有限公司 | Big data statistical method and device |
| CN113468246A (en) * | 2021-07-20 | 2021-10-01 | 上海齐屹信息科技有限公司 | Intelligent data counting and subscribing system and method based on OLTP |
| CN113806363A (en) * | 2021-08-24 | 2021-12-17 | 北京偶数科技有限公司 | Data processing method, device and storage medium |
| CN113886383A (en) * | 2021-08-26 | 2022-01-04 | 拉卡拉支付股份有限公司 | Data processing method, apparatus, electronic device, storage medium and program product |
| CN116069600A (en) * | 2021-11-03 | 2023-05-05 | 中国移动通信集团设计院有限公司 | Method, device, equipment and storage medium for analyzing data in real time |
| CN114510708A (en) * | 2021-12-28 | 2022-05-17 | 奇安信科技集团股份有限公司 | Real-time data warehouse construction, abnormal detection method, device, equipment and product |
| CN114510708B (en) * | 2021-12-28 | 2025-02-18 | 奇安信科技集团股份有限公司 | Real-time data warehouse construction, anomaly detection methods, devices, equipment and products |
| CN115033556A (en) * | 2022-06-30 | 2022-09-09 | 唯品会(珠海)电子商务有限公司 | Data processing method, data processing device, storage medium and computer equipment |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN111339073A (en) | Real-time data processing method and device, electronic equipment and readable storage medium | |
| CN111209352B (en) | Data processing method and device, electronic equipment and storage medium | |
| US10122788B2 (en) | Managed function execution for processing data streams in real time | |
| CN112988741B (en) | Real-time service data merging method and device and electronic equipment | |
| CN112506486B (en) | Search system establishment method, device, electronic device and readable storage medium | |
| CN111444158B (en) | Long-term and short-term user portrait generation method, device, equipment and readable storage medium | |
| CN113076729B (en) | Method and system for importing report, readable storage medium and electronic equipment | |
| CN105989163A (en) | Data real-time processing method and system | |
| CN114490137A (en) | Method, device, electronic device and readable storage medium for real-time statistics of business data | |
| CN111290910A (en) | Log processing method, device, server and storage medium | |
| CN112685456A (en) | User access data processing method and device and computer system | |
| CN114036174B (en) | Data updating method, device, equipment and storage medium | |
| CN119359413B (en) | E-commerce order analysis and regulation method and system | |
| CN115617480A (en) | Task scheduling method, device and system and storage medium | |
| CN108696559B (en) | Stream processing method and device | |
| CN117131059A (en) | Report data processing methods, devices, equipment and storage media | |
| CN113656369A (en) | Log distributed streaming acquisition and calculation method in big data scene | |
| KR20220115859A (en) | Edge table representation of the process | |
| CN118261703A (en) | Full-link transaction view construction method and device, electronic equipment and storage medium | |
| CN112148762A (en) | Statistical method and device for real-time data flow | |
| CN116756115A (en) | Power grid multi-temporal modeling method, device and electronic equipment based on graph database | |
| CN115328702A (en) | Log data backup method, device, electronic device and storage medium | |
| CN114490087A (en) | Method and device for acquiring available time of server cluster, electronic equipment and medium | |
| CN113836411A (en) | Data processing method and device and computer equipment | |
| CN113760836A (en) | Wide table calculation method and device |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| RJ01 | Rejection of invention patent application after publication | ||
| RJ01 | Rejection of invention patent application after publication |
Application publication date: 20200626 |