[go: up one dir, main page]

CN111339073A - Real-time data processing method and device, electronic equipment and readable storage medium - Google Patents

Real-time data processing method and device, electronic equipment and readable storage medium Download PDF

Info

Publication number
CN111339073A
CN111339073A CN202010114107.4A CN202010114107A CN111339073A CN 111339073 A CN111339073 A CN 111339073A CN 202010114107 A CN202010114107 A CN 202010114107A CN 111339073 A CN111339073 A CN 111339073A
Authority
CN
China
Prior art keywords
data
real
time
task
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010114107.4A
Other languages
Chinese (zh)
Inventor
李大学
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin Manyun Software Technology Co Ltd
Original Assignee
Tianjin Manyun Software Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin Manyun Software Technology Co Ltd filed Critical Tianjin Manyun Software Technology Co Ltd
Priority to CN202010114107.4A priority Critical patent/CN111339073A/en
Publication of CN111339073A publication Critical patent/CN111339073A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462Approximate or statistical queries
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2471Distributed queries
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/283Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a real-time data processing method and device, electronic equipment and a readable storage medium, and relates to the technical field of data processing. The method comprises the following steps: acquiring real-time data; determining target data matched with user data statistical requirements from the real-time data through a real-time consumption task, and consuming the target data into a data warehouse through the real-time consumption task, wherein the real-time consumption task acquires the user data statistical requirements from a pre-generated task configuration file, and the task configuration file comprises relevant parameters representing the user data statistical requirements. In the scheme, the real-time consumption task directly acquires the user data statistical requirement from the task configuration file, so that under the condition that the user data statistical requirement is changed, only the task configuration file needs to be changed, a new real-time consumption task does not need to be generated every time, the time for consuming data can be effectively reduced, and the timeliness of the data is ensured.

Description

Real-time data processing method and device, electronic equipment and readable storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a real-time data processing method and apparatus, an electronic device, and a readable storage medium.
Background
With the development of science and technology, the wide application of big data technology makes it become the key supporting technology that leads the technological progress of numerous industries and promotes the benefit to increase. According to the timeliness of data processing, big data processing systems can be divided into batch big data processing and streaming big data processing. Among them, streaming big data is also called real-time big data.
At present, widely applied stream type big data processing systems include Storm, Flink and the like, and real-time data are loaded into a high-performance memory database one by one for query through stream processing.
In order to meet different requirements of users on data and consume different data into a data warehouse in real time, developers need to develop a real-time task to consume corresponding data into the data warehouse according to different user requirements, but under the condition of large data requirements, the developers need to develop a plurality of real-time tasks, the development of the real-time tasks needs more time, and therefore the data cannot be consumed into the data warehouse in time, namely the timeliness of the data cannot be ensured.
Disclosure of Invention
An object of the embodiments of the present application is to provide a real-time data processing method, an apparatus, an electronic device, and a readable storage medium, so as to solve the problem in the prior art that different consumption tasks need to be developed to consume data in a data warehouse according to different user requirements, which results in untimely data consumption and failure to ensure timeliness of the data.
In a first aspect, an embodiment of the present application provides a real-time data processing method, where the method includes: acquiring real-time data; determining target data matched with user data statistical requirements from the real-time data through a real-time consumption task, and consuming the target data into a data warehouse through the real-time consumption task, wherein the real-time consumption task acquires the user data statistical requirements from a pre-generated task configuration file, and the task configuration file comprises relevant parameters representing the user data statistical requirements.
In the implementation process, the target data matched with the user data statistical requirement is determined from the real-time data through the real-time consumption task, and the target data is consumed in the data warehouse.
Optionally, before the acquiring the real-time data, the method further includes:
acquiring relevant parameters representing the statistical requirements of the user data;
acquiring a pre-generated task configuration template;
and generating the task configuration file based on the related parameters and the task configuration template.
In the implementation process, the task configuration file is generated based on the relevant parameters and the task configuration module, so that under the condition that the user data statistics requirement is changed, only a new task configuration file needs to be generated, a real-time consumption task does not need to be re-developed, and further the resource consumption can be reduced.
Optionally, the consuming, by the real-time consumption task, target data in the real-time data, which is matched with the statistical requirement of the user data, into a data warehouse includes:
consuming the target data to a temporary storage layer of the data warehouse through a real-time consumption task;
performing data cleaning on the target data of the temporary storage layer according to a preset time interval to obtain cleaned data, and storing the cleaned data to a detail layer of the data warehouse;
and responding to the user data statistical demand, summarizing the data stored in the detail layer based on at least one preset dimension to obtain summarized data, and storing the summarized data in a summarizing layer of the data warehouse.
In the implementation process, different data are stored in different layers of the data warehouse, so that the data can be processed in a layered mode conveniently.
Optionally, the performing data cleaning on the target data of the temporary storage layer according to a preset time interval to obtain cleaned data includes:
and performing data cleaning on the target data of the temporary storage layer according to a preset time interval through a pre-generated offline task to obtain cleaned data.
In the implementation process, under the condition of less data quantity, the resource consumption can be reduced by cleaning the data through the off-line task.
Optionally, the user data statistics requirement includes a statistics requirement of an independent visitor UV, the data statistics requirement of the response user is summarized based on at least one preset dimension on the data stored in the detail layer, and the summarized data is obtained, including:
and responding to the statistical requirement aiming at the UV, summarizing the data stored in the detail layer based on the UV, and obtaining summarized UV data.
In the implementation process, the UV data is stored through the wide table, so that the query of the UV data can be facilitated.
Optionally, the summarized UV data is stored using a wide table, and the method further includes:
and updating corresponding indexes in the wide table according to the preset time interval, wherein the corresponding indexes are corresponding parameters used for counting the UV time of the target object.
In the implementation process, the data are updated according to the preset time interval, and the incremental data in the preset time interval can be updated in time so as to collect and count the data in time.
Optionally, the responding to the data statistics requirement of the user, performing summary processing on the data stored in the detail layer based on at least one preset dimension, and obtaining the summarized data includes:
responding to the data statistics requirement of a user through a Spark distributed query engine, and summarizing the data stored in the detail layer based on at least one preset dimension to obtain summarized data.
In the implementation process, the Spark distributed query engine is a fast and general computing engine specially designed for large-scale data processing, and the Spark distributed query engine is used for summarizing data, so that the data processing efficiency can be effectively improved.
In a second aspect, an embodiment of the present application provides a real-time data processing apparatus, where the apparatus includes:
the data acquisition module is used for acquiring real-time data;
the data consumption module is used for determining target data matched with user data statistical requirements from the real-time data through a real-time consumption task and consuming the target data into a data warehouse through the real-time consumption task, wherein the real-time consumption task acquires the user data statistical requirements from a pre-generated task configuration file, and the task configuration file comprises relevant parameters representing the user data statistical requirements.
Optionally, the apparatus further comprises:
the task configuration file generation module is used for acquiring relevant parameters representing the statistical requirements of the user data; acquiring a pre-generated task configuration template; and generating the task configuration file based on the related parameters and the task configuration template.
Optionally, the data consuming module is configured to:
consuming the target data to a temporary storage layer of the data warehouse through a real-time consumption task;
performing data cleaning on the target data of the temporary storage layer according to a preset time interval to obtain cleaned data, and storing the cleaned data to a detail layer of the data warehouse;
and responding to the user data statistical demand, summarizing the data stored in the detail layer based on at least one preset dimension to obtain summarized data, and storing the summarized data in a summarizing layer of the data warehouse.
Optionally, the data consumption module is configured to perform data cleaning on the target data of the temporary storage layer according to a preset time interval through a pre-generated offline task, so as to obtain cleaned data.
Optionally, the user data statistical requirements include statistical requirements of UV of independent visitors, and the data consumption module is configured to respond to the statistical requirements for the UV, perform summarization processing on the data stored in the detail layer based on the UV, and obtain summarized UV data.
Optionally, the summarized UV data is stored by using a wide table, and the apparatus further includes:
and the data updating module is used for updating corresponding indexes in the wide table according to the preset time interval, wherein the corresponding indexes are corresponding parameters used for counting the UV time of the target object.
Optionally, the data consumption module is configured to respond to a data statistics requirement of a user through a Spark distributed query engine, and perform summarization processing on data stored in the detail layer based on at least one preset dimension to obtain summarized data.
In a third aspect, an embodiment of the present application provides an electronic device, including a processor and a memory, where the memory stores computer-readable instructions, and when the computer-readable instructions are executed by the processor, the steps in the method as provided in the first aspect are executed.
In a fourth aspect, embodiments of the present application provide a readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, performs the steps in the method as provided in the first aspect.
Additional features and advantages of the present application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the embodiments of the present application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and that those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.
Fig. 1 is a schematic structural diagram of an electronic device for performing a real-time data processing method according to an embodiment of the present application;
fig. 2 is a flowchart of a real-time data processing method according to an embodiment of the present application;
fig. 3 is a block diagram of a real-time data processing apparatus according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
The embodiment of the application provides a real-time data processing method, the method determines target data matched with user data statistical requirements from real-time data through a real-time consumption task, and consumes the target data into a data warehouse, and the real-time consumption task directly obtains the user data statistical requirements from a task configuration file, so that under the condition that the user data statistical requirements are changed, only the task configuration file needs to be changed, a new real-time consumption task does not need to be generated every time, the time for consuming data can be effectively reduced, and the timeliness of the data is ensured.
Referring to fig. 1, fig. 1 is a schematic structural diagram of an electronic device for executing a real-time data processing method according to an embodiment of the present application, where the electronic device may include: at least one processor 110, such as a CPU, at least one communication interface 120, at least one memory 130, and at least one communication bus 140. Wherein the communication bus 140 is used for realizing direct connection communication of these components. The communication interface 120 of the device in the embodiment of the present application is used for performing signaling or data communication with other node devices. The memory 130 may be a high-speed RAM memory or a non-volatile memory (e.g., at least one disk memory). Memory 130 may optionally be at least one memory device located remotely from the aforementioned processor. The memory 130 stores computer readable instructions, and when the computer readable instructions are executed by the processor 110, the electronic device executes the method process shown in fig. 2, for example, the memory 130 may be configured with a data warehouse for storing real-time data and/or target data, and the processor 110 is configured to consume the target data into the data warehouse through a real-time consumption task.
It will be appreciated that the configuration shown in fig. 1 is merely illustrative and that the electronic device may also include more or fewer components than shown in fig. 1 or have a different configuration than shown in fig. 1. The components shown in fig. 1 may be implemented in hardware, software, or a combination thereof.
Referring to fig. 2, fig. 2 is a flowchart of a real-time data processing method according to an embodiment of the present application, where the method includes the following steps:
step S110: and acquiring real-time data.
Real-time data may be understood to mean data that is current in time, for example, in an embodiment where the user order data is obtained, the current time is eight am, and that real-time data is order data that is eight am.
It is understood that the real-time data may be obtained from various data sources, such as data stored in the Mysql database, or from a service system, taking the service system as an e-commerce website as an example, a user may perform a service operation, such as a commodity purchase operation, on the e-commerce website through a user terminal, and the e-commerce website sends the obtained real-time data generated when the user purchases a commodity to the electronic device.
Alternatively, the real-time data may be understood as real-time recorded data of an operation object, for example, the e-commerce website may also send real-time delivery amount of the commodity to the electronic device as real-time data, for example, in a commodity purchasing operation, the generated real-time data includes: commodity name, buyer information, seller information, purchase quantity, purchase time, receiving place, receiving mode, express mode, payment time, browsing time and the like.
Step S120: determining target data matched with user data statistical requirements from the real-time data through a real-time consumption task, and consuming the target data into a data warehouse through the real-time consumption task, wherein the real-time consumption task acquires the user data statistical requirements from a pre-generated task configuration file, and the task configuration file comprises relevant parameters representing the user data statistical requirements.
The real-time consumption task can be understood as a pre-written program which is used for executing tasks of data extraction and data consumption, and the real-time consumption task can be written in advance by a developer, can run in real time, and can also be started to run when the consumption task is needed.
The task configuration file comprises relevant parameters for representing the user data statistical requirements, and the relevant parameters can be understood as that the user can write the relevant parameters of the user data statistical requirements into the task configuration file. Because the real-time data comprises more data, a user may only need to acquire partial data for statistics, and relevant parameters of user data statistical requirements may include data that the user needs to perform statistics, calculation rules of data participation, and the like, the real-time consumption task may read the user data statistical requirements from the task configuration file in real time, and then acquire corresponding target data from the real-time data based on the user data statistical requirements, where the target data is data participating in statistics.
For example, the user may directly submit the calculation rule or the data required to be counted to the electronic device, the electronic device may convert the data into corresponding parameters and write the corresponding parameters into the task configuration file after obtaining the data, and the real-time consumption task may know target data matching with the statistical requirements of the user data after obtaining the relevant parameters from the task configuration file.
After the target data are obtained, the target data are consumed into the data warehouse in real time through the real-time consumption task to be stored, and a user can inquire the needed target data from the data warehouse.
In the implementation process, the target data matched with the user data statistical requirement is determined from the real-time data through the real-time consumption task, and the target data is consumed in the data warehouse.
As an implementation manner, in order to facilitate obtaining the relevant parameters representing the statistical requirements of the user data, a task configuration template may be generated in advance, and the task configuration file may be generated by the task configuration module, that is, the relevant parameters representing the statistical requirements of the user data may be obtained first, then the pre-generated task configuration template may be obtained, and then the task configuration file may be generated based on the relevant parameters and the task configuration template.
The task configuration template refers to other data except for the relevant parameters for generating the task configuration file, that is, when the task configuration file is generated, only the relevant parameters representing the statistical requirements of the user data need to be input, the task configuration template does not need to be changed, and the relevant parameters and the task configuration template can generate the task configuration file. Therefore, under the condition that the statistical requirement of the user data is changed, only relevant parameters need to be changed, then a new task configuration file is generated, so that the real-time consumption task can read the relevant parameters from the new task configuration file, then target data is obtained from the real-time data based on the relevant parameters, and then the target data is consumed in the data warehouse.
In the implementation process, the task configuration file is generated based on the relevant parameters and the task configuration module, so that under the condition that the user data statistics requirement is changed, only a new task configuration file needs to be generated, a real-time consumption task does not need to be re-developed, and further the resource consumption can be reduced.
The real-time data can comprise business data and user behavior data, the business data can be binlog log data, the binlog is a binary log maintained by a Mysql database and is mainly used for recording sql statements for updating the Mysql data or potentially updating the Mysql data, data increment synchronization can be achieved by using a working mechanism of the binlog, for example, the binlog log data of the Mysql database is read and analyzed, changed data is analyzed, the data is incremental data, the incremental data can be sent to a message queue for caching, and the message queue is a kafka queue.
The data warehouse may refer to a Distributed File System (HDFS), may store target data in the HDFS in a Hive form, where the Hive is a data warehouse tool based on a Hadoop (Distributed System infrastructure), may be used to perform data extraction, transformation, and loading, and is a mechanism that may store, query, and analyze large-scale data stored in the Hadoop, may map a structured data File into a database table, provides a simple sql query function, and may convert an sql statement into a MapReduce task for execution.
The data warehouse in the embodiment of the application is a real-time warehouse, which is generally divided into a multilayer structure from a model level, for example, the data warehouse comprises a temporary storage layer, a detail layer, a summary layer and an application layer, wherein the temporary storage layer is generally used for storing original data, the detail layer is used for defining a fact and dimension table according to a theme and storing fact data with the finest granularity, the summary layer is used for summarizing the data according to different business requirements, and the application layer can be used for responding to the operation of a user and outputting the data after corresponding processing.
Therefore, in the process of consuming the target data to the data warehouse, the target data can be consumed to a temporary storage layer of the data warehouse through a real-time consumption task, then the target data of the temporary storage layer is subjected to data cleaning according to a preset time interval to obtain cleaned data, the cleaned data is stored to a detail layer of the data warehouse, then the data stored in the detail layer is subjected to summarizing processing based on at least one dimension in response to the user data statistical requirements to obtain summarized data, and the summarized data is stored in a summarizing layer of the data warehouse.
It can be understood that the real-time consumption task may screen out target data from the real-time data, then consume the target data to a temporary storage layer of the data warehouse for temporary storage, and then may perform data cleaning on the target data according to a preset time interval, where the cleaning process includes data deletion, data format conversion, loading processing, and the like, and then store the cleaned data to the detail layer.
Wherein the at least one preset dimension may include, but is not limited to: commodity name, flow, order, time, delivery volume, user etc. the user can be according to actual demand, and the nimble predetermined dimension that sets up of user, so can gather data according to predetermined dimension.
As a real-time mode, the data can be submitted to a Spark distributed query engine, and the Spark distributed query engine is used for summarizing the data stored in the detail layer based on at least one preset dimension to obtain summarized data.
The Spark distributed query engine can perform operations such as real-time analysis, aggregation query, offline calculation and the like on data, is a quick and universal computing engine specially designed for large-scale data processing, is suitable for scenes that various personnel need various different distributed platforms, supports different computations through a unified framework, can simply and inexpensively integrate various processing flows, and can effectively improve the data processing efficiency.
In addition, in order to reduce resource consumption, as a real-time mode, data cleaning may be performed on the target data of the temporary storage layer at preset time intervals through a pre-generated offline task, so as to obtain cleaned data.
It can be understood that, under the condition of a small amount of data, if the data is cleaned in real time, resource consumption may be large, so an offline task may be configured in advance, and the target data is cleaned regularly according to a certain time interval by the offline task, for example, the preset time interval may be set to 15 minutes, that is, the target data is cleaned every 15 minutes, so that the target data can be cleaned after every certain time interval, and real-time resources may be effectively saved.
In addition, in the process of data cleaning, incremental data and historical data can be merged and rewritten into partitions, that is, the data is stored in different directories, such as an hour partition and a day partition, wherein the hour partition refers to summarizing the data in the directory every other hour, and the day partition refers to summarizing the data in the directory every other day, so that the data can be stored in different partitions according to actual requirements, and the data can be summarized according to different time intervals.
In the process of data summarization, corresponding data can be summarized according to user data statistical requirements, for example, the user data statistical requirements include statistical requirements of independent visitors (UV), so that the statistical requirements for the UV can be responded, data stored in detail layers are summarized based on the UV, and summarized UV data are obtained.
For example, the detail layer may store a plurality of order data of the commodities, and the order data includes the visiting user amount (i.e. UV) of each commodity, so that the visiting user amount of each commodity can be summarized for each commodity, and the total visiting user amount of each commodity can be obtained.
In order to facilitate the query of the UV data, the summarized UV data may be stored by using a wide table, and then, corresponding indexes in the wide table are updated according to a preset time interval, where the corresponding indexes are corresponding parameters used for counting the UV of the target object.
It can be understood that the wide table may include a plurality of commodities, each of the commodities corresponds to the amount of the visiting user, the target object of the commodity may be understood as a commodity, that is, the UV amount counted for each commodity, and the corresponding index may include whether the commodity has been delivered, the UV, the delivery amount, the delivery time, and the like of the commodity, after the increment data within the preset time interval is obtained, whether the indexes are updated may be determined based on the increment data, and if the indexes are updated, the corresponding index in the wide table is also updated, for example, the delivery amount is updated from 100 to 200.
However, since the same user may visit the same product many times, in the visiting user amount counting, in order to avoid repeated counting for the same user, the wide table is a wide table of one user dimension, that is, the wide table includes user information, so that when the same user visits the same product again, it can be known whether the user is the same user through the user information recorded in the wide table, if the user is the same user, the UV amount is not updated, and if the user is not the same user, the UV amount is increased by one.
In addition, in the updating process, one data table can be generated temporarily for the incremental data, and because Hive does not support direct updating in the original table, the newly generated data table can be merged with the original wide table, so that data updating of the wide table can be realized.
In the implementation process, the data are updated according to the preset time interval, and the incremental data in the preset time interval can be updated in time so as to collect and count the data in time.
Referring to fig. 3, fig. 3 is a block diagram of a real-time data processing apparatus 200 according to an embodiment of the present disclosure, where the apparatus 200 may be a module, a program segment, or code on an electronic device. It should be understood that the apparatus 200 corresponds to the above-mentioned embodiment of the method of fig. 2, and can perform various steps related to the embodiment of the method of fig. 2, and the specific functions of the apparatus 200 can be referred to the above description, and the detailed description is appropriately omitted here to avoid redundancy.
Optionally, the apparatus 200 comprises:
a data obtaining module 210, configured to obtain real-time data;
the data consumption module 220 is configured to determine, by a real-time consumption task, target data matched with a user data statistical requirement from the real-time data, and consume, by the real-time consumption task, the target data in a data warehouse, where the real-time consumption task obtains the user data statistical requirement from a pre-generated task configuration file, and the task configuration file includes a relevant parameter representing the user data statistical requirement.
Optionally, the apparatus 200 further comprises:
the task configuration file generation module is used for acquiring relevant parameters representing the statistical requirements of the user data; acquiring a pre-generated task configuration template; and generating the task configuration file based on the related parameters and the task configuration template.
Optionally, the data consuming module 220 is configured to:
consuming the target data to a temporary storage layer of the data warehouse through a real-time consumption task;
performing data cleaning on the target data of the temporary storage layer according to a preset time interval to obtain cleaned data, and storing the cleaned data to a detail layer of the data warehouse;
and responding to the user data statistical demand, summarizing the data stored in the detail layer based on at least one preset dimension to obtain summarized data, and storing the summarized data in a summarizing layer of the data warehouse.
Optionally, the data consuming module 220 is configured to perform data cleaning on the target data of the temporary storage layer according to a preset time interval through a pre-generated offline task, so as to obtain cleaned data.
Optionally, the statistical requirement of the user data includes a statistical requirement of UV of an independent visitor, and the data consumption module 220 is configured to perform summary processing on the data stored in the detail layer based on UV in response to the statistical requirement of the UV, and obtain summarized UV data.
Optionally, the summarized UV data is stored by using a wide table, and the apparatus 200 further includes:
and the data updating module is used for updating corresponding indexes in the wide table according to the preset time interval, wherein the corresponding indexes are corresponding parameters used for counting the UV time of the target object.
Optionally, the data consumption module 220 is configured to respond to a data statistics requirement of a user through a Spark distributed query engine, and perform summarization processing on data stored in the detail layer based on at least one preset dimension to obtain summarized data.
The embodiment of the present application provides a readable storage medium, and when being executed by a processor, the computer program performs the method process performed by the electronic device in the method embodiment shown in fig. 2.
The present embodiments disclose a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the methods provided by the above-described method embodiments, for example, comprising: acquiring real-time data; determining target data matched with user data statistical requirements from the real-time data through a real-time consumption task, and consuming the target data into a data warehouse through the real-time consumption task, wherein the real-time consumption task acquires the user data statistical requirements from a pre-generated task configuration file, and the task configuration file comprises relevant parameters representing the user data statistical requirements.
In summary, the embodiments of the present application provide a real-time data processing method, an apparatus, an electronic device, and a readable storage medium, where a real-time consumption task determines target data matching a user data statistical requirement from real-time data, and consumes the target data in a data warehouse, and the real-time consumption task directly obtains the user data statistical requirement from a task configuration file, so that when the user data statistical requirement changes, only the task configuration file needs to be changed, and a new real-time consumption task does not need to be generated each time, which can effectively reduce the time for consuming data, and ensure the timeliness of data.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
In addition, units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
Furthermore, the functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (10)

1. A method of real-time data processing, the method comprising:
acquiring real-time data;
determining target data matched with user data statistical requirements from the real-time data through a real-time consumption task, and consuming the target data into a data warehouse through the real-time consumption task, wherein the real-time consumption task acquires the user data statistical requirements from a pre-generated task configuration file, and the task configuration file comprises relevant parameters representing the user data statistical requirements.
2. The method of claim 1, wherein prior to obtaining the real-time data, further comprising:
acquiring relevant parameters representing the statistical requirements of the user data;
acquiring a pre-generated task configuration template;
and generating the task configuration file based on the related parameters and the task configuration template.
3. The method of claim 1, wherein the consuming target data in the real-time data that matches the statistical requirements of the user data into a data warehouse by a real-time consumption task comprises:
consuming the target data to a temporary storage layer of the data warehouse through a real-time consumption task;
performing data cleaning on the target data of the temporary storage layer according to a preset time interval to obtain cleaned data, and storing the cleaned data to a detail layer of the data warehouse;
and responding to the user data statistical demand, summarizing the data stored in the detail layer based on at least one preset dimension to obtain summarized data, and storing the summarized data in a summarizing layer of the data warehouse.
4. The method according to claim 3, wherein the performing data cleaning on the target data of the temporary storage layer at preset time intervals to obtain cleaned data comprises:
and performing data cleaning on the target data of the temporary storage layer according to a preset time interval through a pre-generated offline task to obtain cleaned data.
5. The method of claim 3, wherein the statistical requirements of user data include statistical requirements of UV of independent visitors, and the summarizing the data stored in the detail layer based on at least one preset dimension in response to the statistical requirements of user data to obtain summarized data comprises:
and responding to the statistical requirement aiming at the UV, summarizing the data stored in the detail layer based on the UV, and obtaining summarized UV data.
6. The method of claim 5, wherein the aggregated UV data is stored using a broad table, the method further comprising:
and updating corresponding indexes in the wide table according to the preset time interval, wherein the corresponding indexes are corresponding parameters used for counting the UV time of the target object.
7. The method according to claim 3, wherein the step of summarizing the data stored in the detail layer based on at least one preset dimension in response to the statistical data demand of the user to obtain summarized data comprises:
responding to the data statistics requirement of a user through a Spark distributed query engine, and summarizing the data stored in the detail layer based on at least one preset dimension to obtain summarized data.
8. A real-time data processing apparatus, characterized in that the apparatus comprises:
the data acquisition module is used for acquiring real-time data;
the data consumption module is used for determining target data matched with user data statistical requirements from the real-time data through a real-time consumption task and consuming the target data into a data warehouse through the real-time consumption task, wherein the real-time consumption task acquires the user data statistical requirements from a pre-generated task configuration file, and the task configuration file comprises relevant parameters representing the user data statistical requirements.
9. An electronic device comprising a processor and a memory, the memory storing computer readable instructions that, when executed by the processor, perform the method of any of claims 1-7.
10. A readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-7.
CN202010114107.4A 2020-02-24 2020-02-24 Real-time data processing method and device, electronic equipment and readable storage medium Pending CN111339073A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010114107.4A CN111339073A (en) 2020-02-24 2020-02-24 Real-time data processing method and device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010114107.4A CN111339073A (en) 2020-02-24 2020-02-24 Real-time data processing method and device, electronic equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN111339073A true CN111339073A (en) 2020-06-26

Family

ID=71183723

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010114107.4A Pending CN111339073A (en) 2020-02-24 2020-02-24 Real-time data processing method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN111339073A (en)

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111784353A (en) * 2020-07-02 2020-10-16 北京白龙马云行科技有限公司 Real-time feature calculation method, order risk prediction method, device and order system
CN111966636A (en) * 2020-08-20 2020-11-20 中国农业银行股份有限公司 Data file fusion method and device
CN112000636A (en) * 2020-08-31 2020-11-27 民生科技有限责任公司 Statistical analysis method of user behavior based on Flink streaming processing
CN112100147A (en) * 2020-07-27 2020-12-18 杭州玳数科技有限公司 Method and system for realizing real-time acquisition from Bilog to HIVE based on Flink
CN112187407A (en) * 2020-09-25 2021-01-05 中国移动通信集团黑龙江有限公司 Real-time signaling message processing method, device, equipment and computer storage medium
CN112199351A (en) * 2020-09-30 2021-01-08 澳优乳业(中国)有限公司 Mobile sales data storage method and system, electronic equipment and storage medium
CN112667610A (en) * 2020-12-23 2021-04-16 树根互联技术有限公司 Real-time processing method and device for internet of things data and computer equipment
CN112785446A (en) * 2021-01-26 2021-05-11 中国人寿保险股份有限公司上海数据中心 Premium data self-correction real-time display method, system and storage medium
CN113110926A (en) * 2021-04-19 2021-07-13 上海华兴数字科技有限公司 Dynamic adjustment method, device, medium and electronic equipment for data consumption thread
CN113377829A (en) * 2021-05-14 2021-09-10 中国民生银行股份有限公司 Big data statistical method and device
CN113468246A (en) * 2021-07-20 2021-10-01 上海齐屹信息科技有限公司 Intelligent data counting and subscribing system and method based on OLTP
CN113806363A (en) * 2021-08-24 2021-12-17 北京偶数科技有限公司 Data processing method, device and storage medium
CN113886383A (en) * 2021-08-26 2022-01-04 拉卡拉支付股份有限公司 Data processing method, apparatus, electronic device, storage medium and program product
CN114510708A (en) * 2021-12-28 2022-05-17 奇安信科技集团股份有限公司 Real-time data warehouse construction, abnormal detection method, device, equipment and product
CN115033556A (en) * 2022-06-30 2022-09-09 唯品会(珠海)电子商务有限公司 Data processing method, data processing device, storage medium and computer equipment
CN116069600A (en) * 2021-11-03 2023-05-05 中国移动通信集团设计院有限公司 Method, device, equipment and storage medium for analyzing data in real time

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108427711A (en) * 2018-01-31 2018-08-21 北京三快在线科技有限公司 Real-time data warehouse, real-time data processing method, electronic equipment and storage medium
CN108492150A (en) * 2018-04-11 2018-09-04 口碑(上海)信息技术有限公司 The determination method and system of entity temperature
CN109284334A (en) * 2018-09-05 2019-01-29 拉扎斯网络科技(上海)有限公司 Real-time database synchronization method and device, electronic equipment and storage medium
CN109597842A (en) * 2018-12-14 2019-04-09 深圳前海微众银行股份有限公司 Data real-time computing technique, device, equipment and computer readable storage medium
CN109753531A (en) * 2018-12-26 2019-05-14 深圳市麦谷科技有限公司 A kind of big data statistical method, system, computer equipment and storage medium
CN110019397A (en) * 2017-12-06 2019-07-16 北京京东尚科信息技术有限公司 For carrying out the method and device of data processing
CN110197331A (en) * 2019-05-24 2019-09-03 深圳前海微众银行股份有限公司 Business data processing method, device, equipment and computer readable storage medium

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110019397A (en) * 2017-12-06 2019-07-16 北京京东尚科信息技术有限公司 For carrying out the method and device of data processing
CN108427711A (en) * 2018-01-31 2018-08-21 北京三快在线科技有限公司 Real-time data warehouse, real-time data processing method, electronic equipment and storage medium
CN108492150A (en) * 2018-04-11 2018-09-04 口碑(上海)信息技术有限公司 The determination method and system of entity temperature
CN109284334A (en) * 2018-09-05 2019-01-29 拉扎斯网络科技(上海)有限公司 Real-time database synchronization method and device, electronic equipment and storage medium
CN109597842A (en) * 2018-12-14 2019-04-09 深圳前海微众银行股份有限公司 Data real-time computing technique, device, equipment and computer readable storage medium
CN109753531A (en) * 2018-12-26 2019-05-14 深圳市麦谷科技有限公司 A kind of big data statistical method, system, computer equipment and storage medium
CN110197331A (en) * 2019-05-24 2019-09-03 深圳前海微众银行股份有限公司 Business data processing method, device, equipment and computer readable storage medium

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111784353A (en) * 2020-07-02 2020-10-16 北京白龙马云行科技有限公司 Real-time feature calculation method, order risk prediction method, device and order system
CN111784353B (en) * 2020-07-02 2024-01-30 北京白龙马云行科技有限公司 Real-time feature calculation method, order risk prediction device and order system
CN112100147A (en) * 2020-07-27 2020-12-18 杭州玳数科技有限公司 Method and system for realizing real-time acquisition from Bilog to HIVE based on Flink
CN112100147B (en) * 2020-07-27 2024-06-07 杭州玳数科技有限公司 Method and system for realizing real-time acquisition from Binlog to HIVE based on Flink
CN111966636A (en) * 2020-08-20 2020-11-20 中国农业银行股份有限公司 Data file fusion method and device
CN112000636A (en) * 2020-08-31 2020-11-27 民生科技有限责任公司 Statistical analysis method of user behavior based on Flink streaming processing
CN112187407A (en) * 2020-09-25 2021-01-05 中国移动通信集团黑龙江有限公司 Real-time signaling message processing method, device, equipment and computer storage medium
CN112199351A (en) * 2020-09-30 2021-01-08 澳优乳业(中国)有限公司 Mobile sales data storage method and system, electronic equipment and storage medium
CN112667610A (en) * 2020-12-23 2021-04-16 树根互联技术有限公司 Real-time processing method and device for internet of things data and computer equipment
CN112785446A (en) * 2021-01-26 2021-05-11 中国人寿保险股份有限公司上海数据中心 Premium data self-correction real-time display method, system and storage medium
CN113110926A (en) * 2021-04-19 2021-07-13 上海华兴数字科技有限公司 Dynamic adjustment method, device, medium and electronic equipment for data consumption thread
CN113110926B (en) * 2021-04-19 2024-12-27 上海华兴数字科技有限公司 Method, device, medium and electronic device for dynamically adjusting data consumption thread
CN113377829A (en) * 2021-05-14 2021-09-10 中国民生银行股份有限公司 Big data statistical method and device
CN113468246A (en) * 2021-07-20 2021-10-01 上海齐屹信息科技有限公司 Intelligent data counting and subscribing system and method based on OLTP
CN113806363A (en) * 2021-08-24 2021-12-17 北京偶数科技有限公司 Data processing method, device and storage medium
CN113886383A (en) * 2021-08-26 2022-01-04 拉卡拉支付股份有限公司 Data processing method, apparatus, electronic device, storage medium and program product
CN116069600A (en) * 2021-11-03 2023-05-05 中国移动通信集团设计院有限公司 Method, device, equipment and storage medium for analyzing data in real time
CN114510708A (en) * 2021-12-28 2022-05-17 奇安信科技集团股份有限公司 Real-time data warehouse construction, abnormal detection method, device, equipment and product
CN114510708B (en) * 2021-12-28 2025-02-18 奇安信科技集团股份有限公司 Real-time data warehouse construction, anomaly detection methods, devices, equipment and products
CN115033556A (en) * 2022-06-30 2022-09-09 唯品会(珠海)电子商务有限公司 Data processing method, data processing device, storage medium and computer equipment

Similar Documents

Publication Publication Date Title
CN111339073A (en) Real-time data processing method and device, electronic equipment and readable storage medium
CN111209352B (en) Data processing method and device, electronic equipment and storage medium
US10122788B2 (en) Managed function execution for processing data streams in real time
CN112988741B (en) Real-time service data merging method and device and electronic equipment
CN112506486B (en) Search system establishment method, device, electronic device and readable storage medium
CN111444158B (en) Long-term and short-term user portrait generation method, device, equipment and readable storage medium
CN113076729B (en) Method and system for importing report, readable storage medium and electronic equipment
CN105989163A (en) Data real-time processing method and system
CN114490137A (en) Method, device, electronic device and readable storage medium for real-time statistics of business data
CN111290910A (en) Log processing method, device, server and storage medium
CN112685456A (en) User access data processing method and device and computer system
CN114036174B (en) Data updating method, device, equipment and storage medium
CN119359413B (en) E-commerce order analysis and regulation method and system
CN115617480A (en) Task scheduling method, device and system and storage medium
CN108696559B (en) Stream processing method and device
CN117131059A (en) Report data processing methods, devices, equipment and storage media
CN113656369A (en) Log distributed streaming acquisition and calculation method in big data scene
KR20220115859A (en) Edge table representation of the process
CN118261703A (en) Full-link transaction view construction method and device, electronic equipment and storage medium
CN112148762A (en) Statistical method and device for real-time data flow
CN116756115A (en) Power grid multi-temporal modeling method, device and electronic equipment based on graph database
CN115328702A (en) Log data backup method, device, electronic device and storage medium
CN114490087A (en) Method and device for acquiring available time of server cluster, electronic equipment and medium
CN113836411A (en) Data processing method and device and computer equipment
CN113760836A (en) Wide table calculation method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20200626