Data processing method and device
Technical field
This application relates to the technical field of big data, and in particular to a data processing method and device.
Background art
Currently, the data access platform serves as an important data entry point for a big data platform and is mainly used for data source access and data caching preprocessing. Data source access mainly involves collecting data from online messages, offline files, and log files such as Binlog. Data caching preprocessing mainly includes adaptation of data source formats, encrypted transmission of data, message caching, and subscription distribution.
In the prior art, users must implement data access through a distributed deployment of Flume-NG, Logstash, or Scribe. Each user therefore has to make their own technology selection, adopt different technical solutions for different collection or preprocessing tasks, and complete the underlying software implementation themselves. As a result, the development process is cumbersome and complex, which makes the data access platform relatively costly to use.
It can be seen that, in the prior art, the development process a user faces when using a data access platform is cumbersome and complex, which makes the data access platform relatively costly to use.
Summary of the invention
The embodiments of the present application provide a data processing method and device to solve the problem in the prior art that the development process a user faces when using a data access platform is cumbersome and complex, which makes the data access platform relatively costly to use.
A data processing method provided by an embodiment of the present application is applied to a data access platform comprising multiple computing frameworks, and includes:
receiving, by the data access platform, a data processing request, where the data processing request carries address information of source data and a processing task to be executed on the source data;
obtaining the source data according to the address information of the source data;
determining a computing framework for executing the processing task according to attribute information of the source data and the load status of each computing framework in the data access platform, where the attribute information of the source data includes at least data source, delay size, total data volume, and data quality;
submitting the processing task to the computing framework, so that the computing framework processes the source data.
A data processing device provided by an embodiment of the present application is located in a data access platform comprising multiple computing frameworks, and includes:
a receiving module, configured to receive a data processing request, where the data processing request carries address information of source data and a processing task to be executed on the source data;
an obtaining module, configured to obtain the source data according to the address information of the source data;
a computing framework selection module, configured to determine a computing framework for executing the processing task according to attribute information of the source data and the load status of each computing framework in the data access platform, where the attribute information of the source data includes at least data source, delay size, total data volume, and data quality;
a submission module, configured to submit the processing task to the computing framework, so that the computing framework processes the source data.
An electronic device provided by an embodiment of the present application includes at least one processing unit and at least one storage unit, where the storage unit stores program code, and when the program code is executed by the processing unit, the electronic device performs the steps of the above data processing method.
A computer-readable storage medium provided by an embodiment of the present application includes program code, and when the program code runs on an electronic device, the electronic device performs the steps of the above data processing method.
In the embodiments of the present application, the data access platform receives a data processing request that carries address information of source data and a processing task to be executed on the source data. It then obtains the source data according to the address information, determines a computing framework for executing the processing task according to the attribute information of the source data and the load status of each computing framework in the data access platform, and submits the processing task to that computing framework, which processes the source data. In this way, a user only needs to write the address information of the source data and the processing task to be performed on it into the data processing request in order to issue a command to the data access platform. The user no longer has to attend to the cumbersome underlying technical implementation, which reduces both the difficulty and the cost of using the data access platform.
Brief description of the drawings
Fig. 1 is a flowchart of the data processing method provided by an embodiment of the present application;
Fig. 2 is a structural diagram of the data processing device provided by an embodiment of the present application;
Fig. 3 is a schematic diagram of the hardware structure of an electronic device for implementing the data processing method provided by an embodiment of the present application.
Detailed description of the embodiments
The embodiments of the present application aim to shield users of the data access platform from the software implementation of the underlying collection and computing frameworks, and to provide users with data definition, data collection, and data consumption interfaces. This simplifies the distributed deployment and configuration work that users must complete in the prior art and gives the data access platform unified external management. The platform can also provide some general-purpose preprocessing capabilities (data extraction, data desensitization, data merging, and so on), so that users no longer need to do low-level development, which lowers the difficulty of using the data access platform.
The embodiments of the present application are described in further detail below with reference to the accompanying drawings.
Embodiment one
Fig. 1 shows a flowchart of the data processing method provided by an embodiment of the present application. The method is applied to a data access platform comprising multiple computing frameworks and includes the following steps:
S101: The data access platform receives a data processing request, where the data processing request carries address information of source data and a processing task to be executed on the source data.
The data processing request is a request in XML format. Parsing the XML file yields the address information of the source data carried in the request and the processing task to be executed on the source data. Processing tasks that the data access platform may be asked to execute include, for example, data collection, encrypted transmission, and subscription distribution.
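The parsing step in S101 can be illustrated with a minimal sketch. The text only states that the request is XML, so the element and attribute names below (request, source, address, task) and the example address are placeholders assumed for illustration, not the platform's actual schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical request body: every element and attribute name here is an
# assumption made for this sketch, not a format given in the text.
REQUEST_XML = """
<request>
  <source type="FTP" address="ftp://example.invalid/logs/app.log"/>
  <task>data collection</task>
  <task>encrypted transmission</task>
</request>
"""

def parse_request(xml_text):
    """Extract the source address and the list of requested processing tasks."""
    root = ET.fromstring(xml_text.strip())
    source = root.find("source")
    if source is None or not source.get("address"):
        raise ValueError("request carries no source address")
    tasks = [t.text.strip() for t in root.findall("task") if t.text]
    return {
        "source_type": source.get("type"),
        "source_address": source.get("address"),
        "tasks": tasks,
    }

parsed = parse_request(REQUEST_XML)
```

A request without a source address is rejected early, since the later steps (fetching the data and choosing a framework) both depend on it.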
S102: The source data is obtained according to the address information of the source data.
S103: A computing framework for executing the processing task is determined according to the attribute information of the source data and the load status of each computing framework in the data access platform.
Optionally, the attribute information of the source data includes at least data source, delay size, total data volume, and data quality. The data source may be, for example, File Transfer Protocol (FTP), Hadoop Distributed File System (HDFS), Kafka, or Serial Data Transport Protocol (SDTP). The delay size is the time that data has to wait during transmission. The total data volume is the amount of data that the computing framework needs to process. The data quality is determined by the number of records allowed per execution: if only a few records are allowed per execution, for example 1, the data quality is determined to be high; if many records are allowed per execution, for example 1000, the data quality is determined to be low.
In a specific implementation, after the source data is obtained, the processing mode for the source data can be determined according to its attribute information, where the processing mode is either batch processing or real-time processing. The computing framework for executing the processing task is then determined according to the processing mode for the source data and the load status of each computing framework in the data access platform.
Specifically, if the data source is FTP or HDFS, the source data can be processed in batch mode; if the data source is Kafka or SDTP, the source data can be processed in real-time mode. If the delay of the data is determined to be large, for example greater than 10 s, the source data can be processed in batch mode; otherwise, it can be processed in real-time mode. If the total data volume is determined to be large, for example greater than 1 GB, the source data can be processed in batch mode; otherwise, it can be processed in real-time mode. If the data quality is determined to be high, the source data can be processed in real-time mode; otherwise, it can be processed in batch mode.
Moreover, provided they do not conflict, the above ways of determining the processing mode for the source data can be used in combination; the specific combination is not limited here.
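The four rules above, and their combination when they do not conflict, can be sketched as follows. This is only an illustration under stated assumptions: the class and function names are hypothetical, the cut-off for "high quality" (high_quality_max_records) is assumed because the text gives only the examples 1 and 1000, and treating a disagreement between rules as an error is one possible reading of "provided they do not conflict".

```python
from dataclasses import dataclass

BATCH = "batch"
REALTIME = "realtime"

@dataclass
class SourceAttributes:
    source: str            # e.g. "FTP", "HDFS", "Kafka", "SDTP"
    delay_seconds: float   # time the data waits during transmission
    total_bytes: int       # volume the computing framework must process
    records_per_run: int   # records allowed per execution (fewer = higher quality)

def mode_by_source(attrs):
    name = attrs.source.upper()
    if name in {"FTP", "HDFS"}:
        return BATCH
    if name in {"KAFKA", "SDTP"}:
        return REALTIME
    return None  # the text gives no rule for other sources

def mode_by_delay(attrs):
    return BATCH if attrs.delay_seconds > 10 else REALTIME

def mode_by_volume(attrs):
    return BATCH if attrs.total_bytes > 1024 ** 3 else REALTIME

def mode_by_quality(attrs, high_quality_max_records=1):
    # Assumed cut-off: at most this many records per run counts as high quality.
    return REALTIME if attrs.records_per_run <= high_quality_max_records else BATCH

def decide_mode(attrs, rules):
    """Apply the chosen rules; accept the result only if they all agree."""
    votes = {rule(attrs) for rule in rules} - {None}
    if len(votes) != 1:
        raise ValueError("rules conflict or no rule applies")
    return votes.pop()

ALL_RULES = [mode_by_source, mode_by_delay, mode_by_volume, mode_by_quality]
stream = SourceAttributes("Kafka", 0.5, 10 * 1024 ** 2, 1)
offline = SourceAttributes("HDFS", 30.0, 5 * 1024 ** 3, 1000)
```

Passing a shorter rule list, for example only mode_by_source, corresponds to using a single criterion on its own.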
Further, if it is determined that the source data is to be processed in real-time mode, the computing frameworks in the data access platform that process data streams in real-time stream processing mode are identified, and a computing framework whose load is below a first threshold is selected from them as the computing framework for executing the processing task. If it is determined that the source data is to be processed in batch mode, the computing frameworks in the data access platform that process data in batch mode are identified, and a computing framework whose load is below a second threshold is selected from them as the computing framework for executing the processing task.
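The threshold-based selection can be sketched as below. The threshold values, the representation of load as a fraction between 0 and 1, and the choice of the least-loaded candidate when several frameworks qualify are all assumptions made for this sketch; the text only requires that the selected framework's load be below the relevant threshold.

```python
from dataclasses import dataclass

@dataclass
class Framework:
    name: str
    mode: str    # "realtime" or "batch"
    load: float  # assumed to be a fraction in [0, 1] reported by monitoring

def select_framework(frameworks, mode, first_threshold=0.7, second_threshold=0.8):
    """Pick a framework of the requested mode whose load is below its threshold."""
    # The first threshold applies to real-time frameworks, the second to batch ones.
    threshold = first_threshold if mode == "realtime" else second_threshold
    candidates = [f for f in frameworks if f.mode == mode and f.load < threshold]
    if not candidates:
        return None  # no framework qualifies; the text does not cover this case
    # Assumption: prefer the least-loaded qualifying framework.
    return min(candidates, key=lambda f: f.load)

FRAMEWORKS = [
    Framework("Storm", "realtime", 0.90),
    Framework("Spark Streaming", "realtime", 0.40),
    Framework("MR", "batch", 0.50),
    Framework("Spark", "batch", 0.30),
]
chosen = select_framework(FRAMEWORKS, "realtime")
```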
S104: The processing task is submitted to the determined computing framework, and the computing framework processes the source data.
Optionally, the data processing request also includes a destination address, and after the computing framework has processed the source data, the generated data can be stored at the destination address.
In addition, users can customize the computing framework and introduce a computing framework that they have deployed themselves. For example, an engine label can be added to the access interface to define the type of computing framework.
In the above process, before the computing framework for executing the processing task is determined according to the attribute information of the source data and the load status of each computing framework in the data access platform, it can first be determined that no user-defined computing framework exists in the data access platform. If it is determined that a user-defined computing framework does exist in the data access platform, then when the computing framework for executing the processing task is determined, a deploy label can be added to introduce the computing framework deployed by the user, and the computing frameworks in the data access platform are no longer selected.
In the embodiments of the present application, the data access platform receives a data processing request that carries address information of source data and a processing task to be executed on the source data. It then obtains the source data according to the address information, determines a computing framework for executing the processing task according to the attribute information of the source data and the load status of each computing framework in the data access platform, and submits the processing task to that computing framework, which processes the source data. In this way, a user only needs to write the address information of the source data and the processing task to be performed on it into the data processing request in order to issue a command to the data access platform. The user no longer has to attend to the cumbersome underlying technical implementation, which reduces both the difficulty and the cost of using the data access platform.
Embodiment two
The data access platform in the embodiments of the present application includes multiple computing frameworks, such as MR, Spark, and Storm, and provides a general-purpose solution for a real-time data access system based on web services. It mainly covers three functions: data access, data preprocessing, and data storage. Data access mainly provides encrypted access to data from sources such as offline files (FTP/HDFS), online messages (HTTP/Kafka/SDTP), and log files (MySQL Binlog/Syslog/log4j/logback). Data preprocessing mainly provides data source definition (field separator, field description, data source description, field type), desensitization of data source fields, subscription distribution of data sources, and data verification. Data storage mainly provides data storage based on mainstream message queue mechanisms such as Kafka/MetaQ, and can expose well-encapsulated external interfaces for users.
In a specific implementation, the data access platform mainly includes three external interfaces: the access interface, the definition interface, and the consume interface. These interfaces present a unified configuration template to the outside. A user can invoke them through Simple Object Access Protocol (SOAP) requests and thereby issue commands to the data access platform.
Specifically, the access interface mainly specifies the origin of the data source, the target storage location of the data source, the logical processing for data preprocessing, and the binding to the data source definition. The XML format of the access interface is as follows:
Here, the Source tag defines the origin of the data. Within it, the Type label identifies the type of the data source, such as FTP/HDFS/Kafka/HTTP/SDTP, and the param label identifies other related attributes; for FTP, for example, these include Server, Port, UserName, Passwd, FilePath, and isGZ. The Sink tag defines the cache address used after data processing is complete, where Type mainly includes Kafka/Redis/Elasticsearch/FTP/HDFS and so on. The Interceptor tag identifies the logical processing for data preprocessing, where Type mainly includes field merging (concat), regular-expression extraction (regularExtractor), field extraction (indexExtractor), field desensitization (desensitization) based on types such as md5/Base64/SHA256, and so on. The dataSource label binds the data access process to the definition of the data source. If the named data source does not exist, the platform can prompt the user to define the data source format first and then carry out data access.
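The Interceptor types named above can be illustrated with a minimal sketch of what each preprocessing step might do. The function signatures and parameters are assumptions for illustration; the text names the interceptor types (concat, regularExtractor, indexExtractor, desensitization) and the md5/Base64/SHA256 options but does not specify their behavior in detail.

```python
import base64
import hashlib
import re

def concat(fields, indexes, joiner=""):
    """Field merging: join the selected fields into a single value."""
    return joiner.join(fields[i] for i in indexes)

def regular_extractor(text, pattern):
    """Regular-expression extraction: return the first capture group, or None.

    The pattern is expected to contain at least one capture group.
    """
    match = re.search(pattern, text)
    return match.group(1) if match else None

def index_extractor(fields, index):
    """Field extraction: return the field at the given position."""
    return fields[index]

def desensitization(value, method="md5"):
    """Field desensitization using one of the methods named in the text."""
    data = value.encode("utf-8")
    if method == "md5":
        return hashlib.md5(data).hexdigest()
    if method == "SHA256":
        return hashlib.sha256(data).hexdigest()
    if method == "Base64":
        # Base64 is a reversible encoding, so it only obscures the value.
        return base64.b64encode(data).decode("ascii")
    raise ValueError(f"unsupported desensitization method: {method}")

record = "alice|13800000000|login".split("|")
masked = desensitization(index_extractor(record, 1), "SHA256")
```

Note that md5 and SHA256 are one-way digests while Base64 can be decoded, so the three options offer different levels of protection.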
The definition interface mainly defines the format of the data source and the related field information. The XML format of the definition interface is as follows:
Here, the outermost label is dataSource, which binds the data access process to the definition of the data source. Within it, name is a unique index that identifies a single data source; the fieldDelimiter label defines the field separator; the length label defines the number of fields; the description label defines the data source description; and the fields label enumerates all fields, mainly including the field name (name label), field index position (index label), field description (description label), field type (type label), and so on.
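To show how such a definition might be used once parsed, the sketch below models the described labels as plain data classes and applies them to split one raw record. The class names, the set of type names in CASTS, and the validation against the declared field count are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class FieldDef:
    name: str              # name label
    index: int             # index label: position of the field in a record
    type: str              # type label
    description: str = ""  # description label

@dataclass
class DataSourceDef:
    name: str              # unique index of the data source
    field_delimiter: str   # fieldDelimiter label
    length: int            # length label: number of fields per record
    fields: list           # fields label: list of FieldDef
    description: str = ""

# Assumed type names; the text does not list the supported field types.
CASTS = {"string": str, "int": int, "double": float}

def parse_record(definition, line):
    """Split a raw record by the delimiter and convert the declared fields."""
    parts = line.split(definition.field_delimiter)
    if len(parts) != definition.length:
        raise ValueError(f"expected {definition.length} fields, got {len(parts)}")
    return {f.name: CASTS[f.type](parts[f.index]) for f in definition.fields}

access_log = DataSourceDef(
    name="access_log",
    field_delimiter="|",
    length=3,
    fields=[FieldDef("user", 0, "string"), FieldDef("bytes", 2, "int")],
)
row = parse_record(access_log, "alice|GET|512")
```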
The consume interface mainly lets the user consume data from the data source. The XML format of the consume interface is as follows:
Here, the name label is the name of the data source, batch is the number of records that can be fetched at one time, and BatchDurtition is the maximum time allowed for fetching the data.
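The interplay of batch and BatchDurtition can be sketched as a fetch that returns when either limit is reached. This is a sketch under assumptions: an in-memory queue stands in for the real data store, the duration is taken to be in seconds (the text does not state a unit), and the function name is hypothetical.

```python
import queue
import time

def consume(source_queue, batch, batch_duration):
    """Return up to `batch` records, waiting at most `batch_duration` seconds."""
    deadline = time.monotonic() + batch_duration
    records = []
    while len(records) < batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # time limit reached before the batch was full
        try:
            records.append(source_queue.get(timeout=remaining))
        except queue.Empty:
            break  # nothing more arrived within the time limit
    return records

q = queue.Queue()
for i in range(5):
    q.put(i)
first = consume(q, batch=3, batch_duration=0.2)
second = consume(q, batch=10, batch_duration=0.1)
```

The first call returns as soon as three records are available; the second returns the remaining records once the time limit expires.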
The data access platform in the embodiments of the present application supports a variety of computing frameworks, including both real-time computing frameworks and batch computing frameworks. Real-time computing frameworks include, for example, Storm and Spark Streaming; batch computing frameworks include, for example, MR and Spark.
In a specific implementation, for an XML request submitted by the REST layer, the data access platform can first parse the XML file to obtain the address information of the source data and the processing task to be executed on it, and then obtain the source data according to the address information. It then determines the processing mode for the source data according to its attribute information, such as data source, delay size, total data volume, and data quality, and determines the computing framework for executing the processing task according to that processing mode.
For example, offline data from FTP or HDFS can be handed to MR/Spark for batch processing, while tasks from Kafka or SDTP can be handled by real-time computing frameworks such as Storm/Spark Streaming.
In a specific implementation, the load of each computing framework in the data access platform can also be obtained in real time. When a computing framework is selected for the processing task, the load of each computing framework can also be taken into account, so as to balance the load across the computing frameworks and optimize the performance of the data access platform as far as possible.
Further, after the computing framework has processed the source data, the generated data can also be stored at the destination address, which is likewise obtained by parsing the XML file.
Optionally, the data access platform can also let users choose a computing framework themselves and introduce a computing framework that they have deployed. For example, the type of computing framework can be defined by adding an engine label to the access interface. When a computing framework is selected for a subtask, if it is determined that a deploy label has been added to the access interface, the service deployed by the user is introduced; otherwise, a computing framework provided by the data access platform is used.
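The branch between a user-deployed service and a platform framework can be sketched as follows. The text names the engine and deploy labels but not their placement or content, so the XML layout, the idea that deploy holds an endpoint, and the example host are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

def resolve_engine(access_xml, choose_platform_framework):
    """Use the user's own deployment if a deploy label is present."""
    root = ET.fromstring(access_xml.strip())
    deploy = root.find("deploy")
    if deploy is not None:
        # Assumed layout: engine names the framework type, deploy its endpoint.
        return {
            "origin": "user",
            "engine": root.findtext("engine", default="").strip(),
            "endpoint": (deploy.text or "").strip(),
        }
    # No deploy label: fall back to the platform's own selection logic.
    return {"origin": "platform", "engine": choose_platform_framework()}

USER_XML = """
<access>
  <engine>Flink</engine>
  <deploy>flink.example.invalid:8081</deploy>
</access>
"""
user_choice = resolve_engine(USER_XML, lambda: "Spark")
platform_choice = resolve_engine("<access></access>", lambda: "Spark")
```

Passing the platform's selection logic as a callable keeps the fallback lazy: it is evaluated only when no user deployment is declared.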
In a specific implementation, for each computing framework node in the data access platform, agents are used to monitor the tasks submitted on that node and the load of the computing framework. For example, agents are deployed on the nimbus node of a Storm cluster, on the JobManager node of a Flink cluster, and on the Spark client node, so as to monitor task submission and computing resources on each node.
In addition, the data access platform can also provide an on-Docker solution that lets users deploy with one click, integrating the nginx load-balancing mechanism and solutions such as Storm on Docker, Flink on Docker, and Spark on Docker.
The data access platform provided by the embodiments of the present application can intelligently select a computing framework to provide data access and preprocessing capabilities. For a data source access task submitted by a user, the SmartRouter module selects the computing framework; users can also choose a computing framework themselves and introduce a computing framework that they have deployed. The platform also integrates an on-Docker solution which, compared with manual distributed deployment, simplifies the tedious work of deploying the related services and reduces the complexity of configuration. In addition, the data access platform no longer relies on centralized processing in which data is aggregated in one place; instead, it divides data preprocessing into minimal subtasks and distributes them across the computing framework nodes. This unit-based data preprocessing scheme spreads the pressure on any single computing framework node and improves the working efficiency of the data access platform as far as possible.
Embodiment three
Based on the same inventive concept, the embodiments of the present application also provide a data processing device corresponding to the data processing method. Because the device solves the problem on a principle similar to that of the data processing method of the embodiments of the present application, the implementation of the device may refer to the implementation of the method, and repeated details are omitted.
Fig. 2 shows a structural diagram of the data processing device provided by an embodiment of the present application. The device is located in a data access platform comprising multiple computing frameworks and includes:
a receiving module 201, configured to receive a data processing request, where the data processing request carries address information of source data and a processing task to be executed on the source data;
an obtaining module 202, configured to obtain the source data according to the address information of the source data;
a computing framework selection module 203, configured to determine a computing framework for executing the processing task according to attribute information of the source data and the load status of each computing framework in the data access platform, where the attribute information of the source data includes at least data source, delay size, total data volume, and data quality;
a submission module 204, configured to submit the processing task to the computing framework, so that the computing framework processes the source data.
Optionally, the computing framework selection module 203 is specifically configured to:
determine the processing mode for the source data according to the attribute information of the source data, where the processing mode for the source data is either batch processing or real-time processing;
determine the computing framework for executing the processing task according to the processing mode for the source data and the load status of each computing framework in the data access platform.
Optionally, the computing framework selection module 203 is specifically configured to:
if it is determined that the source data is to be processed in real-time mode, identify the computing frameworks in the data access platform that process data streams in real-time stream processing mode, and select from them a computing framework whose load is below a first threshold as the computing framework for executing the processing task;
if it is determined that the source data is to be processed in batch mode, identify the computing frameworks in the data access platform that process data in batch mode, and select from them a computing framework whose load is below a second threshold as the computing framework for executing the processing task.
Optionally, the device further includes a determining module 205:
the determining module 205 is configured to determine that no user-defined computing framework exists before the computing framework for executing the processing task is determined according to the attribute information of the source data and the load status of each computing framework in the data access platform.
Optionally, the submission module 204 is further configured to:
if it is determined that a user-defined computing framework exists, submit the processing task to the user-defined computing framework, so that the user-defined computing framework processes the source data.
Embodiment four
Fig. 3 shows a schematic diagram of the hardware structure of an electronic device for implementing data processing provided by an embodiment of the present application. The electronic device includes at least one processing unit 301 and at least one storage unit 302, where the storage unit stores program code, and when the program code is executed by the processing unit, the electronic device performs the steps of the above data processing method.
Embodiment five
A computer-readable storage medium provided by an embodiment of the present application includes program code, and when the program code runs on an electronic device, the electronic device performs the steps of the above data processing method.
Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) that contain computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to the embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and any combination of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to work in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture that includes an instruction apparatus, where the instruction apparatus implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present application have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn of the basic inventive concept. The appended claims are therefore intended to be construed as covering the preferred embodiments and all changes and modifications that fall within the scope of the present application.
Obviously, those skilled in the art can make various changes and variations to the present application without departing from its spirit and scope. If these modifications and variations fall within the scope of the claims of the present application and their technical equivalents, the present application is intended to cover them as well.