Task data acquisition method and device
Technical Field
The disclosure relates to the technical field of computers, in particular to a task data acquisition method and device.
Background
On a Hadoop big data platform, MapReduce is a widely used data processing framework. The platform also provides functions for monitoring MapReduce tasks and checking execution logs. However, the success of a task does not mean that all data was processed correctly. In such cases, finer-grained execution process data needs to be collected; for example, the input and output data volumes can be counted to judge whether the data volume is correct.
At present, the MapReduce framework provides counters, which collect necessary counts during task execution so as to gather statistics such as resource consumption and the total amounts of input and output data. If other counters are needed, custom counters and counting logic can be added to the processing, and the counter values are queried after the task finishes running.
Since each counter yields only a single result value, it is difficult to meet statistical requirements when finer-grained collection is needed or when counts must be broken down by certain dimensions. For example, to count the number of rows of each input data file separately, a counter must be added for each file. If counting is performed under more conditions, such as by input file and by write date, the combinations can no longer be enumerated in advance, and counters therefore cannot be used.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present disclosure, and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
The purpose of the present disclosure is to provide a task data acquisition method and a task data acquisition apparatus, which are used to overcome, at least to some extent, the problem that it is difficult to meet the data statistics requirement during the operation of MapReduce framework due to the limitations and defects of the related art.
According to a first aspect of the embodiments of the present disclosure, there is provided a task data acquisition method, including: establishing n corresponding data collectors for the n subtasks; acquiring execution information of the subtasks, and acquiring preset statistical dimensions recorded in the data collector according to the execution information; searching a statistic value corresponding to the preset statistic dimension in the data collector; judging a statistic type corresponding to the preset statistic dimension, and updating the statistic value according to the statistic type and the execution information; and after the n subtasks are executed, summarizing the n data collectors.
In an exemplary embodiment of the present disclosure, the statistical categories include counts, extrema, summaries.
In an exemplary embodiment of the present disclosure, the updating the statistics according to the statistics category and the execution information includes:
and when the statistic type is counting, adding one to the statistic value to generate a new value, and replacing the statistic value with the new value.
In an exemplary embodiment of the present disclosure, the updating the statistics according to the statistics category and the execution information includes:
and when the statistic type is a maximum value, acquiring a value corresponding to the preset statistic dimension in the execution information, comparing the value with the statistic value, and writing the larger value of the two values into the data collector as a new statistic value.
In an exemplary embodiment of the present disclosure, the updating the statistics according to the statistics category and the execution information includes:
and when the statistic type is a minimum value, acquiring a value corresponding to the preset statistic dimension in the execution information, comparing the value with the statistic value, and writing the smaller value of the two as a new statistic value into the data collector.
In an exemplary embodiment of the present disclosure, the updating the statistics according to the statistics category and the execution information includes:
and when the statistic type is summary, acquiring a numerical value corresponding to the preset statistic dimension in the execution information, and writing the sum of the numerical value and the statistic value into the data collector as a new statistic value.
In an exemplary embodiment of the present disclosure, the aggregating the n data collectors includes:
and converting the data in the n data collectors into a preset format one by one, and writing the data into a statistical file corresponding to the task through a preset output method.
According to a second aspect of the embodiments of the present disclosure, there is provided a task data acquisition apparatus including:
the data collector establishing module is used for establishing n corresponding data collectors for the n subtasks;
the data identification module is used for acquiring the execution information of the subtasks and acquiring the preset statistical dimension recorded in the data collector according to the execution information;
the data value determining module is set to search the statistic value corresponding to the preset statistic dimension in the data collector;
a value updating module, configured to determine a statistic type corresponding to the preset statistic dimension, and update the statistic value according to the statistic type and the execution information;
the data summarizing module is set to summarize the n data collectors after the n subtasks are executed.
According to a third aspect of the present disclosure, there is provided a task data acquisition apparatus comprising: a memory; and a processor coupled to the memory, the processor configured to perform the method of any of the above based on instructions stored in the memory.
According to a fourth aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon a program which, when executed by a processor, implements a task data collection method as recited in any one of the above.
According to the embodiments of the present disclosure, task information is collected in real time for each subtask according to preset conditions during MapReduce execution, the statistical data is updated accordingly, and the statistical data is summarized after the task has been executed. In this way, finer-grained collection of MapReduce execution process data can be achieved without adding other tasks and without significantly increasing resource consumption. The MapReduce execution process can be inspected and diagnosed more comprehensively before the results are checked or the execution log is queried, so that refined monitoring of the MapReduce execution process and discovery of abnormalities can be realized.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty.
Fig. 1 schematically illustrates a flow chart of a task data collection method in an exemplary embodiment of the present disclosure.
Fig. 2 schematically illustrates a flow chart of the sub-steps of step S108 of the task data collection method in an exemplary embodiment of the present disclosure.
Fig. 3 schematically illustrates a flow chart of a task data collection method in an exemplary embodiment of the present disclosure.
Fig. 4 schematically illustrates a block diagram of a task data collection device in an exemplary embodiment of the present disclosure.
Fig. 5 schematically illustrates a block diagram of an electronic device in an exemplary embodiment of the disclosure.
Fig. 6 schematically illustrates a schematic diagram of a computer-readable storage medium in an exemplary embodiment of the disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the subject matter of the present disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and the like. In other instances, well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the present disclosure.
Further, the drawings are merely schematic illustrations of the present disclosure, in which the same reference numerals denote the same or similar parts, and thus, a repetitive description thereof will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The following detailed description of exemplary embodiments of the disclosure refers to the accompanying drawings.
Fig. 1 schematically illustrates a flow chart of a task data collection method in an exemplary embodiment of the present disclosure. Referring to fig. 1, a task data collection method 100 may include:
step S102, establishing n corresponding data collectors for the n subtasks;
step S104, acquiring the execution information of the subtasks, and acquiring the preset statistical dimension recorded in the data collector according to the execution information;
step S106, finding out a statistic corresponding to the preset statistic dimension in the data collector;
step S108, judging a statistic type corresponding to the preset statistic dimension, and updating the statistic value according to the statistic type and the execution information;
and step S110, after the n subtasks are executed, summarizing the n data collectors.
According to the embodiments of the present disclosure, task information is collected in real time for each subtask according to preset conditions during MapReduce execution, the statistical data is updated accordingly, and the statistical data is summarized after the task has been executed. In this way, finer-grained collection of MapReduce execution process data can be achieved without adding other tasks and without significantly increasing resource consumption. The MapReduce execution process can be inspected and diagnosed more comprehensively before the results are checked or the execution log is queried, so that refined monitoring of the MapReduce execution process and discovery of abnormalities can be realized.
The steps of the task data collection method 100 will be described in detail below.
In one embodiment of the technical solution of the present disclosure, a data collector (for example, a hash table) is established in each subtask of a MapReduce job, and a set is maintained in the data collector for storing the collected data. When the task is completed, the data in the data collector is written to the HDFS file system through a preset output mode, and the collected data is then merged in the HDFS file system for subsequent monitoring and checking.
In step S102, n corresponding data collectors are established for the n subtasks.
Taking a MapReduce task as an example, a data collector can be constructed for each subtask when the Mapper and Reducer are initialized, and preset statistical dimensions are set for each subtask. For example, basic information of the subtask, such as the JobId, JobName, input file path and name, task phase, and Mapper or Reducer implementation class name, may be initialized as data collection dimensions. Various preset statistical dimensions are possible, and those skilled in the art can set them according to actual requirements.
In addition, a hash table may be constructed in each data collector to maintain the set of collected data. The preset data dimensions of the data to be collected are written into a List that serves as the key name of the hash table, and the numerical value or other data value to be collected is written into the hash table as the key value corresponding to that key name.
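The collector structure described above can be sketched as follows. This is only an illustrative sketch in Python (an actual Hadoop job would normally be written in Java); the names make_collector, key_for, base_dims, and table are hypothetical, and a tuple is used for the key name because a Python list cannot be used as a hash table key.

```python
from typing import Tuple

Key = Tuple[str, ...]  # ordered dimension values, used as the key name


def make_collector(job_id: str, job_name: str, phase: str) -> dict:
    """Build one collector for one subtask (hypothetical structure)."""
    return {
        # Basic task information, stored once and prepended to every key name.
        "base_dims": (job_id, job_name, phase),
        # Hash table: key name (dimension values) -> key value (statistic).
        "table": {},
    }


def key_for(collector: dict, *dims: str) -> Key:
    """Combine the base dimensions with per-record dimensions into a key name."""
    return collector["base_dims"] + tuple(dims)


collector = make_collector("job_001", "daily_import", "map")
key = key_for(collector, "rows", "part-00000.txt", "2020-01-01")
collector["table"][key] = 0  # statistic for (input file, write date) starts at 0
```

Because the key name is built from dimension values at run time, combinations such as input file and write date do not have to be enumerated before the task starts.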
In some embodiments, the task output may also be constructed in a multi-output manner, for example with MultipleOutputs, before the task is submitted. In addition to the output data of the task itself, a file output format is set for the data collector; for example, the addNamedOutput method is used to declare the output format (for example, Text format) of the data collection file, so that the data collection files can be merged after the task is completed.
In step S104, the execution information of the subtask is obtained, and the preset statistical dimension recorded in the data collector is obtained according to the execution information.
In step S106, the statistics corresponding to the preset statistical dimension is found in the data collector.
In step S108, a statistical type corresponding to the preset statistical dimension is determined, and the statistical value is updated according to the statistical type and the execution information.
During the execution of the subtask, a plurality of pieces of execution information may be acquired. The preset statistical dimensions recorded in the subtask's corresponding data collector are searched for in each piece of execution information, and when one or more of the preset statistical dimensions are found in a piece of execution information, that piece of execution information is further processed.
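One possible way to perform this lookup is sketched below in Python. It assumes that each piece of execution information is a record of named fields and that each preset statistical dimension lists the fields it needs; the dimension names and field names are hypothetical and are not taken from the disclosure.

```python
from typing import Dict, List, Tuple

# Each preset statistical dimension is described by the record fields it
# requires (hypothetical names chosen for illustration).
PRESET_DIMENSIONS: Dict[str, Tuple[str, ...]] = {
    "rows_per_file": ("input_file",),
    "rows_per_file_and_date": ("input_file", "write_date"),
    "max_amount": ("amount",),
}


def matched_dimensions(execution_info: dict) -> List[str]:
    """Return the preset dimensions whose fields all appear in the record."""
    return [
        name
        for name, fields in PRESET_DIMENSIONS.items()
        if all(field in execution_info for field in fields)
    ]


record = {"input_file": "part-00000.txt", "amount": 12.5}
matched = matched_dimensions(record)
print(matched)  # ['rows_per_file', 'max_amount']
```

A record that matches no preset dimension returns an empty list and needs no further processing.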
Firstly, the type of the preset statistical dimension related to the execution information can be confirmed, and the statistical type corresponding to each preset statistical dimension is judged according to the preset value in the data collector. A plurality of statistical categories may be preset in each data collector and a data update method corresponding to each statistical category is provided.
In embodiments of the present disclosure, the statistical categories provided by the data collector may include, for example, counting, extremum determination (maximum or minimum), and summary (summation). In other embodiments, the data collector may provide further statistical categories together with corresponding executable data updating methods, which is not limited by the present disclosure.
FIG. 2 is a diagram illustrating the sub-steps of step S108 in one embodiment.
Referring to fig. 2, step S108 may include:
step S1081, judging a statistic type corresponding to a preset statistic dimension;
step S1082, when the statistic type is counting, adding one to the statistic value to generate a new value, and replacing the statistic value with the new value;
step S1083, when the statistic type is maximum, obtaining a value corresponding to a preset statistic dimension in the execution information, comparing the value with the statistic value, and writing the larger of the two as a new statistic value into a data collector;
step S1084, when the statistic category is a minimum value, obtaining a value corresponding to a preset statistic dimension from the execution information, comparing the value with the statistic value, and writing the smaller of the two as a new statistic value into a data collector;
step S1085, when the statistics category is summary, obtaining a value corresponding to a preset statistics dimension from the execution information, and writing the sum of the value and the statistics value as a new statistics value into the data collector.
In one embodiment, each data updating method takes two parameters. The first parameter is a preset statistical dimension, that is, the key name of the hash table mentioned above. The second parameter is a value: if the statistical category is counting, it defaults to 1 and need not be supplied; for the other statistical categories, a specific value is supplied as needed, such as the value associated with the preset statistical dimension in the latest task execution information.
For example, the methods corresponding to the statistical categories of counting, maximum, minimum, and summary may be named collectCount, collectMax, collectMin, and collectSum, respectively. The statistics are updated according to the following logic:
collectCount: taking the first parameter as the key name, find the corresponding key value in the hash table, add one to the key value to form a new key value, and then write the first parameter and the new key value into the hash table.
collectMax: taking the first parameter as the key name, find the corresponding key value in the hash table, obtain the numerical value corresponding to the first parameter from the execution information as the second parameter, and write the larger of the key value and the second parameter into the hash table as the new key value corresponding to the first parameter. collectMin is implemented similarly to collectMax, except that the smaller of the key value and the second parameter is used as the new key value.
collectSum: taking the first parameter as the key name, find the corresponding key value in the hash table, obtain the numerical value corresponding to the first parameter from the execution information as the second parameter, and write the sum of the key value and the second parameter into the hash table as the new key value corresponding to the first parameter.
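The four updating methods can be sketched as below. This is an illustrative Python sketch rather than the implementation of the disclosure: the class name DataCollector and the snake_case method names are hypothetical counterparts of collectCount, collectMax, collectMin, and collectSum. The handling of a key name that is not yet in the hash table is an assumption, since the text does not specify it: counts and sums start from 0, and the first observed value becomes the initial maximum or minimum.

```python
from typing import Dict, Hashable, Union

Number = Union[int, float]


class DataCollector:
    """Per-subtask collector backed by a hash table (illustrative sketch)."""

    def __init__(self) -> None:
        self.table: Dict[Hashable, Number] = {}

    def collect_count(self, key: Hashable, value: int = 1) -> None:
        # Second parameter defaults to 1; a missing key starts from 0 (assumption).
        self.table[key] = self.table.get(key, 0) + value

    def collect_max(self, key: Hashable, value: Number) -> None:
        # Keep the larger of the stored value and the new value.
        current = self.table.get(key)
        self.table[key] = value if current is None else max(current, value)

    def collect_min(self, key: Hashable, value: Number) -> None:
        # Keep the smaller of the stored value and the new value.
        current = self.table.get(key)
        self.table[key] = value if current is None else min(current, value)

    def collect_sum(self, key: Hashable, value: Number) -> None:
        # Add the new value to the stored value; a missing key starts from 0.
        self.table[key] = self.table.get(key, 0) + value


c = DataCollector()
c.collect_count(("rows", "a.txt"))
c.collect_count(("rows", "a.txt"))
for amount in (3, 7, 5):
    c.collect_max("amount_max", amount)
for amount in (3, 7, 1):
    c.collect_min("amount_min", amount)
c.collect_sum("amount_sum", 2.5)
c.collect_sum("amount_sum", 4)
```

Initializing extrema from the first observation avoids having to pick a sentinel such as positive or negative infinity, which would otherwise leak into the output for keys that receive no values.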
In addition to the above examples, other statistical categories such as filtering and classification statistics may be set. Since one piece of task execution information may involve a plurality of preset statistical dimensions, data can be extracted from it from different angles according to those dimensions, and the statistical data of each preset statistical dimension can be updated.
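To illustrate how one piece of execution information can update several dimensions at once, the following Python sketch processes three records. The record fields (input_file, write_date, amount) and the key layout are hypothetical examples, not part of the disclosure.

```python
from collections import defaultdict

counts = defaultdict(int)    # counting statistics
sums = defaultdict(float)    # summary (summation) statistics
maxima = {}                  # maximum statistics


def collect(record: dict) -> None:
    """Update every statistic that this record contributes to."""
    # Dimension 1: rows per input file.
    counts[("rows", record["input_file"])] += 1
    # Dimension 2: rows per input file and write date.
    counts[("rows", record["input_file"], record["write_date"])] += 1
    # Dimension 3: total amount per write date.
    sums[("amount_sum", record["write_date"])] += record["amount"]
    # Dimension 4: largest amount per write date.
    key = ("amount_max", record["write_date"])
    maxima[key] = max(maxima.get(key, record["amount"]), record["amount"])


for rec in [
    {"input_file": "a.txt", "write_date": "2020-01-01", "amount": 10.0},
    {"input_file": "a.txt", "write_date": "2020-01-02", "amount": 4.0},
    {"input_file": "b.txt", "write_date": "2020-01-01", "amount": 6.5},
]:
    collect(rec)
```

Each record touches four keys here, so the per-file count, the per-file-and-date count, and the per-date sum and maximum are all maintained in a single pass over the data.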
In step S110, after the n subtasks are executed, the n data collectors are summarized.
In some embodiments, summarizing the n data collectors may include converting the data in the n data collectors into a preset format one by one, and writing the converted data into a statistical file corresponding to the task through a preset output method.
Still taking the MapReduce task execution process as an example, in the cleanup stage of the MapReduce task, the data in the hash table may be taken out and serialized item by item into a preset format (for example, Text format). The serialization joins each dimension value in the key name and the key value into a Text string separated by commas, which is then output through MultipleOutputs into the temporary directory corresponding to the task.
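The comma-separated serialization described above can be sketched in Python as follows. The function name serialize and the sample keys are hypothetical; the sketch also assumes that dimension values contain no commas, which the text does not address.

```python
from typing import Dict, List, Tuple, Union

Number = Union[int, float]


def serialize(table: Dict[Tuple[str, ...], Number]) -> List[str]:
    """Turn each (key name, key value) entry into one comma-separated line."""
    lines = []
    for dims, value in table.items():
        # Assumption: dimension values contain no commas; otherwise a real
        # job would need escaping or a different delimiter.
        lines.append(",".join(list(dims) + [str(value)]))
    return lines


table = {
    ("job_001", "map", "rows", "a.txt"): 2,
    ("job_001", "map", "amount_max", "2020-01-01"): 10.0,
}
output_lines = serialize(table)
for line in output_lines:
    print(line)
# job_001,map,rows,a.txt,2
# job_001,map,amount_max,2020-01-01,10.0
```

In an actual job, each line would be written through the multi-output mechanism instead of being printed.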
In the commit stage of task execution, FileOutputCommitter is overridden so that, on the one hand, the data collection files generated by the task are merged into one file (HDFS provides merging support for text files), and, on the other hand, the merged data file is copied from the temporary directory to a specific data collection directory.
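The merge-then-copy behavior of the commit stage can be imitated on a local file system, as in the Python sketch below. This is a stand-in for illustration only: it uses the local file system instead of HDFS and does not involve FileOutputCommitter; the function name merge_and_publish, the part-* file naming, and the directory names are assumptions.

```python
import shutil
import tempfile
from pathlib import Path


def merge_and_publish(temp_dir: Path, final_dir: Path, name: str = "collected.txt") -> Path:
    """Merge all part files in temp_dir into one file, then copy it to final_dir."""
    merged = temp_dir / name
    with merged.open("w", encoding="utf-8") as out:
        # Sorted order keeps the merged file deterministic.
        for part in sorted(temp_dir.glob("part-*")):
            out.write(part.read_text(encoding="utf-8"))
    final_dir.mkdir(parents=True, exist_ok=True)
    target = final_dir / name
    shutil.copy(merged, target)
    return target


with tempfile.TemporaryDirectory() as root:
    tmp = Path(root) / "_temporary"
    tmp.mkdir()
    # One collection file per subtask (hypothetical contents).
    (tmp / "part-m-00000").write_text("job_001,map,rows,a.txt,2\n", encoding="utf-8")
    (tmp / "part-m-00001").write_text("job_001,map,rows,b.txt,1\n", encoding="utf-8")
    result = merge_and_publish(tmp, Path(root) / "collect")
    merged_text = result.read_text(encoding="utf-8")
```

Merging only after all subtasks have finished means that each subtask writes to its own file and no coordination between subtasks is needed while they run.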
In subsequent processing, the generated data file can be loaded into a Hive table and then queried and analyzed with SQL.
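The kind of SQL analysis mentioned above can be tried out with the sketch below. It deliberately swaps in an in-memory SQLite database from the Python standard library in place of Hive, so that the example runs without a cluster; the table name, column names, and sample lines are hypothetical. It also assumes that several subtasks may report the same key name, in which case the query combines their counts.

```python
import sqlite3

# Lines as they might appear in the merged data collection file (hypothetical).
lines = [
    "job_001,map,rows,a.txt,2",
    "job_001,map,rows,a.txt,3",   # same key name reported by another subtask
    "job_001,map,rows,b.txt,1",
]

conn = sqlite3.connect(":memory:")  # SQLite stands in for Hive here
conn.execute(
    "CREATE TABLE collected "
    "(job_id TEXT, phase TEXT, metric TEXT, input_file TEXT, value REAL)"
)
conn.executemany(
    "INSERT INTO collected VALUES (?, ?, ?, ?, ?)",
    [line.split(",") for line in lines],
)
# Combine the per-subtask counts into one row count per input file.
rows = conn.execute(
    "SELECT input_file, SUM(value) FROM collected "
    "WHERE metric = 'rows' GROUP BY input_file ORDER BY input_file"
).fetchall()
conn.close()
print(rows)  # [('a.txt', 5.0), ('b.txt', 1.0)]
```

For maximum and minimum statistics the same pattern would apply with MAX and MIN in place of SUM.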
FIG. 3 is a flow chart of a task data collection method in one embodiment of the present disclosure.
Referring to fig. 3, a task data collection process may include, for example:
step S31, when constructing the MapReduce Job, using MultipleOutputs, and adding the result output format for data collection through addNamedOutput;
step S32, constructing a data collector when the Mapper and the Reducer are initialized, and storing task information that can be selected as data collection dimensions, wherein the task information can be basic task information such as the JobId, JobName, input file path and name, task phase, and Mapper or Reducer implementation class name;
step S33, during the execution of the Mapper and Reducer, collecting counts, extreme values, and summary values respectively through methods of the collector such as collectCount, collectMax, collectMin, and collectSum;
step S34, when the Mapper and Reducer finish, writing the data collected in the collector into the temporary collection file through the MultipleOutputs.write method in the cleanup method;
and step S35, overriding FileOutputCommitter, merging the data collection files in the temporary directory when the MapReduce task is completed, and transferring the merged data collection file to the final data collection directory.
According to the embodiments of the present disclosure, the data collector is constructed as a data set during MapReduce task execution and supports data collection across multiple statistical dimensions and multiple statistical categories. When the task is completed, the data in the data collector is written to the HDFS file system in a multi-output manner, which addresses the problem that data statistics requirements are difficult to meet while the MapReduce framework is running.
Corresponding to the method embodiment, the present disclosure also provides a task data acquisition device, which may be used to execute the method embodiment.
Fig. 4 schematically illustrates a block diagram of a task data collection device in an exemplary embodiment of the present disclosure.
Referring to fig. 4, the task data collecting apparatus 400 may include:
a data collector creation module 402 configured to create n data collectors corresponding to the n subtasks;
a data identification module 404 configured to acquire execution information of the subtasks, and acquire a preset statistical dimension recorded in the data collector according to the execution information;
a data determining module 406 configured to find a statistic corresponding to the preset statistic dimension in the data collector;
a value updating module 408 configured to determine a statistic category corresponding to the preset statistic dimension, and update the statistic value according to the statistic category and the execution information;
and a data summarization module 410 configured to summarize the n data collectors after the n subtasks are executed.
In an exemplary embodiment of the present disclosure, the statistical categories include counts, extrema, summaries.
In an exemplary embodiment of the present disclosure, the value update module 408 is configured to:
when the statistic type is counting, adding one to the statistic value to generate a new value, and replacing the statistic value with the new value;
when the statistic type is a maximum value, acquiring a value corresponding to the preset statistic dimension in the execution information, comparing the value with the statistic value, and writing the larger of the two into the data collector as a new statistic value;
when the statistic type is a minimum value, acquiring a value corresponding to the preset statistic dimension in the execution information, comparing the value with the statistic value, and writing the smaller of the two into the data collector as a new statistic value;
and when the statistic type is summary, acquiring a numerical value corresponding to the preset statistic dimension in the execution information, and writing the sum of the numerical value and the statistic value into the data collector as a new statistic value.
In an exemplary embodiment of the present disclosure, the data summarization module 410 is configured to:
and converting the data in the n data collectors into a preset format one by one, and writing the data into a statistical file corresponding to the task through a preset output method.
Since the functions of the apparatus 400 have been described in detail in the corresponding method embodiments, the disclosure is not repeated herein.
It should be noted that although several modules or units of the device for action execution are mentioned in the above detailed description, such a division is not mandatory. Indeed, according to embodiments of the present disclosure, the features and functions of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided so as to be embodied by a plurality of modules or units.
In an exemplary embodiment of the present disclosure, an electronic device capable of implementing the above method is also provided.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or program product. Thus, various aspects of the invention may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.) or an embodiment combining hardware and software aspects that may all generally be referred to herein as a "circuit," "module," or "system."
An electronic device 500 according to this embodiment of the invention is described below with reference to fig. 5. The electronic device 500 shown in fig. 5 is only an example and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 5, the electronic device 500 is embodied in the form of a general purpose computing device. The components of the electronic device 500 may include, but are not limited to: the at least one processing unit 510, the at least one memory unit 520, and a bus 530 that couples various system components including the memory unit 520 and the processing unit 510.
The storage unit stores program code that is executable by the processing unit 510 to cause the processing unit 510 to perform steps according to various exemplary embodiments of the present invention as described in the above section "exemplary methods" of the present specification. For example, the processing unit 510 may execute step S102 as shown in fig. 1: establishing n corresponding data collectors for the n subtasks; step S104: acquiring execution information of the subtasks, and acquiring preset statistical dimensions recorded in the data collector according to the execution information; step S106: searching a statistic value corresponding to the preset statistic dimension in the data collector; and step S108: judging the statistic type corresponding to the preset statistic dimension, and updating the statistic value according to the statistic type and the execution information.
The memory unit 520 may include a readable medium in the form of a volatile memory unit, such as a random access memory unit (RAM) 5201 and/or a cache memory unit 5202, and may further include a read only memory unit (ROM) 5203.
Storage unit 520 may also include a program/utility 5204 having a set (at least one) of program modules 5205, such program modules 5205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 530 may be one or more of any of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 500 may also communicate with one or more external devices 700 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 500, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 500 to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interfaces 550. Also, the electronic device 500 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) via the network adapter 560. As shown, the network adapter 560 communicates with the other modules of the electronic device 500 over the bus 530. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 500, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB disk, a removable hard disk, etc.) or on a network, and which includes several instructions to enable a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
In an exemplary embodiment of the present disclosure, there is also provided a computer-readable storage medium having stored thereon a program product capable of implementing the above-described method of the present specification. In some possible embodiments, aspects of the invention may also be implemented in the form of a program product comprising program code means for causing a terminal device to carry out the steps according to various exemplary embodiments of the invention described in the above section "exemplary methods" of the present description, when said program product is run on the terminal device.
Referring to fig. 6, a program product 600 for implementing the above method according to an embodiment of the present invention is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present invention is not limited in this regard and, in the present document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
Furthermore, the above-described figures are merely schematic illustrations of processes involved in methods according to exemplary embodiments of the invention, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.