
CN111857538A - Data processing method, device and storage medium - Google Patents


Info

Publication number
CN111857538A
Authority
CN
China
Prior art keywords
data
reducer
processed
merged
read operation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910337777.XA
Other languages
Chinese (zh)
Other versions
CN111857538B (en)
Inventor
范振
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Wodong Tianjun Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Wodong Tianjun Information Technology Co Ltd filed Critical Beijing Wodong Tianjun Information Technology Co Ltd
Priority to CN201910337777.XA priority Critical patent/CN111857538B/en
Publication of CN111857538A publication Critical patent/CN111857538A/en
Application granted granted Critical
Publication of CN111857538B publication Critical patent/CN111857538B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061 Improving I/O performance
    • G06F3/0611 Improving I/O performance in relation to response time
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671 In-line storage system
    • G06F3/0673 Single storage device
    • G06F3/0674 Disk device
    • G06F3/0676 Magnetic disk device

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the application provides a data processing method, a data processing device and a storage medium, wherein the method comprises the following steps: acquiring data to be processed, wherein the data to be processed is file data written by a first executor, the number of files corresponding to the file data is N times the number of Reducers, N is an integer greater than 1, and the Reducers are located in a second executor; merging the data to be processed corresponding to the same Reducer to obtain merged data, wherein the number of pieces of merged data is smaller than the number of files corresponding to the file data; receiving a first read operation sent by the second executor, wherein the first read operation is used to read the merged data corresponding to the Reducer; and sending the merged data corresponding to the Reducer according to the first read operation. The embodiment of the application can reduce the number of random reads, thereby accelerating the shuffle read process and reducing disk IO resource and network consumption.

Description

Data processing method, device and storage medium
Technical Field
The embodiment of the application relates to the technical field of information processing, in particular to a data processing method, a data processing device and a storage medium.
Background
In the MapReduce framework, the shuffle is the bridge connecting Map and Reduce: the output of Map must pass through the shuffle before it can be used by Reduce, so the performance of the shuffle directly affects the performance and throughput of the whole framework.
Spark, as an implementation of the MapReduce framework, naturally also implements the shuffle logic. When Spark runs an Application containing a shuffle process, an Executor is responsible not only for running tasks, but also for writing the data generated by a task to disk after the task finishes, and for providing that shuffle data to other Executors. Specifically, the shuffle in Spark comprises a write phase and a read phase, i.e. shuffle write and shuffle read. During shuffle write, a Mapper in the Executor partitions its output according to the number of Reducers in the other Executors, then stores the partitioned data in a dedicated Spark directory, which is managed and served by a peripheral component, the external shuffle service (ESS). During shuffle read, each Reducer reads the data corresponding to it from every Mapper, which causes a large number of random reads, makes the whole shuffle read process very slow, and wastes a great deal of disk IO resources and network bandwidth.
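As a concrete illustration of the read amplification described above, the following minimal Python sketch counts the random reads generated when M Mappers each write one file per Reducer. The file names and counts are hypothetical, not Spark's actual shuffle file layout:

```python
# Minimal sketch (hypothetical file names, not Spark's real layout):
# M Mappers each write R partition files during shuffle write; each of
# the R Reducers must then issue one random read per Mapper during
# shuffle read, giving M * R random reads in total.
M, R = 4, 3  # 4 Mappers, 3 Reducers

# Shuffle write: one file per (Mapper, Reducer) pair.
files = {(m, r): f"shuffle_{m}_{r}.data" for m in range(M) for r in range(R)}

# Shuffle read: Reducer r touches M separate files.
reads = [files[(m, r)] for r in range(R) for m in range(M)]
assert len(reads) == M * R  # 12 random reads for only 3 Reducers
```

With realistic M and R in the thousands, the same arithmetic yields millions of small random reads, which is the bottleneck the merging scheme below targets.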
Disclosure of Invention
The embodiments of the application provide a data processing method, a data processing device and a storage medium that accelerate the shuffle read process and reduce unnecessary disk IO (input/output) resource and network consumption.
In a first aspect, an embodiment of the present application provides a data processing method, including:
Acquiring data to be processed, wherein the data to be processed is file data written by a first executor, the number of files corresponding to the file data is N times the number of Reducers, N is an integer greater than 1, and the Reducers are located in a second executor;
merging the data to be processed corresponding to the same Reducer to obtain merged data, wherein the number of pieces of merged data is smaller than the number of files corresponding to the file data;
receiving a first reading operation sent by the second executor, wherein the first reading operation is used for reading the merged data corresponding to the Reducer;
and sending the merged data corresponding to the Reducer according to the first reading operation.
In a possible implementation, the number of merged data is equal to the number of reducers.
In one possible implementation, each merged data corresponds to a block id of a corresponding Reducer.
In a possible implementation manner, after merging the to-be-processed data corresponding to the same Reducer to obtain merged data, the method may further include: updating the data to be processed.
In a possible implementation manner, the data to be processed is metadata corresponding to at least one application.
In a possible implementation, each application corresponds to a thread, the life cycle of the thread is the same as that of the corresponding application, and the thread is used for periodically scanning the data structure related to the corresponding application and performing the merging processing operation each time new file data is formed.
In one possible implementation, the thread employs a timer as a driver.
In a possible implementation, the data processing method may further include:
if an abnormality is detected during the merging process, receiving a second read operation sent by the second executor, wherein the second read operation is used to read the pre-merge data corresponding to the Reducer;
and sending the pre-merge data corresponding to the Reducer according to the second read operation.
In a second aspect, an embodiment of the present application provides a data processing apparatus, including:
an acquisition module, configured to acquire data to be processed, wherein the data to be processed is file data written by a first executor, the number of files corresponding to the file data is N times the number of Reducers, N is an integer greater than 1, and the Reducers are located in a second executor;
The processing module is used for merging the data to be processed corresponding to the same Reducer to obtain merged data, and the number of the merged data is less than that of files corresponding to the file data;
a receiving module, configured to receive a first read operation sent by the second executor, where the first read operation is used to read merged data corresponding to the Reducer;
and the sending module is used for sending the merged data corresponding to the Reducer according to the first reading operation.
In a possible implementation, the number of merged data is equal to the number of reducers.
In a possible implementation, the processing module may be further configured to: and after the data to be processed corresponding to the same Reducer are merged and the merged data are obtained, updating the data to be processed.
In a possible implementation, the data processing apparatus may further include:
the monitoring module is used for monitoring whether an exception occurs in the merging processing process;
correspondingly, the receiving module is further configured to receive a second read operation sent by the second executor when the monitoring module detects that an abnormality has occurred during the merging process, where the second read operation is used to read the pre-merge data corresponding to the Reducer;
And the sending module is further used for sending the data before combination corresponding to the Reducer according to the second reading operation.
In a third aspect, an embodiment of the present application provides a data processing apparatus, including: a memory and a processor;
wherein the memory is to store program instructions;
the processor is configured to call and execute the program instructions stored in the memory, and when the processor executes the program instructions stored in the memory, the data processing apparatus is configured to execute the method according to any implementation manner of the first aspect.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium, which stores instructions that, when executed on a computer, cause the computer to perform the method according to any implementation manner of the first aspect.
The data processing method, device and storage medium provided by the embodiments of the application acquire data to be processed, wherein the data to be processed is file data written by a first executor, the number of files corresponding to the file data is N times the number of Reducers, N is an integer greater than 1, and the Reducers are located in a second executor; merge the data to be processed corresponding to the same Reducer to obtain merged data, wherein the number of pieces of merged data is smaller than the number of files corresponding to the file data; receive a first read operation sent by the second executor, wherein the first read operation is used to read the merged data corresponding to the Reducer; and send the merged data corresponding to the Reducer according to the first read operation. Because the data to be processed corresponding to the same Reducer is merged and the number of pieces of merged data is smaller than the number of files corresponding to the file data, the number of random reads can be reduced, thereby accelerating the shuffle read process and reducing disk IO (input/output) resource and network consumption.
Drawings
Fig. 1 is a schematic view of an application scenario provided in an embodiment of the present application;
fig. 2 is a schematic flowchart of a data processing method according to an embodiment of the present application;
fig. 3 is a schematic flowchart of a data processing method according to another embodiment of the present application;
fig. 4 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a data processing apparatus according to another embodiment of the present application;
fig. 6 is a schematic structural diagram of a data processing apparatus according to yet another embodiment of the present application.
Detailed Description
It should be understood that the numbers "first" and "second" in the embodiments of the present application are used for distinguishing similar objects, and are not necessarily used for describing a specific order or sequence order, and should not be construed as limiting the embodiments of the present application in any way.
The "and/or" mentioned in the embodiments of the present application describes an association relationship of associated objects, and indicates that three relationships may exist, for example, a and/or B may indicate: a exists alone, A and B exist simultaneously, and B exists alone, wherein A and B can be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
First, an application scenario and a part of vocabulary related to the embodiments of the present application will be described.
Kubernetes: a container cluster orchestration and management system open-sourced by Google.
Spark: a new-generation distributed in-memory computing framework and a top-level Apache open-source project. Compared with the Hadoop MapReduce computing framework, Spark keeps intermediate computation results in memory, which speeds up computation by 10-100 times. The shuffle is a specific stage (phase) in the MapReduce framework, sitting between the Map phase and the Reduce phase: when the output of Map is to be used by Reduce, it must be hashed by key and distributed to each Reducer; this process is the shuffle.
ESS (External Shuffle Service): a peripheral component of Spark; executors register the location of data written during shuffle write with this component, which then provides shuffle read services.
Application: corresponds to a user job in Spark.
Fig. 1 is a schematic view of an application scenario provided in an embodiment of the present application. As shown in fig. 1, the application scenario of the embodiment of the present application may include, but is not limited to: executor 1, executor 2, ESS 1, ESS 2, and executor 3. Of course, the application scenario of the embodiment of the present application does not limit the number of executors and ESSs; this embodiment merely takes three executors and two ESSs as an example.
The executor 1 executes Task 0 and Task 1, i.e. the Map task set (MapTask set) executed by executor 1 comprises task 0 and task 1; executor 2 executes task 2 and task 3, i.e. the Map task set executed by executor 2 comprises task 2 and task 3; executor 1 and executor 2 are mutually independent and do not interfere with each other. In addition, executors 1 and 2 are also responsible for shuffle write, that is, writing the data generated by executing their tasks into the ESS: executor 1 writes the data generated by task 0 and task 1 into ESS 1, executor 2 writes the data generated by task 2 and task 3 into ESS 2, and ESS 1 and ESS 2 are independent of each other and do not interfere with each other.
The executor 3 executes task 4, task 5 and task 6 to read the shuffle data, that is, the Reduce task set (ReduceTask set) executed by executor 3 comprises task 4, task 5 and task 6. Specifically, executor 3 may read the data required to execute task 4, task 5 and task 6 from ESS 1 and ESS 2, respectively.
In this application scenario, the end where shuffle data is written is called the Map end, and each task that generates data at the Map end is called a Mapper; for example, task 0, task 1, task 2 and task 3 are all Mappers. Correspondingly, the end where shuffle data is read is called the Reduce end, and each task at the Reduce end that reads data is called a Reducer; for example, task 4, task 5 and task 6 are all Reducers.
In the following embodiments, the executor that writes the shuffle data is referred to as a first executor, the executor that reads the shuffle data is referred to as a second executor, and the shuffle data written in the ESS is referred to as to-be-processed data.
For the prior art, exemplarily, when an application containing a shuffle process is run and the application corresponds to M Mappers and R Reducers, the whole shuffle read process generates M × R random reads. In practical applications both M and R are very large, so the number of random reads grows multiplicatively; the consumption of disk IO resources is therefore huge and the shuffle read process takes a long time.
Based on the above problems, in the ESS, the embodiments of the present application perform merging processing on the to-be-processed data written into the ESS through the shuffle write process, thereby reducing the consumption of the disk IO resources, reducing the performance loss caused by network packet transmission, and improving the operation efficiency.
The following describes the technical solutions of the present application and how to solve the above technical problems with specific examples. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments.
Fig. 2 is a schematic flow chart of a data processing method according to an embodiment of the present application. The embodiment of the application provides a data processing method, which can be executed by a data processing device, and the data processing device can be realized by software and/or hardware. Illustratively, the data processing device may be embodied as an ESS, or the data processing device may be embedded in an ESS, such as a module or chip in the ESS.
As shown in fig. 2, the method of the embodiment of the present application includes:
s201, acquiring data to be processed.
The data to be processed is file data written by the first executor, the number of files corresponding to the file data is N times the number of Reducers, N is an integer greater than 1, and the Reducers are located in the second executor.
Still taking the application scenario shown in fig. 1 as an example, ESS 1 receives the file data P0, P1 and P2 written by the first executor (i.e., executor 1), i.e., it obtains the data to be processed. It should be noted that, for the same Mapper, the number of files corresponding to its file data equals the number of Reducers in the second executor (i.e., executor 3). If ESS 1 corresponds to N Mappers, the number of files corresponding to the file data received by ESS 1 is N times the number of Reducers.
Considering that in practical applications the data generated by all Mappers corresponding to an ESS may occupy more space than the memory of the ESS, in this case the file data P0, P1 and P2 are written into the ESS first, and the remaining part is written afterwards; for example, if the remaining part is represented as file data P01, P11 and P21, then P01, P11 and P21 are written into the ESS subsequently. The number of files corresponding to the file data contained in the remaining part is also a natural-number multiple of the number of Reducers.
S202, merging the data to be processed corresponding to the same Reducer to obtain merged data.
The number of pieces of merged data is smaller than the number of files corresponding to the file data. Since the second executor may read the shuffle data in the ESS either while merging is in progress or after it completes, the merged data in this step may be data after merging has completed (all data merged) or data during merging (part merged, part still pre-merge). Therefore the number of pieces of merged data is smaller than the number of files corresponding to the file data, but greater than or equal to the number of Reducers.
For example, in the case of successfully performing the merging process, M × R files are finally merged into R files.
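The merge step can be sketched as follows. This is a simplified, framework-free Python illustration under assumed file names, not the patent's actual ESS implementation:

```python
import os
import tempfile

# Hedged sketch of the merge step described above: every per-Mapper file
# belonging to the same Reducer is concatenated into a single merged file,
# so M * R files become R files. All file names are hypothetical.
def merge_by_reducer(workdir, num_mappers, num_reducers):
    merged = {}
    for r in range(num_reducers):
        out_path = os.path.join(workdir, f"merged_{r}.data")
        with open(out_path, "wb") as out:
            for m in range(num_mappers):  # concatenate in Mapper order
                src = os.path.join(workdir, f"shuffle_{m}_{r}.data")
                with open(src, "rb") as f:
                    out.write(f.read())
        merged[r] = out_path
    return merged  # one merged file per Reducer

with tempfile.TemporaryDirectory() as d:
    M, R = 3, 2  # simulate the shuffle write output of 3 Mappers
    for m in range(M):
        for r in range(R):
            with open(os.path.join(d, f"shuffle_{m}_{r}.data"), "wb") as f:
                f.write(f"m{m}r{r};".encode())
    merged = merge_by_reducer(d, M, R)
    with open(merged[0], "rb") as f:
        assert f.read() == b"m0r0;m1r0;m2r0;"  # Reducer 0's data, merged
assert len(merged) == R  # M * R = 6 files became 2
```

After this step each Reducer issues a single sequential read per ESS instead of one random read per Mapper.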
S203, receiving a first read operation sent by the second executor.
The first read operation is used to read the merged data corresponding to the Reducer.
It should be noted that, when reading data, the merged data needs to be transmitted over multiple threads; otherwise the bandwidth is not fully utilized, wasting resources, and the transmission time of a single large file becomes too long, which also greatly degrades data processing performance.
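A minimal sketch of such chunked, multi-threaded transmission; the chunk size, the in-memory `sink`, and the `send` stand-in are illustrative assumptions, not the actual network stack:

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch of serving one merged file over several threads: split it into
# chunks and hand each chunk to a worker, so a single large file does not
# occupy one connection for its entire transmission time.
CHUNK = 4  # illustrative chunk size in bytes

def send(offset, chunk, sink):
    sink[offset] = chunk  # stand-in for a network write

def transmit(data, workers=3):
    sink = {}
    chunks = [(i, data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for offset, chunk in chunks:
            pool.submit(send, offset, chunk, sink)
    # Reassemble in offset order to verify nothing was lost or reordered.
    return b"".join(sink[o] for o in sorted(sink))

assert transmit(b"merged shuffle payload") == b"merged shuffle payload"
```

Keyed offsets let out-of-order worker completion be reassembled deterministically, which is why the sketch indexes chunks by byte offset rather than appending to a list.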
And S204, sending the merged data corresponding to the Reducer according to the first reading operation.
In this embodiment, data to be processed is acquired, where the data to be processed is file data written by a first executor, the number of files corresponding to the file data is N times the number of Reducers, N is an integer greater than 1, and the Reducers are located in a second executor; the data to be processed corresponding to the same Reducer is merged to obtain merged data, where the number of pieces of merged data is smaller than the number of files corresponding to the file data; a first read operation sent by the second executor is received, where the first read operation is used to read the merged data corresponding to the Reducer; and the merged data corresponding to the Reducer is sent according to the first read operation. Because the data to be processed corresponding to the same Reducer is merged and the number of pieces of merged data is smaller than the number of files corresponding to the file data, the number of random reads can be reduced, thereby accelerating the shuffle read process and reducing disk IO (input/output) resource and network consumption.
Optionally, the number of pieces of merged data is equal to the number of Reducers. Those skilled in the art will appreciate that when the merging process completes normally, the number of pieces of merged data equals the number of Reducers; if an abnormality occurs during merging, part of the data to be processed has been merged while the rest remains unchanged, so the number of pieces of merged data is greater than the number of Reducers.
In some embodiments, after step S202 merges the to-be-processed data corresponding to the same Reducer to obtain merged data, the data processing method may further include: updating the data to be processed. That is, during the merging process the data to be processed must be kept unchanged and no update operation is executed on it; the data to be processed is updated only after the merging process completes, which prevents data inconsistency caused by abnormalities.
Optionally, each merged datum corresponds to a block identifier of a corresponding Reducer. In this way, the Reducer may map the corresponding merged data through the block identifier, that is, change the shuffle read interface, so as to adapt to the structure of the ESS provided in the embodiment of the present application.
For example, referring to the application scenario shown in fig. 1, when executor 3 (i.e., the second executor) reads the shuffle data, it needs to perform mapping according to the passed block identifier (BlockId). The original shuffle read interface passes Seq(block_id1 … block_idn), which should be mapped to the merged block id.
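The block-identifier mapping can be illustrated as follows. The per-mapper id format here mirrors Spark's `shuffle_<shuffleId>_<mapId>_<reduceId>` convention, but the merged-id naming and the mapping function itself are hypothetical assumptions for illustration:

```python
# Hypothetical sketch of the interface change: the Reducer used to ask for
# a sequence of per-Mapper block ids; after merging, that sequence should
# collapse to the single merged block id for each Reducer involved.
def map_to_merged(block_ids):
    # "shuffle_<shuffleId>_<mapId>_<reduceId>" -> one merged id per reduceId
    reduce_ids = {bid.rsplit("_", 1)[-1] for bid in block_ids}
    return sorted(f"merged_{rid}" for rid in reduce_ids)

original = ["shuffle_0_0_1", "shuffle_0_1_1", "shuffle_0_2_1"]
assert map_to_merged(original) == ["merged_1"]  # one read instead of three
```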
Further, the data to be processed may be metadata corresponding to at least one application. The metadata processing corresponding to different applications is independent and non-interfering.
Optionally, each application corresponds to a thread, and the life cycle of the thread is the same as that of the corresponding application: the thread is created when the application is registered, and destroyed after the application finishes executing. The thread regularly scans the data structures related to the corresponding application and performs the merging operation whenever new file data is formed.
In one specific implementation, the ESS maintains a thread pool, one for each newly registered application, and passes the identification (e.g., ID) of the application into the thread during the process of starting the thread; when the application execution is finished, the thread exits itself.
In some embodiments, the thread is driven by a timer: it periodically scans the data structure related to the corresponding application and performs the merging operation whenever new file data is formed. Using a timer as the driver prevents omissions, compared with using events as the driver.
Fig. 3 is a schematic flow chart of a data processing method according to another embodiment of the present application. On the basis of the foregoing embodiment, as shown in fig. 3, after S202 performs merging processing on to-be-processed data corresponding to the same Reducer to obtain merged data, the method in the embodiment of the present application may further include:
s301, monitoring whether an exception occurs in the merging processing process.
If no abnormality is detected during the merging process, S203 and S204 are executed; if an abnormality is detected during the merging process, S302 and S303 are executed.
And S302, receiving a second read operation sent by the second executor.
The second read operation is used to read the pre-merge data corresponding to the Reducer.
And S303, sending the pre-merge data corresponding to the Reducer according to the second read operation.
In this embodiment, considering that an abnormality may occur during the merging process, the merging process is monitored for abnormalities; when an abnormality occurs, the shuffle read process falls back to the existing flow, thereby ensuring that the corresponding application completes normally.
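The fallback behavior can be sketched as follows; all names and data structures are illustrative assumptions:

```python
# Sketch of the fallback described above: if merging failed, serve the
# original pre-merge files (the existing shuffle read flow); otherwise
# serve the single merged file for the Reducer.
def serve_read(reducer_id, merged, pre_merge, merge_ok):
    if merge_ok:
        return [merged[reducer_id]]   # first read operation: merged data
    return pre_merge[reducer_id]      # second read operation: fall back

merged = {0: "merged_0.data"}
pre_merge = {0: ["shuffle_0_0.data", "shuffle_1_0.data"]}
assert serve_read(0, merged, pre_merge, merge_ok=True) == ["merged_0.data"]
assert serve_read(0, merged, pre_merge, merge_ok=False) == pre_merge[0]
```

Because the pre-merge files are kept unchanged until merging completes (see the update rule above), this fallback always has a consistent copy to serve.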
The following are embodiments of the apparatus of the present application that may be used to perform embodiments of the method of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method of the present application.
Fig. 4 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application. As shown in fig. 4, a data processing apparatus 40 provided in the embodiment of the present application includes: an acquisition module 41, a processing module 42, a receiving module 43 and a sending module 44. The obtaining module 41, the receiving module 43 and the sending module 44 are respectively connected to the processing module 42.
An obtaining module 41, configured to obtain data to be processed. The data to be processed is file data written by the first executor, where the number of files corresponding to the file data is N times the number of Reducers, N is an integer greater than 1, and the Reducers are located in the second executor.
And the processing module 42 is configured to perform merging processing on the to-be-processed data corresponding to the same Reducer to obtain merged data. And the number of the merged data is less than the number of the files corresponding to the file data.
And the receiving module 43 is configured to receive the first read operation sent by the second executor. The first read operation is used to read the merged data corresponding to the Reducer.
And a sending module 44, configured to send the merged data corresponding to the Reducer according to the first read operation.
Optionally, the number of pieces of merged data is equal to the number of Reducers. Those skilled in the art will appreciate that when the merging process completes normally, the number of pieces of merged data equals the number of Reducers; if an abnormality occurs during merging, part of the data to be processed has been merged while the rest remains unchanged, so the number of pieces of merged data is greater than the number of Reducers.
After the merge process is normally completed, in some embodiments, the processing module 42 may further be configured to: and updating the data to be processed. That is, in the merging process, the data to be processed needs to be kept unchanged, the updating operation on the data is not executed, and the data to be processed after the merging process is completed is updated only, so that the inconsistency of the data caused by the abnormal phenomenon is prevented.
Wherein, each merged data corresponds to the block identifier of the corresponding Reducer. In this way, the Reducer can be mapped to the corresponding merged data through the block identifier, that is, the shuffle read interface is changed, so as to adapt to the structure of the ESS provided in the embodiment of the present application.
Further, the data to be processed may be metadata corresponding to at least one application. The metadata processing corresponding to different applications is independent and non-interfering.
Optionally, each application corresponds to a thread, and the life cycle of the thread is the same as that of the corresponding application: the thread is created when the application is registered, and destroyed after the application finishes executing. The thread regularly scans the data structures related to the corresponding application and performs the merging operation whenever new file data is formed.
In one implementation, the processing module 42 maintains a thread pool, one for each newly registered application, and transmits an identification (e.g., ID) of the application to the thread during the process of starting the thread; when the application execution is finished, the thread exits itself.
In some embodiments, the thread is driven by a timer: it periodically scans the data structure related to the corresponding application and performs the merging operation whenever new file data is formed. Using a timer as the driver prevents omissions, compared with using events as the driver.
Fig. 5 is a schematic structural diagram of a data processing apparatus according to another embodiment of the present application. As shown in Fig. 5, the data processing apparatus 50 may further include, on the basis of the structure shown in Fig. 4:
and a monitoring module 51, configured to monitor whether an exception occurs in the merging process. The monitoring module 51 is connected to the processing module 42.
Correspondingly, the receiving module 43 may be further configured to receive a second read operation sent by the second executor when the monitoring module 51 detects that an abnormality has occurred during the merge process, the second read operation being used to read the pre-merge data corresponding to the Reducer.
The sending module 44 may be further configured to send, according to the second read operation, the pre-merge data corresponding to the Reducer.
Because an abnormality may occur during the merge process, the monitoring module monitors the merging; when an abnormality occurs, the shuffle read is performed according to the existing flow, which ensures that the corresponding application completes normally.
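The fallback just described is, in essence, a branch in the read path: if merging failed, serve the Reducer's original pre-merge files via the existing flow; otherwise serve the single merged file. The sketch below is illustrative only, with all names (`serve_shuffle_read` and the file labels) assumed.

```python
# Hedged sketch of the read-path fallback on merge abnormality.

def serve_shuffle_read(reducer_id, merged_files, premerge_files, merge_failed):
    if merge_failed:
        # Abnormality during merging: fall back to the existing flow
        # and return the N pre-merge files for this Reducer.
        return premerge_files[reducer_id]
    # Normal case: a single merged file per Reducer.
    return [merged_files[reducer_id]]

premerge = {7: ["m0_r7", "m1_r7", "m2_r7"]}  # one file per mapper
merged = {7: "merged_r7"}                    # one merged file
```

Either branch returns a readable set of files, which is why an abnormal merge degrades performance but never correctness of the application.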
The data processing apparatus provided in the embodiment of the present application may be configured to execute the data processing method described in any of the above embodiments of the present application, and the implementation principle and the technical effect are similar, which are not described herein again.
Fig. 6 is a schematic structural diagram of a data processing apparatus according to yet another embodiment of the present application. As shown in fig. 6, the data processing apparatus 60 provided in the embodiment of the present application may include: a memory 61 and a processor 62. Wherein,
a memory 61 for storing program instructions.
The processor 62 is configured to call and execute the program instructions stored in the memory 61; when the processor 62 executes the program instructions, the data processing apparatus 60 implements the data processing method described in any one of the method embodiments. The implementation principle and technical effect are similar and are not described herein again.
An embodiment of the present application further provides a computer-readable storage medium, where instructions are stored in the computer-readable storage medium, and when the instructions are executed on a computer, the instructions cause the computer to execute the data processing method according to any of the embodiments of the present application, and the implementation principle and the technical effect are similar, and are not described herein again.
Those of ordinary skill in the art will understand that: all or a portion of the steps of implementing the above-described method embodiments may be performed by hardware associated with program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims (14)

1. A data processing method, comprising:
acquiring data to be processed, the data to be processed being file data written by a first executor, wherein the number of files corresponding to the file data is N times the number of Reducers, N is an integer greater than 1, and the Reducers are located in a second executor;
merging the data to be processed corresponding to the same Reducer to obtain merged data, wherein the number of the merged data is less than the number of files corresponding to the file data;
receiving a first read operation sent by the second executor, the first read operation being used to read the merged data corresponding to the Reducer; and
sending, according to the first read operation, the merged data corresponding to the Reducer.

2. The method according to claim 1, wherein the number of the merged data is equal to the number of the Reducers.

3. The method according to claim 2, wherein each merged data corresponds to the block identifier of its corresponding Reducer.

4. The method according to claim 2, wherein after merging the data to be processed corresponding to the same Reducer to obtain the merged data, the method further comprises:
updating the data to be processed.

5. The method according to claim 1, wherein the data to be processed is metadata corresponding to at least one application.

6. The method according to claim 5, wherein each application corresponds to one thread, the life cycle of the thread is the same as that of the corresponding application, and the thread is used to periodically scan the data structures related to the corresponding application, the merge operation being performed whenever new file data has been formed.

7. The method according to claim 6, wherein the thread is driven by a timer.

8. The method according to claim 1, further comprising:
if an abnormality is detected during the merge process, receiving a second read operation sent by the second executor, the second read operation being used to read the pre-merge data corresponding to the Reducer; and
sending, according to the second read operation, the pre-merge data corresponding to the Reducer.

9. A data processing apparatus, comprising:
an acquisition module, configured to acquire data to be processed, the data to be processed being file data written by a first executor, wherein the number of files corresponding to the file data is N times the number of Reducers, N is an integer greater than 1, and the Reducers are located in a second executor;
a processing module, configured to merge the data to be processed corresponding to the same Reducer to obtain merged data, wherein the number of the merged data is less than the number of files corresponding to the file data;
a receiving module, configured to receive a first read operation sent by the second executor, the first read operation being used to read the merged data corresponding to the Reducer; and
a sending module, configured to send, according to the first read operation, the merged data corresponding to the Reducer.

10. The apparatus according to claim 9, wherein the number of the merged data is equal to the number of the Reducers.

11. The apparatus according to claim 10, wherein the processing module is further configured to:
update the data to be processed after merging the data to be processed corresponding to the same Reducer to obtain the merged data.

12. The apparatus according to claim 9, further comprising:
a monitoring module, configured to monitor whether an abnormality occurs during the merge process;
wherein, correspondingly, the receiving module is further configured to receive, when the monitoring module detects that an abnormality has occurred during the merge process, a second read operation sent by the second executor, the second read operation being used to read the pre-merge data corresponding to the Reducer; and
the sending module is further configured to send, according to the second read operation, the pre-merge data corresponding to the Reducer.

13. A data processing apparatus, comprising: a memory and a processor;
wherein the memory is configured to store program instructions; and
the processor is configured to call and execute the program instructions, the data processing apparatus implementing the method according to any one of claims 1 to 8 when the processor executes the program instructions.

14. A computer-readable storage medium, wherein instructions are stored in the computer-readable storage medium, and when the instructions are run on a computer, the instructions cause the computer to execute the method according to any one of claims 1 to 8.
CN201910337777.XA 2019-04-25 2019-04-25 Data processing method, device and storage medium Active CN111857538B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910337777.XA CN111857538B (en) 2019-04-25 2019-04-25 Data processing method, device and storage medium

Publications (2)

Publication Number Publication Date
CN111857538A true CN111857538A (en) 2020-10-30
CN111857538B CN111857538B (en) 2025-02-21

Family

ID=72951209

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910337777.XA Active CN111857538B (en) 2019-04-25 2019-04-25 Data processing method, device and storage medium

Country Status (1)

Country Link
CN (1) CN111857538B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023040348A1 (en) * 2021-09-14 2023-03-23 华为技术有限公司 Data processing method in distributed system, and related system

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103023805A (en) * 2012-11-22 2013-04-03 北京航空航天大学 MapReduce system
US20130167151A1 (en) * 2011-12-22 2013-06-27 Abhishek Verma Job scheduling based on map stage and reduce stage duration
US20130219394A1 (en) * 2012-02-17 2013-08-22 Kenneth Jerome GOLDMAN System and method for a map flow worker
CN104767795A (en) * 2015-03-17 2015-07-08 浪潮通信信息系统有限公司 LTE MRO data statistical method and system based on HADOOP
US20160034205A1 (en) * 2014-08-01 2016-02-04 Software Ag Usa, Inc. Systems and/or methods for leveraging in-memory storage in connection with the shuffle phase of mapreduce
US20160366225A1 (en) * 2015-06-09 2016-12-15 Electronics And Telecommunications Research Institute Shuffle embedded distributed storage system supporting virtual merge and method thereof
CN107045422A (en) * 2016-02-06 2017-08-15 华为技术有限公司 Distributed storage method and equipment
CN107172149A (en) * 2017-05-16 2017-09-15 成都四象联创科技有限公司 Big data instant scheduling method
CN107204998A (en) * 2016-03-16 2017-09-26 华为技术有限公司 The method and apparatus of processing data
CN109388615A (en) * 2018-09-28 2019-02-26 智器云南京信息科技有限公司 Task processing method and system based on Spark



Similar Documents

Publication Publication Date Title
CN108647104B (en) Request processing method, server and computer readable storage medium
US8112559B2 (en) Increasing available FIFO space to prevent messaging queue deadlocks in a DMA environment
US9582312B1 (en) Execution context trace for asynchronous tasks
US7698602B2 (en) Systems, methods and computer products for trace capability per work unit
US20160275123A1 (en) Pipeline execution of multiple map-reduce jobs
CN111064789B (en) Data migration method and system
CN109558260B (en) Kubernetes fault elimination system, method, equipment and medium
US10075326B2 (en) Monitoring file system operations between a client computer and a file server
CN113485840A (en) Multi-task parallel processing device and method based on Go language
US8631086B2 (en) Preventing messaging queue deadlocks in a DMA environment
WO2020232951A1 (en) Task execution method and device
US9507637B1 (en) Computer platform where tasks can optionally share per task resources
CN111858077A (en) Recording method, device and equipment for IO request log in storage system
WO2021104383A1 (en) Data backup method and apparatus, device, and storage medium
CN116069765A (en) Data migration method, device, electronic equipment and storage medium
CN119025188A (en) Function cold start optimization method, device, equipment and medium
CN103729166A (en) Method, device and system for determining thread relation of program
CN111857538A (en) Data processing method, device and storage medium
CN107943567B (en) High-reliability task scheduling method and system based on AMQP protocol
CN112199168A (en) Task processing method, device and system and task state interaction method
US20250173185A1 (en) Distributed task processing method, distributed system, and first device
CN107958414B (en) Method and system for eliminating long transactions of CICS (common integrated circuit chip) system
US12019909B2 (en) IO request pipeline processing device, method and system, and storage medium
US20090217290A1 (en) Method and System for Task Switching with Inline Execution
US20070067488A1 (en) System and method for transferring data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant