CN105955819B

CN105955819B - Hadoop-based data transmission method and system

Info

Publication number: CN105955819B
Application number: CN201610243294.XA
Authority: CN
Inventors: 曹政; 郭嘉梁; 李强
Original assignee: Institute of Computing Technology of CAS
Current assignee: Institute of Computing Technology of CAS
Priority date: 2016-04-18
Filing date: 2016-04-18
Publication date: 2019-06-18
Anticipated expiration: 2036-04-18
Also published as: CN105955819A

Abstract

The invention discloses a Hadoop-based data transmission method and system. The method comprises: an intermediate result file generation step, in which one intermediate result file is established to store, at any time, the intermediate results generated by all Map tasks; an index establishment step, in which an index file is established and updated at any time according to the intermediate result file; and a transmission step, in which, when it is determined from the index file that the intermediate result file contains an untransmitted intermediate result and the corresponding Reduce task has been started, the untransmitted intermediate result is actively sent to that Reduce task. The invention shortens the execution time of Hadoop jobs by raising the degree of parallelism between Map tasks and Reduce tasks. It also improves system resource utilization and reduces the storage overhead of the system.

Description

Data transmission method and system based on Hadoop

Technical field

The present invention relates to the field of big data processing systems, and in particular to a Hadoop-based data transmission method and system.

Background art

For a clearer understanding of the present invention, several terms are first explained below:

Hadoop system: a distributed system infrastructure developed by the Apache Foundation. Users can develop distributed programs without understanding the low-level details of the distribution. Hadoop implements a distributed file system (Hadoop Distributed File System), abbreviated HDFS.

MapReduce computing framework: a software framework, based on the Hadoop distributed file system, for processing large data sets in parallel; together with HDFS it forms the two core components of Hadoop. The MapReduce computing framework runs Map tasks and Reduce tasks implemented by the user.

Fig. 1 is a schematic diagram of the data processing flow of a Hadoop system. Each Map task generates one intermediate result file; each intermediate result file contains multiple regions, and the number of regions equals the number of Reduce tasks. After a Map task has finished executing, the Reduce tasks send requests for the intermediate results generated by that Map task, merge-sort them, execute the computation logic of the Reduce task, and finally write the results to HDFS.
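To make the relationship between regions and Reduce tasks concrete, the following minimal sketch assigns each key to a region by hashing it modulo the number of Reduce tasks. Hadoop's default partitioner follows the same hash-modulo idea, but the function name `region_for` and the use of CRC32 are assumptions chosen for illustration; this document does not specify how regions are assigned.

```python
import zlib

def region_for(key: str, num_reduce_tasks: int) -> int:
    """Map a key to a region index in [0, num_reduce_tasks)."""
    # CRC32 is used only because it is deterministic across runs (an assumption,
    # not something taken from the patent text).
    return zlib.crc32(key.encode("utf-8")) % num_reduce_tasks

num_reduce_tasks = 3
regions = {k: region_for(k, num_reduce_tasks) for k in ["apple", "banana", "cherry"]}
```

Because the region depends only on the key, every record with the same key ends up in the same region and therefore at the same Reduce task.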

Although this design is still used in Hadoop systems today, it has shortcomings.

In the above method, the Reduce task must request intermediate results from the Map task. Moreover, the Reduce task has to wait passively until the Map task has finished executing before it can request the intermediate results. Consequently, even if part of the intermediate results has already been produced during the execution of a Map task, it cannot be sent to the Reduce task for processing immediately and can only wait, which lowers the operating efficiency of the system. Meanwhile, network resources sit idle because the Map task has not completed, and the Reduce task, lacking input data for the same reason, cannot carry out its subsequent computation. Thus several parts of the system are forced into idle waiting.

Fig. 2 is a schematic diagram of how the intermediate results of Map tasks are stored in a Hadoop system. Each Map task has its own ring buffer. Each Map task (Map1, Map2, Map3) first stores the intermediate results it generates, of the form <keyword, value, region>, in its ring buffer. When the usage of the ring buffer reaches a threshold, the data in the ring buffer are sorted by region, the data within each region are sorted by keyword, and the sorted result is written to a temporary spill file. After the Map task has finished executing, all spill files of that Map task are merge-sorted and combined into one intermediate result file. Thus each Map task generates its own intermediate result file.

This way of storing intermediate results causes every server to generate a large number of intermediate result files. If the cluster has 1000 nodes and must process 100 TB of data, 1,000,000 Map tasks are generated, so each node runs 1000 Map tasks on average. Each node may therefore need to have 1000 intermediate result files open at the same time for data transmission, which makes the storage overhead of the system high.
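The arithmetic behind this example can be checked with a few lines. The node count and Map task count come from the text; reading "100T" as 100 decimal terabytes is an assumption, and the variable names are illustrative only.

```python
nodes = 1000
map_tasks = 1_000_000
data_bytes = 100 * 10**12  # assumption: "100T" read as 100 decimal terabytes

tasks_per_node = map_tasks // nodes        # Map tasks each node runs on average
input_per_task = data_bytes // map_tasks   # implied input size per Map task, in bytes

# One intermediate result file per Map task means one open file per task.
open_files_per_node = tasks_per_node
print(tasks_per_node, input_per_task, open_files_per_node)
```

Under these assumptions each Map task handles about 100 MB of input, and a node that serves all of its Map outputs at once holds 1000 files open.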

Patent document 1 (publication number CN102209087A) discloses a method for MapReduce data transmission in a data center with a storage area network (SAN). The data center includes multiple servers on which a Job Server, Map task servers and Reduce task servers are deployed. The method comprises: in response to receiving a Map task assigned by the Job Server, the Map task server executes the Map task and generates a Map task output result; the Map task server writes the Map task output result to the storage network; and, in response to receiving a Reduce task assigned by the Job Server, the Reduce task server reads the Map task output result from the storage network. However, that patent focuses on exploiting the network bandwidth of the storage network to speed up the transmission of intermediate results; it still uses the traditional mode in which the Reduce task requests the intermediate results, and therefore has the problems described above.

Summary of the invention

The technical problem solved by the present invention is to provide a Hadoop-based data transmission method and system in which intermediate results can be transmitted while Map tasks are executing, raising the degree of parallelism between Map tasks and Reduce tasks.

Further, the present invention improves system resource utilization: Reduce tasks can execute their subsequent computation logic as early as possible, so that computing resources and network resources are fully used.

Further, the present invention generates only one intermediate result file on each server, which reduces the number of files opened when transmitting intermediate results and lowers the storage overhead of the system.

To solve the above problems, the invention discloses a Hadoop-based data transmission method, comprising:

An intermediate result file generation step: establishing one intermediate result file to store, at any time, the intermediate results generated by all Map tasks;

An index establishment step: establishing an index file, and updating the index file at any time according to the intermediate result file;

A transmission step: when it is determined from the index file that an untransmitted intermediate result exists in the intermediate result file and the corresponding Reduce task has been started, actively sending the untransmitted intermediate result to that Reduce task.

The index file holds one index for each intermediate result; the index records the corresponding Reduce task information, a transmission flag bit, and the offset of the intermediate result in the intermediate result file.

The transmission step further comprises: after the Reduce task has successfully received the intermediate result, updating the transmission flag bit in the index corresponding to that intermediate result to "transmitted".

The intermediate result file generation step further comprises:

Step 10: each Map task stores the intermediate results it generates in a buffer;

Step 20: when the usage of the buffer reaches a buffer threshold, all intermediate results stored in the buffer are merge-sorted, and the sorted intermediate results are stored in a temporary spill file;

Step 30: when the number of temporary spill files reaches a spill file threshold, the intermediate results in all temporary spill files are merge-sorted, and the sorted intermediate results are stored in the intermediate result file.

The intermediate result comprises a keyword, a value and a region. The merge sort in both step 20 and step 30 sorts by region, and within a region sorts by keyword.
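The two-level ordering (region first, then keyword within a region) can be illustrated with a short sketch. The `Record` type and the helper names are hypothetical; the patent describes only the ordering, not an implementation. Python's built-in `sorted` stands in for the sort of step 20 and `heapq.merge` for the merge of step 30.

```python
import heapq
from typing import NamedTuple

class Record(NamedTuple):
    keyword: str
    value: int
    region: int

def sort_key(r: Record):
    # Region first, then keyword within the region.
    return (r.region, r.keyword)

def sort_buffer(records):
    """Step 20: order the buffer contents before they are spilled."""
    return sorted(records, key=sort_key)

def merge_spills(spills):
    """Step 30: merge several already-sorted spill runs into one sorted run."""
    return list(heapq.merge(*spills, key=sort_key))

spill_a = sort_buffer([Record("pear", 1, 1), Record("fig", 2, 0), Record("kiwi", 3, 1)])
spill_b = sort_buffer([Record("apple", 4, 1), Record("date", 5, 0)])
merged = merge_spills([spill_a, spill_b])
```

After the merge, all records of region 0 precede those of region 1, and each region is in keyword order, so the records destined for one Reduce task sit next to each other.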

The invention also discloses a Hadoop-based data transmission system, comprising:

An intermediate result file generation unit, configured to establish one intermediate result file to store, at any time, the intermediate results generated by all Map tasks;

An index establishment unit, configured to establish an index file and to update the index file at any time according to the intermediate result file;

A transmission unit, configured to actively send an untransmitted intermediate result to the corresponding Reduce task when it is determined from the index file that the untransmitted intermediate result exists in the intermediate result file and that Reduce task has been started.

The index file holds one index for each intermediate result; the index records the corresponding Reduce task information, a transmission flag bit, and the offset of the intermediate result in the intermediate result file.

The transmission unit further comprises an updating unit, configured to update the transmission flag bit in the index corresponding to the intermediate result to "transmitted" after the Reduce task has successfully received the intermediate result.

The intermediate result file generation unit further comprises:

A first storage unit, configured to have each Map task store the intermediate results it generates in a buffer;

A second storage unit, configured to merge-sort all intermediate results stored in the buffer when the usage of the buffer reaches the buffer threshold, and to store the sorted intermediate results in a temporary spill file;

A third storage unit, configured to merge-sort the intermediate results in all temporary spill files when the number of temporary spill files reaches the spill file threshold, and to store the sorted intermediate results in the intermediate result file.

The intermediate result comprises a keyword, a value and a region. The merge sort in both the second storage unit and the third storage unit sorts by region, and within a region sorts by keyword.

The invention also discloses a distributed file system comprising the Hadoop-based data transmission system.

The beneficial effects of the present invention are:

1. Shorter execution time of Hadoop jobs: by increasing the concurrency between subtasks, the daemon process transmits intermediate results while Map tasks are executing, so that the degree of parallelism between Map tasks and Reduce tasks is higher.

2. Improved system resource utilization: intermediate results can be transmitted while Map tasks are executing, and Reduce tasks can execute their subsequent computation logic as early as possible, so that computing resources and network resources are fully used.

3. Reduced system storage overhead: only one intermediate result file is generated on each server, which reduces the number of files opened when transmitting intermediate results.

4. Through the multi-level buffering of intermediate results and the repeated merge sorting of the present invention, the data finally stored in the intermediate result file are ordered. On this basis, the intermediate results destined for the same Reduce task can be collected and then sent together, which improves transmission efficiency.

Brief description of the drawings

Fig. 1 is a schematic diagram of the data processing flow of a Hadoop system;

Fig. 2 is a schematic diagram of how the intermediate results of Map tasks are stored in a Hadoop system;

Fig. 3 is a flow chart of a Hadoop-based data transmission method of the present invention;

Fig. 4A is a flow chart of the method by which Map tasks store intermediate results in the present invention;

Fig. 4B is a schematic diagram of the structure of the index file of the present invention;

Fig. 5 is a detailed flow chart of the Hadoop-based data transmission method of the present invention;

Fig. 6 is a flow chart of a Reduce task receiving intermediate results in the present invention.

Detailed description of the embodiments

To make the objectives, technical solutions and advantages of the present invention clearer, the Hadoop-based data transmission method and system of the invention are described below with reference to the accompanying drawings. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it.

To reduce the waiting time of Reduce tasks, raise the degree of parallelism between Map tasks and Reduce tasks, and at the same time lower the storage overhead of the system, the present invention improves on the approach described in the background art, in which each Map task generates its own intermediate result file. Instead, all Map tasks running on the same server store their intermediate results in the same intermediate result file. Together with an index file, this allows intermediate results to be transmitted while Map tasks are executing: as intermediate results are continuously generated, they are actively pushed to the Reduce tasks, which raises the degree of parallelism between Map tasks and Reduce tasks and improves system resource utilization.

Fig. 3 is a flow chart of a Hadoop-based data transmission method of the present invention, which comprises the following steps:

Specifically, the present invention stores the intermediate results generated by all Map tasks on a server in the same intermediate result file. As the Map tasks continue to execute, the intermediate results they generate are stored in the intermediate result file in the order in which they were generated.

In another preferred embodiment, the intermediate results may instead be stored as follows. Fig. 4A shows the method by which Map tasks store intermediate results in the present invention, comprising the following steps:

Step 410, store intermediate results in the buffer: the Map task stores the intermediate results it generates, of the form <keyword, value, region>, in a buffer.

Step 420, store intermediate results in a spill file: when the usage of the buffer reaches the buffer threshold, all intermediate results in the buffer are merge-sorted and the sorted intermediate results are stored in a temporary spill file. The merge sort first sorts by the "region" field of the intermediate results; intermediate results in the same region are then sorted by the "keyword" field.

Step 430, store intermediate results in the intermediate result file: when the number of temporary spill files reaches the temporary spill file threshold, the intermediate results in all temporary spill files are merge-sorted and the sorted intermediate results are appended to the intermediate result file. The merge sort again first sorts by the "region" field of the intermediate results; intermediate results in the same region are then sorted by the "keyword" field.

As can be seen, as the Map tasks continue to execute, intermediate results are generated continuously, and through the multi-level buffering and repeated merge sorting described above they flow continuously into the intermediate result file.
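Steps 410 to 430 can be sketched as a small in-memory pipeline. The class `IntermediateStore`, its thresholds counted in records and spills, and the use of Python lists in place of on-disk files are all assumptions made for illustration; the patent does not prescribe these details.

```python
import heapq

class IntermediateStore:
    """Hypothetical per-server store shared by all Map tasks."""

    def __init__(self, buffer_threshold=4, spill_threshold=2):
        self.buffer_threshold = buffer_threshold
        self.spill_threshold = spill_threshold
        self.buffer = []             # step 410: unsorted records
        self.spills = []             # step 420: each spill is a sorted run
        self.intermediate_file = []  # step 430: stands in for the on-disk file

    @staticmethod
    def _order(record):
        keyword, _value, region = record
        return (region, keyword)     # region first, then keyword

    def emit(self, keyword, value, region):
        self.buffer.append((keyword, value, region))
        if len(self.buffer) >= self.buffer_threshold:
            self._spill()

    def _spill(self):
        self.spills.append(sorted(self.buffer, key=self._order))
        self.buffer = []
        if len(self.spills) >= self.spill_threshold:
            self._merge_spills()

    def _merge_spills(self):
        merged = heapq.merge(*self.spills, key=self._order)
        self.intermediate_file.extend(merged)  # appended, not rewritten
        self.spills = []

store = IntermediateStore(buffer_threshold=2, spill_threshold=2)
for kw, val, reg in [("d", 1, 1), ("a", 2, 0), ("c", 3, 0), ("b", 4, 1)]:
    store.emit(kw, val, reg)
```

Each merged run is appended to the shared file while Map tasks are still producing output, which is what makes the data available for transmission before the Map tasks finish.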

The present invention also establishes an index file, which holds an index for every intermediate result in the intermediate result file. Thus, whenever an intermediate result is added to the intermediate result file, a corresponding index is added to the index file for it.

Fig. 4B is a schematic diagram of the structure of the index file.

Each index records: the corresponding Reduce task information, a transmission flag bit, and the offset of the intermediate result in the intermediate result file.

Specifically, since the intermediate result file contains multiple regions and each region corresponds to one Reduce task, the corresponding Reduce task information can be determined from the "region" field recorded in the intermediate result. In Fig. 4B, the Reduce process number is this Reduce task information.

The transmission flag bit marks whether the intermediate result corresponding to the index has been transmitted to the Reduce task. Whenever an intermediate result enters the intermediate result file and its index is generated, the transmission flag bit is set to "untransmitted". In this embodiment, "0" denotes "untransmitted" and "1" denotes "transmitted".

The offset of the intermediate result in the intermediate result file gives the actual storage location of that intermediate result in the file; the intermediate result can be located in the intermediate result file through this offset.
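A minimal sketch of such an index follows. The three fields `reduce_id`, `transmitted` and `offset` correspond to the fields named in the text; the extra `length` field, the tab-separated record encoding and the in-memory `io.BytesIO` file are assumptions added so the example can read a record back.

```python
import io
from dataclasses import dataclass

@dataclass
class IndexEntry:
    reduce_id: int    # Reduce process number, derived from the region
    transmitted: int  # 0 = untransmitted, 1 = transmitted
    offset: int       # byte offset of the record in the intermediate result file
    length: int       # assumption: stored so the record can be read back

def append_record(data_file, index, keyword, value, region):
    """Append one record to the data file and add its index entry."""
    payload = f"{keyword}\t{value}\t{region}\n".encode("utf-8")
    data_file.seek(0, io.SEEK_END)
    offset = data_file.tell()
    data_file.write(payload)
    index.append(IndexEntry(region, 0, offset, len(payload)))

def read_record(data_file, entry):
    """Locate a record through the offset stored in its index entry."""
    data_file.seek(entry.offset)
    return data_file.read(entry.length).decode("utf-8").rstrip("\n")

data_file = io.BytesIO()
index = []
append_record(data_file, index, "apple", 3, 0)
append_record(data_file, index, "pear", 7, 1)
second = read_record(data_file, index[1])
```

Every new entry starts with its flag at 0, so a scan of the index alone is enough to find what still has to be sent.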

Fig. 5 is a detailed flow chart of the Hadoop-based data transmission method of the present invention. The present invention installs a daemon process in Hadoop, and the method of Fig. 5 is executed by this daemon process.

Step 11: Map tasks are executed, generating the intermediate result file and the index file for that intermediate result file; the structure of the index file is shown in Fig. 4B;

Step 12: the daemon process traverses the index file and checks the transmission flag bits of all intermediate results;

Step 13: if all Map tasks on the server have finished executing and all intermediate results have been transmitted, go to step 18; otherwise, go to step 14;

Step 14: if the daemon process finds an intermediate result that has not yet been transmitted and whose corresponding Reduce task has been started, go to step 15; otherwise, go to step 12;

Step 15: the daemon process actively sends the untransmitted intermediate result to the Reduce task;

Guided by the index file, the daemon process finds in the intermediate result file the intermediate results whose transmission flag bit is 0 and sends them to the Reduce task corresponding to the Reduce process number;

Step 16: if the daemon process receives a response message, go to step 17; otherwise, go to step 12;

Step 17: the daemon process changes the transmission flag bit, in the index file, of the intermediate result transmitted in step 15 to "transmitted", and goes to step 12;

After the daemon process receives the response message from the Reduce task, it sets the transmission flag bit of that intermediate result in the index file to 1;

Step 18: the daemon process exits.
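The control flow of steps 12 to 18 can be sketched as a polling loop. Everything here is a simplified, hypothetical stand-in: the index is a list of dictionaries, the callables `maps_finished`, `started_reduces` and `send` replace real cluster state and network I/O, and `max_passes` exists only so the sketch always terminates. Unlike the flow chart, which returns to step 12 after each transmission, this sketch handles all pending entries in one pass.

```python
def run_daemon(index, maps_finished, started_reduces, send, max_passes=100):
    """Hypothetical in-memory version of steps 12 to 18.

    index: list of dicts with keys 'reduce_id', 'transmitted', 'record'
    maps_finished: callable returning True once all Map tasks are done
    started_reduces: callable returning the set of started Reduce ids
    send: callable(reduce_id, record) -> True if a response message arrived
    """
    for _ in range(max_passes):
        # Step 12: scan the transmission flag bits.
        pending = [e for e in index if e["transmitted"] == 0]
        # Step 13: exit once the Map tasks are done and nothing is pending.
        if maps_finished() and not pending:
            return True  # step 18
        started = started_reduces()
        for entry in pending:
            # Step 14: push only to Reduce tasks that have been started.
            if entry["reduce_id"] not in started:
                continue
            # Steps 15 and 16: send, then check for the response message.
            if send(entry["reduce_id"], entry["record"]):
                entry["transmitted"] = 1  # step 17
    return False

index = [
    {"reduce_id": 0, "transmitted": 0, "record": ("a", 1)},
    {"reduce_id": 1, "transmitted": 0, "record": ("b", 2)},
]
received = []
passes = {"n": 0}

def started_reduces():
    # In this toy run, Reduce task 1 only becomes available on the second pass.
    passes["n"] += 1
    return {0} if passes["n"] == 1 else {0, 1}

def send(reduce_id, record):
    received.append((reduce_id, record))
    return True

done = run_daemon(index, lambda: True, started_reduces, send)
```

An entry whose Reduce task has not started, or whose transmission gets no response, keeps its flag at 0 and is simply retried on a later pass.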

Correspondingly, the process by which a Reduce task receives intermediate results is as follows: the Reduce task receives the intermediate results sent by the daemon process and sends a message of successful receipt to the daemon process; once the Reduce task has successfully received all intermediate results, it carries out the subsequent computation. The details are shown in Fig. 6 and comprise the following steps:

Step 21: the Reduce task receives an intermediate result;

Step 22: if the Reduce task has successfully received the intermediate result, go to step 24; otherwise, go to step 23;

Step 23: the Reduce task sends no response;

Step 24: the Reduce task returns a successful-receipt message to the daemon process that sent the intermediate result;

Step 25: if the Reduce task has successfully received all intermediate results, go to step 26; otherwise, go to step 21;

Step 26: the Reduce task carries out the subsequent computation.
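The receiving side of steps 21 to 26 can be sketched in the same simplified style. The class `ReduceReceiver`, the `intact` flag that simulates a failed receipt, the assumption that the task knows how many results to expect, and the word-count style summation are all illustrative choices not taken from the patent.

```python
class ReduceReceiver:
    """Hypothetical Reduce-side receiver for steps 21 to 26."""

    def __init__(self, expected_count):
        # Assumption: the task knows how many intermediate results to expect.
        self.expected_count = expected_count
        self.received = []

    def on_message(self, record, intact=True):
        """Return True as the receipt message, or None for no response."""
        if not intact:       # step 22: receipt failed
            return None      # step 23: send no response
        self.received.append(record)
        return True          # step 24: confirm successful receipt

    def ready(self):
        # Step 25: have all intermediate results arrived?
        return len(self.received) == self.expected_count

    def reduce(self):
        # Step 26: a simple per-keyword sum stands in for the real logic.
        totals = {}
        for keyword, value in self.received:
            totals[keyword] = totals.get(keyword, 0) + value
        return totals

r = ReduceReceiver(expected_count=3)
acks = [
    r.on_message(("a", 1)),
    r.on_message(("b", 2), intact=False),  # lost: the sender will retry
    r.on_message(("b", 2)),
    r.on_message(("a", 5)),
]
result = r.reduce() if r.ready() else None
```

Staying silent on a failed receipt leaves the sender's flag at 0, so the same intermediate result is sent again later without any extra error protocol.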

From the above it can be seen that the beneficial effects of the present invention are:

4. Through the storage scheme of Fig. 4A, with multi-level buffering and repeated merge sorting of intermediate results, the data finally stored in the intermediate result file are ordered. On this basis, step 15 of Fig. 5 can collect the intermediate results destined for the same Reduce task and then send them together, which improves transmission efficiency.
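One way to picture this batching is to group the untransmitted index entries by Reduce task before sending. The tuple layout `(reduce_id, transmitted, offset)` and the function name are assumptions for illustration. Because each sorted run appended to the file keeps the entries of one region adjacent, `itertools.groupby` finds them as contiguous groups; groups from different runs are accumulated into the same batch.

```python
from itertools import groupby

def batches_by_reduce(index):
    """Collect the offsets of untransmitted results, one batch per Reduce task.

    index: list of (reduce_id, transmitted, offset) tuples in file order.
    """
    pending = [e for e in index if e[1] == 0]
    batches = {}
    for reduce_id, group in groupby(pending, key=lambda e: e[0]):
        batches.setdefault(reduce_id, []).extend(e[2] for e in group)
    return batches

# Two appended runs; the first entry has already been transmitted.
index = [(0, 1, 0), (0, 0, 10), (1, 0, 25), (0, 0, 40), (1, 0, 52)]
batches = batches_by_reduce(index)
```

Each batch can then be sent to its Reduce task in one transfer instead of one transfer per intermediate result.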

The present invention has been described in detail above. Specific examples are used herein to explain the principle and embodiments of the invention; the description of the embodiments above is intended only to help in understanding the method of the invention and its core idea. Meanwhile, for those of ordinary skill in the art, the specific embodiments and the scope of application may be varied in accordance with the idea of the invention. In summary, the contents of this specification should not be construed as limiting the invention.

Claims

1. A Hadoop-based data transmission method, characterized in that the method comprises:

An intermediate result file generation step: establishing one intermediate result file to store, at any time, the intermediate results generated by all Map tasks; this specifically comprises: each Map task stores the intermediate results it generates in a buffer; when the usage of the buffer reaches a buffer threshold, all intermediate results stored in the buffer are merge-sorted and the sorted intermediate results are stored in a temporary spill file; when the number of temporary spill files reaches a spill file threshold, the intermediate results in all temporary spill files are merge-sorted and the sorted intermediate results are stored in the intermediate result file;

An index establishment step: establishing an index file, wherein the index file holds one index for each intermediate result, and the index records the corresponding Reduce task information, a transmission flag bit, and the offset of the intermediate result in the intermediate result file; the index file is updated at any time according to the intermediate result file;

A transmission step: when it is determined from the index file that an untransmitted intermediate result exists in the intermediate result file and the corresponding Reduce task has been started, actively sending the untransmitted intermediate result to that Reduce task.

2. The method of claim 1, wherein the transmission step further comprises:

After the Reduce task successfully receives the intermediate result, the transmission flag bit in the index corresponding to the intermediate result is updated to "transmitted".

3. The method of claim 1, wherein the intermediate result comprises a keyword, a value and a region, and the merge sort in the intermediate result file generation step sorts by region, and within a region sorts by keyword.

4. A Hadoop-based data transmission system, characterized in that the system comprises:

An intermediate result file generation unit, configured to establish one intermediate result file to store, at any time, the intermediate results generated by all Map tasks; it comprises a first storage unit, a second storage unit and a third storage unit, wherein the first storage unit is configured to have each Map task store the intermediate results it generates in a buffer; the second storage unit is configured to merge-sort all intermediate results stored in the buffer when the usage of the buffer reaches a buffer threshold, and to store the sorted intermediate results in a temporary spill file; the third storage unit is configured to merge-sort the intermediate results in all temporary spill files when the number of temporary spill files reaches a spill file threshold, and to store the sorted intermediate results in the intermediate result file;

An index establishment unit, configured to establish an index file, wherein the index file holds one index for each intermediate result, and the index records the corresponding Reduce task information, a transmission flag bit, and the offset of the intermediate result in the intermediate result file; the index file is updated at any time according to the intermediate result file;

A transmission unit, configured to actively send an untransmitted intermediate result to the corresponding Reduce task when it is determined from the index file that the untransmitted intermediate result exists in the intermediate result file and that Reduce task has been started.

5. The system according to claim 4, wherein the transmission unit further comprises an updating unit, configured to update the transmission flag bit in the index corresponding to the intermediate result to "transmitted" after the Reduce task has successfully received the intermediate result.

6. The system of claim 4, wherein the intermediate result comprises a keyword, a value and a region, and the merge sort in both the second storage unit and the third storage unit sorts by region, and within a region sorts by keyword.

7. A distributed file system, comprising the Hadoop-based data transmission system according to any one of claims 4 to 6.