CN103345514B - Streaming data processing method under big data environment - Google Patents
- Publication number
- CN103345514B CN103345514B CN201310287554.XA CN201310287554A CN103345514B CN 103345514 B CN103345514 B CN 103345514B CN 201310287554 A CN201310287554 A CN 201310287554A CN 103345514 B CN103345514 B CN 103345514B
- Authority
- CN
- China
- Prior art keywords
- data
- streaming data
- result
- big data
- streaming
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Abstract
Description
Technical Field

The present invention relates to cloud computing, in particular to big data processing and streaming data processing, and specifically to a streaming data processing method in a big data environment for practical applications.
Background Art

In recent years, with the rapid development of network applications such as portal websites, social networks, and e-commerce, and with the continuous growth and extension of their business, a large volume of business data has been generated and accumulated. These data are characterized by large total volume, diverse structure, and a high growth rate; they are typical big data.

On the other hand, users continuously access and use these network applications to obtain the services they need, producing a series of real-time streaming data. To meet users' real-time service requirements, network applications must not only analyze and process large volumes of historical data but also process the real-time streaming data quickly. Such an application scenario, in which streaming data is processed quickly on top of big data, is a typical big data service.

Compared with general data services, this type of big data service has special properties. First, the business data is big data, while newly arriving streaming data is small in scale and simple in structure. Second, the data stream arrives continuously, so the business data grows continuously and is updated periodically. Finally, the streaming data must be processed quickly on top of the big data in order to provide efficient service.

For example, an e-commerce real-time recommendation system stores information on a large number of commodities, together with users' registration, search, favorites, and purchase records; these are called historical data. At the same time, the real-time access of a large number of users produces continuously arriving service request data. Real-time recommendation requires both analysis of the historical data and fast processing of the real-time streaming data; only the combination of the two achieves effective real-time recommendation. Therefore, real-time streaming data processing techniques for the big data environment need to be studied in order to support this class of big data application services.

To process big data, Google first proposed the MapReduce batch processing method. Data processing methods based on the MapReduce framework are oriented toward batch processing and do not support streaming data.

Existing streaming data processing systems tend to target specific application environments. For example, S4 (Simple Scalable Streaming System) is a distributed streaming data processing system inspired by MapReduce, mainly used for practical applications such as search, error detection, and online social matching. To avoid system complexity, S4 handles only streaming data.

At present, there is no effective method for streaming data processing in a big data environment.
Summary of the Invention

Technical problem: As described above, an effective streaming data processing system for the big data environment is lacking. The present invention builds an efficient, real-time streaming data processing framework on top of MapReduce. It mainly comprises three modules: a localized, redundancy-free storage and processing mechanism for data; pipelined scheduling of the Map- and Reduce-related threads; and an in-memory storage mechanism for data and intermediate results.

Technical solution: The streaming data processing method in a big data environment of the present invention comprises the following steps:

1) Process the accumulated big data, i.e., the historical data, to generate an intermediate result set; partition this result set and cache the partitions across the computing nodes.

2) Each computing node periodically receives all of the streaming data and produces intermediate results through Map processing.

3) Using the intermediate result partitioning method, filter out the intermediate results belonging to this node and cache them locally; once the threshold of 10,000 records is reached, form a shard and send it.

4) When an intermediate result shard arrives, feed the historical data intermediate results together with the shard into Reduce, according to the pipeline scheduling algorithm.

5) Output the computation result. Each such result is a partial output of the task at a different time; all of these results are merged into the same file to form the final output.
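The five steps above can be sketched as a minimal word-count style loop. Everything named here (map_fn, run_reduce, the key ranges, and the tiny shard threshold, which stands in for the patent's 10,000) is an illustrative assumption, not the patented implementation:

```python
from collections import defaultdict

KEY_RANGES = {0: ("a", "m"), 1: ("n", "z")}  # assumed per-node key intervals
NODE_ID = 0                                   # this node's id
SHARD_THRESHOLD = 3                           # records per shard before sending

def map_fn(record):
    for word in record.split():               # step 2: Map each stream record
        yield word, 1

def belongs_here(key):
    lo, hi = KEY_RANGES[NODE_ID]              # step 3: keep only local keys
    return lo <= key[:1] <= hi

def run_reduce(shard, historical):
    merged = {k: list(vs) for k, vs in shard.items()}
    for k, vs in historical.items():          # step 4: shard + history -> Reduce
        merged.setdefault(k, []).extend(vs)
    return [(k, sum(vs)) for k, vs in merged.items()]

def process_stream(stream, historical_intermediate):
    shard, partials = defaultdict(list), []
    for record in stream:
        for k, v in map_fn(record):
            if belongs_here(k):
                shard[k].append(v)
        if sum(len(vs) for vs in shard.values()) >= SHARD_THRESHOLD:
            partials.append(run_reduce(shard, historical_intermediate))
            shard = defaultdict(list)         # step 3: send shard, start a new one
    if shard:
        partials.append(run_reduce(shard, historical_intermediate))
    final = defaultdict(int)                  # step 5: merge partial outputs
    for part in partials:
        for k, v in part:
            final[k] += v
    return dict(final)
```

Node 0 here keeps only keys starting with "a" through "m" and drops the rest, mirroring the per-node filtering of step 3.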
The accumulated big data in step 1), that is, the historical data, is backed up on a distributed file system. Before the system starts, or before a computing task begins, this data must be read and preprocessed, then stored in distributed fashion across the computing nodes in preparation for the computation that follows.

In step 2), each computing node periodically receives all of the streaming data; every node caches this data and generates intermediate results by processing it. Depending on the specific historical data backup method, this streaming data must also be merged into the historical data set periodically.

In the intermediate result partitioning method of step 3), each node corresponds to a certain key-value interval and processes only the data within that interval: from all of the received streaming data for which intermediate results have been generated, it filters out the intermediate results that fall in its interval. When the specified threshold is reached, an intermediate result shard is formed and sent.

In step 4), after an intermediate result shard arrives, it is combined with the intermediate result group from step 1) as the input of the next Reduce task. Because one computing task generally produces multiple shards in step 3), a thread must be dispatched for each such Reduce task, and the computations run asynchronously.

Step 5) outputs the computation result produced by one Reduce computation. When all Reduce tasks have completed, their results are merged and a single final result file is output.
Beneficial effects: The advantages of the present invention are as follows.

By adding three modules to the existing Hadoop platform to support fast processing of streaming data in a big data environment, applications such as real-time e-commerce recommendation can be supported.

Compared with the prior art, the invention has the following advantages:

1. By integrating static big data processing with real-time streaming data processing, the invention provides a reference for data processing methods lying between the two classes of applications.

2. Partitioning the data set with a probabilistic model balances the system load, effectively increasing system throughput and offering a new approach to optimizing the parallel processing of big data.

3. Pipelining processes the data in batches, refining the granularity of data processing and speeding up computation, which suits tasks demanding a high response ratio.

4. The added memory management supports the effective operation of the two modules above and enhances the system's adaptability to data of different scales.
Brief Description of the Drawings

Figure 1 is the processing flow chart.

Figure 2 is the system architecture diagram.

Figure 3 is the memory structure diagram.

Detailed Description
Streaming data processing in the big data environment is built on the existing Hadoop, a framework implementing MapReduce, by adding three modules to its existing functionality to support the processing of data streams. The approach is especially suited to real-time stream processing on top of a large volume of structurally complex historical data. The three modules are the data localization module, the pipeline scheduling module, and the memory management module, implemented as follows.

In the data localization module, key-value intervals are usually partitioned with a hash function so that the nodes store data without redundancy. To keep the partition balanced, a method based on probability statistics is used here, so that the data approximately follows a uniform distribution. The main steps are:
Step 1: Randomly collect some historical or streaming data as a sample and sort it by key; if the key is a string, sort by its encoded value.

Step 2: Count the frequency of each key in the sorted sample.

Step 3: From the number of computing nodes N, derive each node's ideal load factor, generally the reciprocal 1/N.

Step 4: Using the key frequencies, assign keys from the sorted list consecutively to N sets, keeping each set's frequency sum as close to the load factor as possible.

Step 5: Each computing node receives the keys of one set, which form its key interval (set).

Step 6: Each node receives or processes only the data within its key interval.
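Steps 1 through 6 can be sketched as a greedy sweep over the sorted keys. Reading "assign keys consecutively" as a greedy cut, and all names below, are assumptions; the patent does not reproduce the exact listing:

```python
from collections import Counter

def partition_keys(sample_keys, n_nodes):
    freq = Counter(sample_keys)              # step 2: key frequencies
    total = sum(freq.values())
    target = total / n_nodes                 # step 3: ideal load is 1/N of total
    sets, current, load = [], [], 0.0
    for key in sorted(freq):                 # step 1: keys in sorted order
        current.append(key)
        load += freq[key]
        # step 4: close this set once its share reaches the target,
        # leaving at least one set for each remaining node
        if load >= target and len(sets) < n_nodes - 1:
            sets.append(current)
            current, load = [], 0.0
    sets.append(current)                     # the last node takes the rest
    return sets                              # step 5: one key set per node

# e.g. 12 sample keys spread over 3 nodes:
ranges = partition_keys(list("aaabbcddddee"), 3)
```

Each node then applies step 6 by dropping every record whose key is outside its own set.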
In the pipeline scheduling module, computation is accelerated by distributing the intermediate data (map output) and the computation results (reduce output) asynchronously; this asynchronous process must be controlled according to the system load. Here, the distribution rate and the dispatch of computing tasks are regulated by monitoring the system's runtime parameters. The main steps are:

Step 1: Monitor the number of intermediate result shards of the streaming data, i.e., the number of intermediate results cached after each pass of map processing.

Step 2: Monitor the number of intermediate results already distributed, i.e., the number of intermediate results still in the cache and not yet processed by the reduce stage.

Step 3: Count the number of map threads and reduce threads the system has allocated, as well as the maximum number of threads the platform grants the system.

With these parameters, the system adjusts the execution rate of each task through the following steps, maximizing execution speed while ensuring that resources do not overflow:

Step 1: When the number of threads the system can allocate is limited, if the data stream arrives slowly enough that the map workload is light, proceed to step 2; otherwise go to step 3.

Step 2: If the reduce stage consumes intermediate results faster than map produces them, shrink the map intermediate result cache; otherwise, enlarge the reduce-stage cache, or moderately increase the number of reduce threads.

Step 3: Increase the number of map threads as much as possible to avoid data loss, and enlarge the map cache.

Step 4: Return to step 1.

Generally, for lightweight jobs the system load is light and threads are plentiful, so these caches and the limits on dispatching map and reduce tasks can be removed, letting the system respond to data processing immediately.
In the memory management module, storage space is expanded by combining memory and external storage (mainly disk), ensuring that the intermediate data cache is scalable while data can still be looked up and read quickly. This mainly targets heavy system workloads with large data volumes, and comprises the following steps:

Step 1: Build an intermediate result index area using a hash index on the key, and keep it resident in memory; each hash pointer points to an information header structure.

Step 2: Build an intermediate result buffer in memory, whose size can be configured according to the available memory; the information header of step 1 looks up data in this buffer and returns the result.

Step 3: When the intermediate results are large and memory is insufficient, build a storage area in external storage. The external area likewise has an index area and a data area, but no buffer area.

Step 4: When the information header cannot find the corresponding data in the buffer, the lookup continues in the external storage area.

In a real system, the intermediate results produced are generally key-value pairs. If the buffer still has space, the intermediate result is stored there directly, the index area entry is updated, and an information header is allocated; otherwise, the data is written to external storage and the information header points there.

In typical applications, the historical data set is relatively large, so it can be stored in external storage with a memory/disk swap area, while real-time stream data is kept in memory; ready-made algorithms and data structures exist for these techniques.
The present invention is described in further detail below with reference to the accompanying drawings and embodiments.

The basic processing flow of the invention is shown in Figure 1, and the invention has a clear layered structure, shown in Figure 2. Three modules are added: localization, pipelining, and memory management. Memory management mainly serves the other two modules to ensure reliable data storage. Adding these three modules under the Hadoop framework provides support for data streams on top of big data processing. The implementation is as follows.

In the data localization module, the data processed are generally <key, value> pairs, whose value intervals are partitioned by the bytecode of the key using a probabilistic statistical model: samples are drawn at random from the historical data and part of the data stream, the frequency of each key is analyzed, and the intervals are chosen so that the keys are distributed approximately uniformly. After partitioning, node Ni corresponds to key interval Si. Assuming the number of nodes is N, the interval partitioning algorithm is as follows:

After the intervals are partitioned, each node processes only the <key, value> pairs in its own interval. Thus, when node k processes an item <key(x), value(y)> in the data stream, it decides whether to filter the item by checking whether key(x) lies in the interval array[k].
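The per-node membership test described above can be sketched with an array of interval boundaries and a binary search; the boundary values and names here are hypothetical:

```python
import bisect

# boundaries[i] is assumed to be the first key of node i+1's interval
# (three nodes: node 0 owns keys < "h", node 1 keys in ["h", "q"),
# node 2 keys >= "q")
boundaries = ["h", "q"]

def node_for(key):
    # index of the interval that contains this key
    return bisect.bisect_right(boundaries, key)

def keep(node_k, key):
    # node k keeps the item only if key falls in its own interval array[k]
    return node_for(key) == node_k
```

A lookup is O(log N) in the number of nodes, so filtering stays cheap even as the cluster grows.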
In the pipeline scheduling module, the runtime parameters of the running job are collected first; the system's health is judged from these parameters, and the job's running state is then changed by controlling them.

The information to be collected is:

Vin (split/s): average arrival rate of the streaming data;

Vmc (split/s): average consumption rate of the map threads; Vmp (piece/s): average production rate of the map threads;

Vrc (piece/s): average consumption rate of the reduce threads;

TN: number of threads allocated to the job; TM: number of map threads; TR: number of reduce threads.

First, determine whether TN exceeds the number of threads an ordinary job needs. If it does, system resources are plentiful, and once map has produced enough intermediate results to form a shard, the shard can be sent directly to the reduce stage for processing. Otherwise, the numbers of map and reduce threads and the caching of intermediate results must be controlled according to the situation; the specific algorithm is as follows:
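One plausible reading of that control policy is sketched below; the Job record, the threshold, and the returned action names are assumptions, since the patent text does not reproduce the actual algorithm listing:

```python
from dataclasses import dataclass

@dataclass
class Job:
    v_in: float      # Vin: stream arrival rate (split/s)
    v_mc: float      # Vmc: map consumption rate (split/s)
    v_mp: float      # Vmp: map production rate (piece/s)
    v_rc: float      # Vrc: reduce consumption rate (piece/s)
    t_n: int         # TN: threads allocated to the job
    t_m: int         # TM: map threads
    t_r: int         # TR: reduce threads
    map_cache: int = 64
    reduce_cache: int = 64

REQUIRED_THREADS = 8  # assumed "threads an ordinary job needs"

def tune(job):
    """One iteration of the control loop; returns the action taken."""
    if job.t_n > REQUIRED_THREADS:
        return "dispatch-direct"          # plentiful: send shards straight to reduce
    if job.v_in <= job.v_mc:              # step 1: map workload is light
        if job.v_rc > job.v_mp:           # step 2: reduce outpaces map
            job.map_cache = max(1, job.map_cache // 2)
            return "shrink-map-cache"
        job.reduce_cache *= 2             # otherwise buffer more, add a reducer
        job.t_r += 1
        return "grow-reduce"
    job.t_m += 1                          # step 3: keep up with the stream
    job.map_cache *= 2
    return "grow-map"
```

Calling `tune` repeatedly corresponds to step 4's "return to step 1": the monitored rates are re-read and the thread counts and caches drift toward balance.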
In the memory management module, the storage of and access to intermediate results are reorganized to ensure that they are stored reliably and transparently, as shown in Figure 3. An intermediate result is first addressed quickly through the hash tree table; once its information header is found, the data can be accessed directly through the header, whether it resides in memory or in external storage. The information headers and the hash table are kept in memory, while the intermediate data is stored partly in memory and partly in external storage; pages (key-value pair records) are swapped between the two using the least recently used (LRU) algorithm.
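The memory structure of Figure 3 can be approximated by a two-tier store: an in-memory ordered table playing the role of index plus buffer area, a plain dictionary standing in for the external storage area, and LRU spilling between them. This is a simplified sketch under those assumptions, not the patented layout:

```python
from collections import OrderedDict

class TwoTierStore:
    def __init__(self, mem_capacity):
        self.mem = OrderedDict()   # buffer area: key -> value, kept in LRU order
        self.disk = {}             # stand-in for the external storage area
        self.capacity = mem_capacity

    def put(self, key, value):
        if key in self.mem:
            self.mem.move_to_end(key)         # refresh recency on rewrite
        self.mem[key] = value
        while len(self.mem) > self.capacity:  # spill the coldest pair to disk
            cold_key, cold_val = self.mem.popitem(last=False)
            self.disk[cold_key] = cold_val

    def get(self, key):
        if key in self.mem:                   # hit in the buffer area
            self.mem.move_to_end(key)
            return self.mem[key]
        if key in self.disk:                  # miss: fetch from disk and
            value = self.disk.pop(key)        # promote back into memory
            self.put(key, value)
            return value
        raise KeyError(key)
```

Reads and writes go through one interface, so callers never see whether a pair currently lives in memory or on disk, which matches the transparency goal stated above.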
Claims (6)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310287554.XA CN103345514B (en) | 2013-07-09 | 2013-07-09 | Streaming data processing method under big data environment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103345514A CN103345514A (en) | 2013-10-09 |
CN103345514B true CN103345514B (en) | 2016-06-08 |
Legal Events

- C06 / PB01: Publication
- C10 / SE01: Entry into substantive examination / Entry into force of request for substantive examination
- C14 / GR01: Grant of patent or utility model / Patent grant
- CF01: Termination of patent right due to non-payment of annual fee (granted publication date: 20160608)