CN102664789B

CN102664789B - Method and system for processing large-scale data

Info

Publication number: CN102664789B
Application number: CN201210102411.2A
Authority: CN
Inventors: 贺艳军; 李婷婷; 周宇; 石婧岚
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2012-04-09
Filing date: 2012-04-09
Publication date: 2016-08-17
Anticipated expiration: 2032-04-09
Also published as: CN102664789A

Abstract

The present invention provides a system and method for processing large-scale data. The system includes a traffic collection subsystem and a traffic processing subsystem. The traffic collection subsystem collects data traffic, mirrors the collected traffic, splits the resulting mirrored traffic into P sub-flows, and sends them to the traffic storage cluster in the traffic processing subsystem, where P is an integer greater than 1. The traffic storage cluster consists of M storage servers, each with N attached disks, where M is a positive integer, N is an integer greater than 1, and M×N≥P. Each storage server receives the sub-flows distributed to it and uses load balancing to write them to its N attached disks. This reduces the pressure of sustained writes on each disk and better addresses the problem of storing large-scale data.

Description

A large-scale data processing method and system

【Technical Field】

The present invention relates to computer network technology, and in particular to a method and system for processing large-scale data.

【Background】

As the number of network users keeps growing, the volume of data on the Internet is increasing explosively, and expectations for network transmission speed, data security, and reliability have changed accordingly. User data is widely distributed across many locations. Data storage and backup that are not well managed pose hidden risks to business operations, and the speed and quality of data transmission affect the user experience. In addition, with the gradual rise and spread of cloud services, the need to store, aggregate, and analyze large-scale data has become a pressing problem. Existing data processing systems and methods, however, are limited by their performance and cannot meet these needs. For example, applying them directly to large-scale data storage would create an unsustainable read/write load.

【Summary of the Invention】

The present invention provides a method and system for processing large-scale data, so as to meet the demands of large-scale data processing.

The specific technical solution is as follows:

A system for processing large-scale data, the system including a traffic collection subsystem and a traffic processing subsystem;

the traffic collection subsystem is configured to collect data traffic, mirror the collected data traffic, split the resulting mirrored traffic into P sub-flows, and send them to the traffic storage cluster in the traffic processing subsystem, where P is an integer greater than 1;

the traffic storage cluster consists of M storage servers, each with N attached disks, where M is a positive integer, N is an integer greater than 1, and M×N≥P; each storage server receives the sub-flows distributed to it and uses load balancing to write them to its N attached disks.

According to a preferred embodiment of the present invention, the traffic collection subsystem includes:

a traffic collection unit for collecting data traffic at the egress of the external-network core switch and mirroring the collected data traffic, and

a splitting unit for splitting the mirrored traffic into the sub-flows using load balancing.

According to a preferred embodiment of the present invention, the traffic collection unit consists of an optical splitter and an optical amplifier;

the optical splitter splits the optical signal carrying the data traffic at the egress of the external-network core switch, and the optical amplifier amplifies the split signal to form the mirrored traffic.

According to a preferred embodiment of the present invention, the splitting unit is a splitting switch that uses a trunk to split the mirrored traffic into P sub-flows with load balancing.

According to a preferred embodiment of the present invention, multiple processes run on each storage server, and each process corresponds to a subset of the N disks. Each process is responsible for receiving a portion of the sub-flows and writing them to its corresponding disks in rotation, switching disks at intervals of a preset length of time.

According to a preferred embodiment of the present invention, the traffic processing subsystem further includes a real-time analysis cluster;

the traffic collection subsystem mirrors the collected data traffic to obtain two streams of mirrored traffic, one of which undergoes the splitting described above while the other is sent to the real-time analysis cluster;

the real-time analysis cluster is configured to compile traffic statistics for the mirrored traffic it receives and to generate an analysis file from the statistical results.

According to a preferred embodiment of the present invention, the real-time analysis cluster includes a real-time receiving module, made up of a server cluster, and a summary statistics module;

several servers in the real-time receiving module receive the mirrored traffic and write the compiled traffic statistics to log files;

the summary statistics module downloads the log files generated by those servers, aggregates the traffic information in the log files, and outputs the resulting analysis file, where the download period is longer than the period at which the real-time receiving module writes traffic statistics to the log files.

According to a preferred embodiment of the present invention, the traffic processing subsystem further includes a non-real-time analysis cluster, configured to aggregate the sub-flows stored by the traffic storage cluster and then analyze them, the analysis including mining of network attack behavior or extraction of requested data.

A method for processing large-scale data, applied to a large-scale data processing system that includes a traffic collection subsystem and a traffic processing subsystem, where the traffic storage cluster in the traffic processing subsystem consists of M storage servers, each with N attached disks, the method including:

the traffic collection subsystem collects data traffic, mirrors the collected data traffic, splits the resulting mirrored traffic into P sub-flows, and sends them to the traffic storage cluster, where P is an integer greater than 1;

each storage server receives the sub-flows distributed to it and uses load balancing to write them to its N attached disks, where M is a positive integer, N is an integer greater than 1, and M×N≥P.

According to a preferred embodiment of the present invention, collecting the data traffic specifically means collecting the data traffic of the external-network core switch.

According to a preferred embodiment of the present invention, mirroring the collected data traffic specifically includes:

using an optical splitter to split the optical signal carrying the collected data traffic, and using an optical amplifier to amplify the split signal to form the mirrored traffic.

According to a preferred embodiment of the present invention, splitting the resulting mirrored traffic into P sub-flows specifically includes:

using the trunk of a splitting switch to split the mirrored traffic into P sub-flows with load balancing.

According to a preferred embodiment of the present invention, using load balancing to write the distributed sub-flows to the N attached disks specifically includes: multiple processes run on each storage server, and each process corresponds to a subset of the N disks. Each process is responsible for receiving a portion of the sub-flows and writing them to its corresponding disks in rotation, switching disks at intervals of a preset length of time.

According to a preferred embodiment of the present invention, when the traffic collection subsystem mirrors the collected data traffic, it obtains two streams of mirrored traffic, one of which undergoes the splitting described above while the other is sent to the real-time analysis cluster of the traffic processing subsystem;

the real-time analysis cluster compiles traffic statistics for the mirrored traffic it receives and generates an analysis file from the statistical results.

According to a preferred embodiment of the present invention, compiling traffic statistics for the received mirrored traffic and generating an analysis file from the statistical results specifically includes:

several servers in the real-time analysis cluster receive the mirrored traffic and write the compiled traffic statistics to log files;

the summary statistics module in the real-time analysis cluster downloads the log files generated by those servers, aggregates the traffic information in the log files, and outputs the resulting analysis file, where the download period is longer than the period at which traffic statistics are written to the log files.

According to a preferred embodiment of the present invention, the method further includes:

the non-real-time analysis cluster aggregates the sub-flows stored by the traffic storage cluster and then analyzes them, the analysis including mining of network attack behavior or extraction of requested data.

As the above technical solution shows, in the system and method provided by the present invention, the traffic collection subsystem first mirrors the collected data traffic, then splits the mirrored traffic into multiple sub-flows and sends them to the traffic storage cluster of the traffic processing subsystem. The traffic storage cluster consists of several storage servers, and each storage server uses load balancing to write the sub-flows it receives to its multiple attached disks. This reduces the pressure of sustained writes on each disk and better addresses the problem of large-scale data storage, while improving disk utilization and saving server cost.

【Brief Description of the Drawings】

FIG. 1 is a schematic diagram of the large-scale data processing system provided by an embodiment of the present invention;

FIG. 2 is a diagram of an example system provided by an embodiment of the present invention;

FIG. 3 is a flowchart of the large-scale data processing method provided by an embodiment of the present invention.

【Detailed Description】

To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in detail below with reference to the drawings and specific embodiments.

The large-scale data processing system provided by the present invention is described first. As shown in FIG. 1, the system may include a traffic collection subsystem 100 and a traffic processing subsystem 200.

The traffic collection subsystem 100 is configured to collect data traffic and mirror the collected data traffic to the server clusters in the traffic processing subsystem 200.

Specifically, it may include a traffic collection unit 110 for collecting data traffic and mirroring the collected data traffic, and may further include a splitting unit 120 for splitting the mirrored traffic into sub-flows using load balancing.

When collecting data traffic, the traffic collection unit 110 may place its collection point at the egress of the external-network core switch. This placement captures all traffic data without loss and reaches the intended goal with few collection points, which saves cost and reduces engineering difficulty. The traffic collection unit 110 may collect and mirror data traffic in either of the following two ways:

First, port mirroring: data traffic is collected by mirroring the data of one or more ports of the external-network core switch to one or more other ports. This approach is existing technology and is not described further here.

Second, optical-splitter mirroring: an optical splitter first splits the signal carrying the egress data of the external-network core switch. Because splitting attenuates the signal, the split signal can then be optically amplified so that its strength is sufficient and the data remains complete and reliable. Compared with port mirroring, optical-splitter mirroring offers higher stability and reliability. Port mirroring affects the core switch itself, and for online services a core switch failure is fatal, so optical-splitter mirroring is the preferred way to collect data traffic.

Of the two streams obtained by mirroring, one may be sent to the real-time analysis cluster in the traffic processing subsystem 200 for real-time analysis, and the other may be sent to the splitting unit 120 for further processing. The splitting unit 120 may be implemented with a splitting switch. The split can use a trunk: the splitting switch uses load balancing to split the received mirrored traffic into multiple sub-flows and sends them to the server cluster in the traffic processing subsystem 200, which applies the same processing to each sub-flow; here that processing is mainly storing each sub-flow separately. Taking 10G of data traffic as an example, one 10-Gigabit port of the switch serves as the ingress port for the 10G traffic, and 8 Gigabit ports on the egress side form a trunk. The ingress traffic is then spread evenly across the 8 Gigabit ports in round-robin fashion, which provides the first stage of load balancing for the high-speed traffic.
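The first stage of load balancing can be illustrated with a minimal Python sketch. It assumes packet-by-packet round-robin over the member ports of the trunk, as in the example above. The function name `distribute_round_robin` and the modeling of ports as in-memory lists are hypothetical; an actual splitting switch performs this in hardware.

```python
from itertools import cycle


def distribute_round_robin(packets, num_ports=8):
    """Assign each packet to an egress port in round-robin order.

    Returns one packet list per port. `num_ports` models the 8 Gigabit
    member ports of the trunk in the example (an illustrative default).
    """
    ports = [[] for _ in range(num_ports)]
    # cycle() yields 0, 1, ..., num_ports - 1, 0, 1, ... indefinitely;
    # zip() stops as soon as the packets run out.
    for port_index, packet in zip(cycle(range(num_ports)), packets):
        ports[port_index].append(packet)
    return ports


if __name__ == "__main__":
    sample = [f"pkt{i}".encode() for i in range(20)]
    for index, queue in enumerate(distribute_round_robin(sample)):
        print(index, len(queue))
```

With round-robin assignment, the packet counts on any two ports differ by at most one, which is what keeps the sub-flows close to equal in size.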

FIG. 2 is a schematic diagram of one implementation of the traffic collection subsystem 100: an optical splitter splits the egress traffic of the external-network core switch, an optical amplifier amplifies the split signal, and a splitting switch then splits the traffic into sub-flows. One stream obtained from the optical split may be sent to the real-time analysis cluster in the traffic processing subsystem 200. The other stream passes through the splitting switch, and the resulting sub-flows may be sent to the traffic storage cluster in the traffic processing subsystem 200 for subsequent non-real-time analysis.

The real-time analysis cluster 210 and the traffic storage cluster 220 in the traffic processing subsystem 200 are described in detail below.

The real-time analysis cluster 210 compiles traffic statistics for the traffic it receives and generates an analysis file from the statistical results. Specifically, the real-time analysis cluster 210 may include a real-time receiving module and a summary statistics module (not shown in FIG. 1).

The real-time receiving module may consist of a server cluster in which every server runs the same packet capture and statistics programs and writes the statistical results to a log file. Taking 10-Gigabit servers as an example, each server supports two 10-Gigabit network cards and can process 20G of data traffic at once. The packet capture program receives packets efficiently from the 10-Gigabit network cards, and the statistics program compiles traffic information separately for each destination IP. The statistics may include, but are not limited to: TCP, UDP, and ICMP traffic volumes, usually in bps; TCP, UDP, and ICMP packet rates, usually in pps; the number of accesses per second to non-service ports; the number of HTTP GET requests per second and the length of GET packets; and the number of response packets per second for the main HTTP status codes. The results may then be written to the log file in binary format.
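A minimal Python sketch of the per-destination statistics follows. It assumes the packets captured in one interval have already been parsed into `(dst_ip, protocol, length_bytes)` tuples. The function name `summarize_by_destination` and the tuple layout are hypothetical, and the sketch covers only the bps and pps figures, not the HTTP counters or the binary log format.

```python
from collections import defaultdict


def summarize_by_destination(packets, interval_seconds):
    """Compute per-destination bit rates (bps) and packet rates (pps).

    `packets` is an iterable of (dst_ip, protocol, length_bytes) tuples
    observed during one interval; protocol is e.g. "tcp", "udp", "icmp".
    """
    byte_counts = defaultdict(lambda: defaultdict(int))
    packet_counts = defaultdict(lambda: defaultdict(int))
    for dst_ip, protocol, length_bytes in packets:
        byte_counts[dst_ip][protocol] += length_bytes
        packet_counts[dst_ip][protocol] += 1

    stats = {}
    for dst_ip, per_protocol in byte_counts.items():
        stats[dst_ip] = {
            protocol: {
                # bytes -> bits, then divide by the interval length
                "bps": total_bytes * 8 / interval_seconds,
                "pps": packet_counts[dst_ip][protocol] / interval_seconds,
            }
            for protocol, total_bytes in per_protocol.items()
        }
    return stats
```

Keying the counters by destination IP first keeps each record self-contained, so records for the same IP from different servers can later be combined.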

The summary statistics module downloads the log files generated by the server cluster of the real-time receiving module. The download period is usually longer than the period at which the real-time receiving module writes traffic statistics to the log files. The module then aggregates the traffic information in the log files to produce an analysis file and outputs it. For example, it can aggregate the traffic information for the same destination IP across the log files.
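The aggregation by destination IP can be sketched in Python as follows. The sketch assumes each downloaded log file has already been decoded into a dict that maps a destination IP to named counters. The function name `merge_logs` and the counter names are hypothetical, and summing is only one plausible way to combine records from servers that each saw a different share of the traffic.

```python
from collections import Counter, defaultdict


def merge_logs(logs):
    """Sum the counters of identical destination IPs across server logs.

    `logs` is an iterable of dicts mapping dst_ip -> {counter_name: value}.
    """
    merged = defaultdict(Counter)
    for log in logs:
        for dst_ip, counters in log.items():
            # Counter.update() adds the given values to existing counts.
            merged[dst_ip].update(counters)
    return {dst_ip: dict(counters) for dst_ip, counters in merged.items()}
```

For example, merging `{"10.0.0.1": {"tcp_bytes": 100}}` with `{"10.0.0.1": {"tcp_bytes": 50}}` gives a single record for `10.0.0.1` with `tcp_bytes` equal to 150.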

The traffic storage cluster 220 is a cluster of M storage servers, where M is a positive integer. Its main function is to write the received traffic to disk efficiently and reliably. Because it receives a massive number of packets, and real online applications typically handle traffic at rates of tens or even hundreds of G/s, large-scale traffic must be stored on slow disks at low cost. In the present invention, each storage server has N attached disks, where N is an integer greater than 1 and M×N≥P, and P is the number of sub-flows produced by the splitting unit 120 of the traffic collection subsystem 100. Each storage server receives the traffic distributed to it and uses load balancing to write it to the disks; specifically, it may write to the disks in rotation, switching disks at intervals of a preset length of time. Multiple processes may run on each storage server, each corresponding to a subset of the disks. Each process is responsible for receiving a portion of the sub-flows and writing them to its corresponding disks in rotation at the preset interval.

As an example, suppose the traffic storage cluster includes two storage servers, each with one 4-port Gigabit network card and 8 attached disks of 1T each. The splitting unit 120 produces 8 sub-flows. Each storage server runs 4 independent processes at the same time, which receive traffic from the 4 Gigabit ports respectively, so each server receives 4 of the sub-flows, and each process corresponds to 2 disks. When writing traffic to disk, each process applies load balancing again; this is the second stage of load balancing. The process can alternate between its 2 disks in units of one minute: traffic from the first minute is written to the first disk, traffic from the second minute to the second disk, traffic from the third minute to the first disk, traffic from the fourth minute to the second disk, and so on. This strategy takes full advantage of the independence of the processes and disks, reduces the pressure of sustained writes on each disk, and better addresses the problem of large-scale data storage, while improving disk utilization and saving server cost.
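The second stage of load balancing can be sketched in Python. The sketch assumes time is cut into fixed slices and that slice k goes to disk k modulo the number of disks owned by the process, which matches the one-minute alternation in the example. The function names `assign_disks` and `disk_for_timestamp` and the mount paths are hypothetical.

```python
def assign_disks(num_processes, disks_per_server):
    """Give each receiving process an equal, disjoint share of disk indexes."""
    per_process = disks_per_server // num_processes
    return [
        list(range(p * per_process, (p + 1) * per_process))
        for p in range(num_processes)
    ]


def disk_for_timestamp(timestamp_seconds, disks, slice_seconds=60):
    """Pick the disk that receives traffic captured at `timestamp_seconds`.

    Consecutive time slices go to consecutive disks, wrapping around at
    the end of the list, so each disk rests while the others are written.
    """
    slice_index = int(timestamp_seconds // slice_seconds)
    return disks[slice_index % len(disks)]


if __name__ == "__main__":
    # 4 processes sharing 8 disks, as in the example: 2 disks per process.
    print(assign_disks(4, 8))
    my_disks = ["/data0", "/data1"]  # hypothetical mount points
    for second in (0, 60, 120, 180):
        print(second, disk_for_timestamp(second, my_disks))
```

In the example, M = 2, N = 8, and P = 8, so the constraint M×N≥P holds because 16 ≥ 8; each of the 8 sub-flows is handled by its own process with 2 dedicated disks.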

In addition, the traffic processing subsystem 200 may further include a non-real-time analysis cluster 230, which aggregates the traffic stored by the traffic storage cluster 220 and then analyzes it. The analysis includes, but is not limited to, mining of network attack behavior and extraction of requested data.

When mining network attack behavior, the traffic from the attack period can be extracted and the attack analyzed based on the characteristics of that traffic. Common network attacks mainly include bandwidth attacks at the network layer, SYN flood and ACK flood attacks at the TCP layer, and distributed request attacks at the application layer. Such attacks affect the stable operation of products. The stored historical data, that is, the traffic stored by the traffic storage cluster 220, makes it possible to analyze attack characteristics in depth, which supports both the defense of product lines and the forensics of attack behavior. For bandwidth attacks at the network layer, commonly UDP flood and ICMP flood, the traffic from the attack period is extracted and the volume of each type of traffic in that period is counted to determine the attack type and scale. For attacks that exhaust TCP-layer protocol stack resources, the traffic from the attack period is extracted and the packet rate for each combination of TCP flags in that period is counted to determine the attack type and scale.
For distributed request attacks at the application layer, the packets from the attack period are extracted and the fields of the HTTP request headers in that period, including host, url, cookie, User-Agent, and referer, are counted to determine the attack type and to identify the product lines and pages under attack. The request characteristics of the HTTP headers are also summarized to provide identifying signatures for blocking policies.
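The counting for network-layer and TCP-layer attacks can be sketched in Python. The sketch assumes the packets of the attack period have already been parsed into `(protocol, length_bytes, tcp_flags)` tuples, where `tcp_flags` is the flags byte of the TCP header or `None` for other protocols. The function names are hypothetical, and the source gives no thresholds for deciding that an attack is under way, so the sketch only produces the counts.

```python
from collections import Counter

# Standard TCP control bits, lowest bit first.
TCP_FLAG_NAMES = {
    0x01: "FIN", 0x02: "SYN", 0x04: "RST",
    0x08: "PSH", 0x10: "ACK", 0x20: "URG",
}


def traffic_volume_by_protocol(packets, window_seconds):
    """Return {protocol: bits per second} over the attack window."""
    byte_totals = Counter()
    for protocol, length_bytes, _flags in packets:
        byte_totals[protocol] += length_bytes
    return {
        protocol: total * 8 / window_seconds
        for protocol, total in byte_totals.items()
    }


def tcp_flag_packet_rates(packets, window_seconds):
    """Return {flag combination: packets per second} for TCP packets."""
    flag_counts = Counter()
    for protocol, _length, flags in packets:
        if protocol != "tcp" or flags is None:
            continue
        names = [name for bit, name in TCP_FLAG_NAMES.items() if flags & bit]
        flag_counts["+".join(names) or "NONE"] += 1
    return {
        combo: count / window_seconds
        for combo, count in flag_counts.items()
    }
```

A UDP or ICMP volume that dominates the first result would point toward a bandwidth attack, and an unusually high rate of bare SYN or bare ACK packets in the second would point toward a SYN flood or ACK flood.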

Business teams currently need past access records for troubleshooting and for offline testing of products, and the extraction of requested data exists to meet that need. It is implemented on top of the traffic stored by the traffic storage cluster 220: given the destination IP of a product line, the non-real-time analysis cluster 230 extracts the packets for that destination IP from the stored traffic and saves them in a format such as a packet capture (pcap) file, so that the packets can later be handed to the requesting team.
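A minimal Python sketch of extraction by destination IP follows. It assumes the stored traffic is available as `(timestamp_seconds, frame_bytes)` pairs holding untagged Ethernet frames that carry IPv4, and it writes the classic pcap file layout (a 24-byte global header followed by a 16-byte header per packet). The function names and the input layout are hypothetical; how the storage cluster actually lays out traffic on disk is not specified in the source.

```python
import socket
import struct

PCAP_GLOBAL_HEADER = struct.pack(
    "<IHHiIII",
    0xA1B2C3D4,  # magic number (microsecond timestamps)
    2, 4,        # file format version 2.4
    0,           # timezone offset, normally 0
    0,           # timestamp accuracy, normally 0
    65535,       # snapshot length
    1,           # link type 1 = Ethernet
)


def destination_ip(frame):
    """Return the IPv4 destination of an untagged Ethernet frame, or None."""
    # Bytes 12-13 hold the EtherType; 0x0800 means IPv4.
    if len(frame) < 34 or frame[12:14] != b"\x08\x00":
        return None
    # 14-byte Ethernet header + offset 16 inside the IPv4 header.
    return socket.inet_ntoa(frame[30:34])


def extract_to_pcap(records, target_ip, output_path):
    """Write frames destined for `target_ip` to a pcap file.

    `records` is an iterable of (timestamp_seconds, frame_bytes) pairs.
    Returns the number of frames written.
    """
    written = 0
    with open(output_path, "wb") as out:
        out.write(PCAP_GLOBAL_HEADER)
        for timestamp, frame in records:
            if destination_ip(frame) != target_ip:
                continue
            seconds = int(timestamp)
            microseconds = int((timestamp - seconds) * 1_000_000)
            out.write(struct.pack("<IIII", seconds, microseconds,
                                  len(frame), len(frame)))
            out.write(frame)
            written += 1
    return written
```

Writing a standard capture format means the extracted packets can be opened with common packet analysis tools by the requesting team.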

The large-scale data processing method implemented on the above system may be as shown in FIG. 3 and mainly includes the following steps:

Step 301: The traffic collection subsystem collects data traffic and mirrors it. One of the resulting streams of mirrored traffic proceeds to step 302; the other is sent to the real-time analysis cluster in the traffic processing subsystem and proceeds to step 305.

When collecting data traffic, the collection point may be placed at the egress of the external-network core switch, that is, the data traffic of the external-network core switch is collected.

The collected data traffic may be mirrored in either of the two ways described above for the traffic collection unit 110: port mirroring or optical-splitter mirroring.

Step 302: The mirrored traffic is split into P sub-flows, which are sent to the traffic storage cluster in the traffic processing subsystem. P is an integer greater than 1.

The splitting in this step may be performed by a splitting switch, which uses a trunk to split the mirrored traffic into P sub-flows with load balancing.

Step 303: The M storage servers in the traffic storage cluster each receive the sub-flows distributed to them and use load balancing to write those sub-flows to their N attached disks, where M is a positive integer, N is an integer greater than 1, and M×N≥P.

The load balancing in this step may write to the disks in rotation, switching disks at intervals of a preset length of time. Multiple processes may run on each storage server, each corresponding to a subset of the disks. Each process is responsible for receiving a portion of the sub-flows and writing them to its corresponding disks in rotation at the preset interval. This strategy takes full advantage of the independence of the processes and disks, reduces the pressure of sustained writes on each disk, and better addresses the problem of large-scale data storage, while improving disk utilization and saving server cost.

步骤304：流量处理子系统中的非实时分析集群汇总流量存储集群存储的子流量后进行分析，执行的分析包括但不限于：网络攻击行为的挖掘或者需求数据的抽取。Step 304: The non-real-time analysis cluster in the traffic processing subsystem summarizes the sub-traffic stored in the traffic storage cluster and then conducts analysis. The analysis performed includes but is not limited to: mining of network attack behavior or extraction of demand data.

When mining network attack behavior, the traffic from the attack period can be extracted and the attack behavior analyzed based on the characteristics of the extracted traffic. Common network attacks mainly include bandwidth attacks at the network layer, SYN flood and ACK flood attacks at the TCP layer, and distributed request attacks at the application layer. Such attacks affect the stable operation of products. Based on the historical data already stored, that is, the traffic stored in the traffic storage cluster, the attack characteristics can be analyzed in depth to support the defense of the product lines and the forensics of attack behavior. For bandwidth attacks at the network layer, of which UDP flood and ICMP flood are the common forms, the traffic from the attack period is extracted and the volume of each type of traffic in that period is counted to determine the attack type and attack scale. For attacks that exhaust the resources of the TCP protocol stack, the traffic from the attack period is extracted and the packet rate for each type of TCP flag in that period is counted to determine the attack type and attack scale. For distributed request attacks at the application layer, the data packets from the attack period are extracted and the fields of the HTTP request header in that period, including host, url, cookie, User-Agent and referer, are counted to determine the attack type and, further, the attacked product line and the related pages. At the same time, the request characteristics of the HTTP headers are summarized to provide identifying marks for the blocking policy.
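The three kinds of statistics described above can be illustrated with a minimal Python sketch. The record layout (the keys `proto`, `length` and `flags`), the function names, and the sample values are assumptions made for illustration only and are not taken from the invention; the stored traffic is assumed to have already been parsed into such records.

```python
from collections import Counter


def protocol_volume(packets, window_seconds):
    """Bits per second for each protocol seen in the attack window."""
    bits = Counter()
    for pkt in packets:
        bits[pkt["proto"]] += pkt["length"] * 8
    return {proto: total / window_seconds for proto, total in bits.items()}


def tcp_flag_rates(packets, window_seconds):
    """Packets per second for each TCP flag combination in the attack window."""
    counts = Counter(pkt["flags"] for pkt in packets if pkt["proto"] == "tcp")
    return {flags: n / window_seconds for flags, n in counts.items()}


def top_header_values(requests, field, n=3):
    """Most frequent values of one HTTP request header field (e.g. host, url)."""
    return Counter(req.get(field, "") for req in requests).most_common(n)


# Hypothetical sample: a 2-second window dominated by UDP traffic and SYN packets.
packets = [
    {"proto": "udp", "length": 1000},
    {"proto": "udp", "length": 1000},
    {"proto": "tcp", "length": 60, "flags": "SYN"},
    {"proto": "tcp", "length": 60, "flags": "SYN"},
    {"proto": "tcp", "length": 60, "flags": "ACK"},
    {"proto": "icmp", "length": 100},
]
requests = [
    {"host": "a.example", "url": "/x"},
    {"host": "a.example", "url": "/x"},
    {"host": "b.example", "url": "/y"},
]
volume = protocol_volume(packets, 2)
flag_rates = tcp_flag_rates(packets, 2)
top_hosts = top_header_values(requests, "host", 1)
```

In this sketch the dominant protocol in `volume` hints at a bandwidth attack, the dominant entry in `flag_rates` hints at a SYN flood or ACK flood, and `top_hosts` points at the attacked product line. The thresholds used to reach a verdict are left out because the source does not specify them.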

Business teams currently need past access records for troubleshooting and for offline testing of products, and the extraction of demand data, that is, the data requested by the business side, serves this need. Specifically, working on the traffic stored in the traffic storage cluster, the non-real-time analysis cluster extracts the data packets whose destination IP matches the destination IP of a product line and stores them in a format such as a pcap file, so that the packets can later be provided to the requesting business party.
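As a hedged illustration of extraction by destination IP, the following Python sketch filters a classic pcap stream and copies only the matching packets to a new pcap stream. It assumes that the stored traffic is already available as a little-endian pcap file with Ethernet framing and untagged IPv4 packets. The function name `extract_by_dst_ip` is hypothetical, and the source does not state how the invention reads its stored traffic.

```python
import socket
import struct

# Classic pcap layout: a 24-byte global header, then a 16-byte header per packet.
PCAP_GLOBAL_HEADER = struct.Struct("<IHHiIII")
PCAP_RECORD_HEADER = struct.Struct("<IIII")
PCAP_MAGIC = 0xA1B2C3D4
LINKTYPE_ETHERNET = 1


def extract_by_dst_ip(src, dst, target_ip):
    """Copy packets whose IPv4 destination equals target_ip from src to dst.

    src and dst are binary file objects; the number of copied packets is returned.
    """
    target = socket.inet_aton(target_ip)
    header = src.read(PCAP_GLOBAL_HEADER.size)
    magic, _, _, _, _, _, linktype = PCAP_GLOBAL_HEADER.unpack(header)
    if magic != PCAP_MAGIC or linktype != LINKTYPE_ETHERNET:
        raise ValueError("this sketch handles little-endian Ethernet pcap only")
    dst.write(header)
    kept = 0
    while True:
        record = src.read(PCAP_RECORD_HEADER.size)
        if len(record) < PCAP_RECORD_HEADER.size:
            break  # end of file
        _, _, incl_len, _ = PCAP_RECORD_HEADER.unpack(record)
        frame = src.read(incl_len)
        # EtherType 0x0800 at bytes 12-13 marks IPv4; with a 14-byte Ethernet
        # header the IPv4 destination address occupies bytes 30-33 of the frame.
        is_ipv4 = len(frame) >= 34 and frame[12:14] == b"\x08\x00"
        if is_ipv4 and frame[30:34] == target:
            dst.write(record)
            dst.write(frame)
            kept += 1
    return kept
```

A caller would open the stored capture and the output file in binary mode and pass the destination IP of the product line, for example `extract_by_dst_ip(open("in.pcap", "rb"), open("out.pcap", "wb"), "192.0.2.10")`, where the file names and the address are placeholders.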

Step 305: The real-time analysis cluster collects statistics on the traffic information of the received mirrored traffic and uses the statistical results to generate an analysis file.

In this step, several servers in the real-time analysis cluster receive the mirrored traffic and write the traffic statistics they collect into log files. The summary statistics module in the real-time analysis cluster then downloads the log files generated by these servers and aggregates the traffic information in the log files to obtain and output the analysis file. The period at which the summary statistics module downloads the log files is longer than the period at which the servers write the traffic statistics into the log files.
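The aggregation performed by the summary statistics module can be sketched as follows. This is a minimal illustration under stated assumptions: each log record is taken to be a `(dst_ip, metric, value)` tuple, values are summed across servers, and the names `validate_periods` and `summarize` are hypothetical. The source states only that the download period is longer than the write period; it does not describe the record layout or the merge rule.

```python
from collections import defaultdict


def validate_periods(write_period_s, download_period_s):
    """Check that downloads happen less often than log writes.

    Returns how many complete write periods fit into one download period.
    """
    if download_period_s <= write_period_s:
        raise ValueError("download period must exceed the log write period")
    return download_period_s // write_period_s


def summarize(logs_by_server):
    """Merge the records of all servers into per-destination-IP totals."""
    totals = defaultdict(lambda: defaultdict(int))
    for records in logs_by_server.values():
        for dst_ip, metric, value in records:
            totals[dst_ip][metric] += value
    return {ip: dict(metrics) for ip, metrics in totals.items()}


# Hypothetical log contents downloaded from two servers.
logs = {
    "server-1": [("192.0.2.1", "tcp_bytes", 100), ("192.0.2.1", "udp_bytes", 10)],
    "server-2": [("192.0.2.1", "tcp_bytes", 50), ("192.0.2.2", "tcp_bytes", 7)],
}
analysis = summarize(logs)
```

Keeping the download period longer than the write period means that each download picks up at least one complete batch of statistics from every server.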

These servers run the same packet capture program and statistics program. The packet capture program receives packets efficiently from a 10 Gigabit network card, and the statistics program collects traffic information separately for each destination IP. The statistics may include, but are not limited to: the TCP, UDP and ICMP traffic values, usually in bps; the TCP, UDP and ICMP packet rates, usually in pps; the number of accesses per second to non-service ports; the number of HTTP GET requests per second and the length of the GET data packets; and the number of data packets returned per second for each of the main HTTP status codes. The statistical results can then be written to the log file in binary format.
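A minimal Python sketch of per-destination-IP statistics written in a binary format is given below. The 40-byte record layout in `RECORD`, the function names, and the restriction to the bps and pps figures for TCP, UDP and ICMP are assumptions for illustration; the source does not specify the binary layout, and the HTTP and port statistics it lists are omitted here.

```python
import socket
import struct
from collections import defaultdict

PROTOCOLS = ("tcp", "udp", "icmp")
# Assumed record: IPv4 address, three 64-bit bps values, three 32-bit pps values.
RECORD = struct.Struct("!4sQQQIII")


def accumulate(packets):
    """Sum bits and packet counts per destination IP and protocol."""
    stats = defaultdict(lambda: {p: [0, 0] for p in PROTOCOLS})
    for dst_ip, proto, length in packets:
        if proto not in PROTOCOLS:
            continue
        entry = stats[dst_ip][proto]
        entry[0] += length * 8  # bits
        entry[1] += 1           # packets
    return stats


def encode(stats, interval_s):
    """Serialize one interval of statistics as fixed-size binary records."""
    out = bytearray()
    for dst_ip in sorted(stats):
        per_proto = stats[dst_ip]
        out += RECORD.pack(
            socket.inet_aton(dst_ip),
            *(per_proto[p][0] // interval_s for p in PROTOCOLS),
            *(per_proto[p][1] // interval_s for p in PROTOCOLS),
        )
    return bytes(out)


def decode(blob):
    """Read the binary records back into dictionaries."""
    rows = []
    for fields in RECORD.iter_unpack(blob):
        rows.append({
            "dst_ip": socket.inet_ntoa(fields[0]),
            "bps": dict(zip(PROTOCOLS, fields[1:4])),
            "pps": dict(zip(PROTOCOLS, fields[4:7])),
        })
    return rows


# Hypothetical one-second interval: (destination IP, protocol, length in bytes).
sample = [
    ("192.0.2.1", "tcp", 100),
    ("192.0.2.1", "tcp", 100),
    ("192.0.2.1", "udp", 50),
    ("192.0.2.2", "icmp", 64),
]
blob = encode(accumulate(sample), 1)
rows = decode(blob)
```

Fixed-size records keep the log compact and let the reader step through the file without any delimiter parsing, which is one plausible reason to prefer a binary format at this traffic volume.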

The system and method provided by the present invention meet the storage requirements of large-scale data through traffic mirroring, a storage server cluster, and load-balanced writing of traffic to the disks attached to each storage server. Real-time analysis of the large-scale mirrored traffic is carried out by the real-time analysis cluster, and non-real-time analysis of large-scale data is carried out by the non-real-time analysis cluster, which aggregates and analyzes the data stored in the storage server cluster. It has been verified that the present invention can handle data streams with a bandwidth exceeding 100G, that the data remains complete and stable, and that the invention has a clear advantage in network equipment cost.

The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A large-scale data processing system, characterized in that the system comprises: a traffic collection subsystem and a traffic processing subsystem;

The traffic collection subsystem is used to collect the data traffic at the egress of the external network core switch, perform optical splitting on the collected data traffic, use an optical amplifier to optically amplify the data traffic after the optical splitting to form mirrored traffic, split the obtained mirrored traffic into P-way sub-traffic, and send the sub-traffic to the traffic storage cluster in the traffic processing subsystem, where P is an integer greater than 1;

The traffic storage cluster is composed of M storage servers, and each storage server has N disks attached, where M is a positive integer, N is an integer greater than 1, and M×N≥P; each storage server receives the sub-traffic distributed to it and uses load balancing technology to write that sub-traffic to the N disks attached to it.

2. The system according to claim 1, wherein the traffic collection subsystem comprises:

A traffic collection unit for collecting data traffic at the egress of the external network core switch and mirroring the collected data traffic, and

A split processing unit for splitting mirrored traffic into sub-traffic by using load balancing technology.

3. The system according to claim 2, wherein the traffic collection unit is composed of an optical splitter and an optical amplifier;

The optical splitter performs optical splitting on the data traffic at the egress of the external network core switch, and the optical amplifier optically amplifies the data traffic after the optical splitting to form mirrored traffic.

4. The system according to claim 2, wherein the split processing unit is a distribution switch that uses trunk mode to split the mirrored traffic into P-way sub-traffic by load balancing technology.

5. The system according to claim 1, wherein a plurality of processes run on each storage server, each process corresponds to a subset of the N disks, and each process is responsible for receiving a part of the sub-traffic and writing the received sub-traffic to its corresponding disks in turn, in units of a preset time length.

6. The system according to claim 1, wherein the traffic processing subsystem further comprises a real-time analysis cluster;

The traffic collection subsystem mirrors the collected data traffic to obtain two paths of mirrored traffic, wherein one path of mirrored traffic is used for the splitting and the other path of mirrored traffic is sent to the real-time analysis cluster;

The real-time analysis cluster is used to perform statistics on the traffic information of the received mirrored traffic, and use the statistical results to generate analysis files.

7. The system according to claim 6, wherein the real-time analysis cluster comprises: a real-time receiving module composed of a server cluster, and a summary statistics module;

Several servers in the real-time receiving module receive the mirrored traffic, and write the statistical traffic information into a log file;

The summary statistics module downloads the log files generated by the several servers and aggregates the traffic information in the log files to obtain and output an analysis file, wherein the period at which the log files are downloaded is longer than the period at which the real-time receiving module writes the traffic statistics to the log files.

8. The system according to claim 1, wherein the traffic processing subsystem further comprises a non-real-time analysis cluster for analyzing the sub-traffic stored in the traffic storage cluster, and the analysis includes: mining of network attack behavior or extraction of demand data.

9. A large-scale data processing method, characterized in that the method is applied to a large-scale data processing system comprising a traffic collection subsystem and a traffic processing subsystem, the traffic storage cluster in the traffic processing subsystem is composed of M storage servers, each storage server has N disks attached, and the method includes:

The traffic collection subsystem collects the data traffic at the egress of the external network core switch, performs optical splitting on the collected data traffic, uses an optical amplifier to optically amplify the data traffic after the optical splitting to form mirrored traffic, splits the obtained mirrored traffic into P-way sub-traffic, and sends the sub-traffic to the traffic storage cluster, where P is an integer greater than 1;

Each storage server receives the sub-traffic distributed to it and uses load balancing technology to write that sub-traffic to the N disks attached to it; wherein M is a positive integer, N is an integer greater than 1, and M×N≥P.

10. The method according to claim 9, wherein said splitting the obtained mirrored traffic into P-way sub-traffic is specifically:

The trunk mode of the distribution switch is used to distribute the mirrored traffic into P-way sub-traffic using load balancing technology.

11. The method according to claim 9, characterized in that using load balancing technology to write the distributed sub-traffic to the N disks attached to the storage server is specifically: running a plurality of processes on each storage server, wherein each process corresponds to a subset of the N disks, and each process is responsible for receiving a part of the sub-traffic and writing the received sub-traffic to its corresponding disks in turn, in units of a preset time length.

12. The method according to claim 9, characterized in that, when the traffic collection subsystem mirrors the collected data traffic, it obtains two paths of mirrored traffic, wherein one path of mirrored traffic is used for the splitting and the other path of mirrored traffic is sent to the real-time analysis cluster of the traffic processing subsystem;

The real-time analysis cluster performs statistics on the traffic information of the received mirrored traffic, and uses the statistical results to generate an analysis file.

13. The method according to claim 12, characterized in that, performing statistics on the traffic information of the received mirrored traffic, and using the statistical results to generate an analysis file is specifically:

Several servers in the real-time analysis cluster receive the mirrored traffic, and write the statistical traffic information into a log file;

The summary statistics module in the real-time analysis cluster downloads the log files generated by the several servers and aggregates the traffic information in the log files to obtain and output the analysis file, wherein the period at which the log files are downloaded is longer than the period at which the traffic statistics are written to the log files.

14. The method of claim 9, further comprising:

The non-real-time analysis cluster summarizes the sub-traffic stored in the traffic storage cluster for analysis, and the analysis includes: mining of network attack behavior or extraction of demand data.