CN114338437B - Network traffic classification method and device, electronic equipment and storage medium

Info

Publication number: CN114338437B
Application number: CN202210039374.9A
Authority: CN
Inventors: 杨杨; 高志鹏; 严雨; 吕睿; 高博文; 赵斌男; 李昱廷; 郭义豪; 龚兴乐; 胡皓; 刘澳伦; 龙雨寒
Original assignee: Beijing University of Posts and Telecommunications
Current assignee: Beijing University of Posts and Telecommunications
Priority date: 2022-01-13
Filing date: 2022-01-13
Publication date: 2023-12-29
Anticipated expiration: 2042-01-13
Also published as: CN114338437A

Abstract

The invention provides a network traffic classification method, device, electronic equipment and storage medium. A captured pcap file is divided into flow sequences, each composed of multiple traffic data packets; the byte features of each traffic data packet are extracted from the flow sequence to obtain a byte sequence in units of flows; each byte in the byte sequence is position-encoded, and the encoded byte sequence is input into a traffic classification network model to obtain the traffic classification result output by the model, wherein the traffic classification network model is obtained by training on samples in units of flows and the traffic classification results corresponding to those samples. Because the invention position-encodes each byte in the byte sequence separately, it can effectively extract the key position information of each byte and improve the recognition accuracy of the traffic classification network model.

Description

Network traffic classification method, device, electronic equipment and storage medium

Technical field

The present invention relates to the technical field of network traffic management, and in particular to a network traffic classification method, device, electronic equipment and storage medium.

Background

Traffic classification is an important task in modern communication networks. With the rapid growth in demand for high-throughput traffic, it has become critical to manage network resources properly and to identify the different types of applications that use them.

The emergence of new applications on the Internet and the interactions between various components have greatly increased the complexity and diversity of networks, making traffic classification itself a difficult problem that faces more and more challenges. To cope with these challenges, the prior art applies deep learning methods to traffic classification in order to obtain high-performance classifiers.

However, such deep learning methods have two shortcomings. First, the extraction of traffic features relies heavily on expert experience, so the extracted features are biased to some degree. Second, once the traffic feature sequence has been extracted, the key contextual information carried by each byte of the sequence is not fully exploited, so the classification results finally output by the deep-learning-based traffic classification model are not very accurate.

Summary of the invention

The present invention provides a network traffic classification method, device, electronic equipment and storage medium to overcome the low accuracy of network traffic classification results in the prior art, to effectively extract the key position information of each byte of a byte sequence, and to improve the classification accuracy of deep learning traffic classification models.

The present invention provides a network traffic classification method, including:

dividing a captured pcap file into flow sequences, each flow sequence consisting of multiple traffic data packets;

extracting the byte features of each traffic data packet from the flow sequence to obtain a byte sequence in units of flows;

position-encoding each byte in the byte sequence, and inputting the encoded byte sequence into a traffic classification network model to obtain the traffic classification result output by the model; wherein the traffic classification network model is obtained by training on samples in units of flows and the traffic classification results corresponding to those samples.

According to the network traffic classification method provided by the present invention, dividing the captured pcap file into flow sequences includes:

segmenting the data packet stream in the pcap file based on a five-tuple to obtain the flow sequences; the five-tuple includes: source IP address, source port, destination IP address, destination port, and protocol number.
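The five-tuple segmentation can be illustrated with a minimal Python sketch. It assumes the packets have already been parsed into dictionaries; the field names (`src_ip`, `src_port`, `dst_ip`, `dst_port`, `proto`, `ts`) and the function name `split_into_flows` are hypothetical and not taken from the patent. The sketch treats each direction of a connection as its own flow, which is also an assumption.

```python
from collections import defaultdict

def split_into_flows(packets):
    """Group parsed packets by five-tuple, keeping capture order within each flow."""
    flows = defaultdict(list)
    for pkt in packets:
        # The five-tuple named in the text: addresses, ports, protocol number.
        key = (pkt["src_ip"], pkt["src_port"],
               pkt["dst_ip"], pkt["dst_port"], pkt["proto"])
        flows[key].append(pkt)
    return dict(flows)

# Hypothetical parsed packets (protocol 6 = TCP, 17 = UDP).
packets = [
    {"src_ip": "10.0.0.1", "src_port": 5000, "dst_ip": "10.0.0.2",
     "dst_port": 443, "proto": 6, "ts": 0.0},
    {"src_ip": "10.0.0.1", "src_port": 5001, "dst_ip": "10.0.0.3",
     "dst_port": 53, "proto": 17, "ts": 0.1},
    {"src_ip": "10.0.0.1", "src_port": 5000, "dst_ip": "10.0.0.2",
     "dst_port": 443, "proto": 6, "ts": 0.2},
]
flows = split_into_flows(packets)
```

Here the first and third packets share a five-tuple and therefore end up in the same flow sequence.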

According to the network traffic classification method provided by the present invention, extracting the byte features of each traffic data packet from the flow sequence to obtain a byte sequence in units of flows includes:

extracting, based on preset rules, a preset number of byte features from each traffic data packet in the flow sequence to obtain a byte sequence in units of flows.

According to the network traffic classification method provided by the present invention, position-encoding each byte in the byte sequence includes:

position-encoding each byte in the byte sequence based on the following formulas, which convert the position of each byte in the data packet into a d-dimensional feature vector $P_{pos}$:

$$P(pos, 2i) = \sin\left(pos / m^{2i/d}\right)$$

$$P(pos, 2i+1) = \cos\left(pos / m^{2i/d}\right)$$

where $2i, 2i+1 \in [0, d-1]$ index the channels of the generated position code, and m is a constant chosen so that the position of each byte corresponds to a unique position code.
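As a rough illustration of these formulas, the following Python sketch computes the d-dimensional position code for one byte position. The function name `positional_encoding` is hypothetical; the default m = 10000 is the value used in the embodiment described later in this document. How the code is combined with the byte values before entering the model is not specified here, so the sketch stops at the position vector itself.

```python
import math

def positional_encoding(pos, d, m=10000.0):
    """Return the d-dimensional sinusoidal position code for byte position `pos`."""
    vec = [0.0] * d
    for k in range(0, d, 2):              # k plays the role of 2i
        angle = pos / (m ** (k / d))
        vec[k] = math.sin(angle)          # even channel 2i
        if k + 1 < d:
            vec[k + 1] = math.cos(angle)  # odd channel 2i+1
    return vec

# Position codes for the first three byte positions with d = 4 channels.
codes = [positional_encoding(pos, 4) for pos in range(3)]
```

Each sine/cosine pair shares one frequency, and the frequencies decrease geometrically across channel pairs.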

According to the network traffic classification method provided by the present invention, the traffic classification network model is composed of N autoencoder layers, N≥2, and the loss function of the traffic classification network model is:

where $h_{i-1}$ is the input layer of the i-th autoencoder, and N is the number of samples in units of flows.

The present invention also provides a network traffic classification device, including:

a first processing module, configured to divide a captured pcap file into flow sequences, each flow sequence consisting of multiple traffic data packets;

a second processing module, configured to extract the byte features of each traffic data packet from the flow sequence to obtain a byte sequence in units of flows;

a third processing module, configured to position-encode each byte in the byte sequence and input the encoded byte sequence into a traffic classification network model to obtain the traffic classification result output by the model; wherein the traffic classification network model is obtained by training on samples in units of flows and the traffic classification results corresponding to those samples.

According to the network traffic classification device provided by the present invention, the third processing module is specifically configured to perform the position encoding according to:

$$P(pos, 2i) = \sin\left(pos / m^{2i/d}\right)$$

$$P(pos, 2i+1) = \cos\left(pos / m^{2i/d}\right)$$

The present invention also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the program, it implements the steps of any of the network traffic classification methods described above.

The present invention also provides a non-transitory computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, it implements the steps of any of the network traffic classification methods described above.

The present invention also provides a computer program product, which includes a computer program. When the computer program is executed by a processor, it implements the steps of any of the network traffic classification methods described above.

The network traffic classification method, device, electronic equipment and storage medium provided by the present invention divide the captured pcap file into flow sequences and extract the byte features of each traffic data packet from the flow sequence to obtain a byte sequence in units of flows. Each byte in the byte sequence is then position-encoded, and the encoded byte sequence is input into the traffic classification network model to obtain the traffic classification result output by the model; wherein the traffic classification network model is obtained by training on samples in units of flows and the traffic classification results corresponding to those samples. The present invention extracts the byte features of each traffic data packet from the raw flow sequence and uses the resulting flow-level byte sequence as the input of the traffic classification network model. Compared with existing manual extraction of statistical flow features, this both reduces the scale of the model input data and fully exploits the temporal characteristics of the traffic data. In addition, the present invention position-encodes each byte in the byte sequence separately, which can effectively extract the key position information of each byte and improve the recognition accuracy of the traffic classification network model.

Description of the drawings

In order to explain the technical solutions of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.

Figure 1 is the first schematic flow chart of the network traffic classification method provided by the present invention;

Figure 2 is the second schematic flow chart of the network traffic classification method provided by the present invention;

Figure 3 is a schematic structural diagram of the network traffic classification device provided by the present invention;

Figure 4 is a schematic structural diagram of the electronic device provided by the present invention.

Detailed description

In order to make the purpose, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

As shown in Figure 1, the network traffic classification method provided by the present invention includes:

Step 101: Divide the captured pcap file into flow sequences, each flow sequence consisting of multiple traffic data packets;

In this step, the pcap file of the application is first captured on the link, and the data packet stream in the pcap file is segmented by five-tuple to obtain flow sequences, each composed of multiple data packets.

Step 102: Extract the byte features of each traffic data packet from the flow sequence to obtain a byte sequence in units of flows;

In this step, optionally, an initial value M = 40 is selected as the number of bytes taken from each data packet (because the headers are located at the front of the data packet, and the common TCP header is longer than the UDP header, giving 40 bytes). In addition, since the commonly used random port allocation and network address translation techniques may confuse the classification result, the present invention replaces the IP addresses and port numbers with zeros to avoid this effect. The byte processing result of a single data packet is therefore $D = \{d_1, \dots, d_i, \dots, d_M\}$, where $d_i$ is the i-th (i<M) byte of the data packet with a value in [0,255], which is normalized to the range [0,1]. For each flow-level sequence sample $f_i$, the byte sequence $F_{packet}$ is as follows, where N equals the number of data packets.

$$F_{packet} = \{D_1, D_2, D_3, \dots, D_N\}, \quad N = \text{packet length of } f_i$$
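The per-packet processing described in this step can be sketched as follows. The sketch assumes that each packet's raw bytes begin at an IPv4 header without options, so that the addresses occupy offsets 12 to 19 and the transport-layer ports offsets 20 to 23; these offsets, the zero-padding of packets shorter than M bytes, and the function names are assumptions for illustration and are not stated in the text.

```python
M = 40  # bytes kept per packet, the initial value suggested in the text

# Assumed layout: IPv4 header without options (20 bytes) followed by TCP/UDP.
# Source/destination addresses: offsets 12..19; source/destination ports: 20..23.
MASKED_OFFSETS = range(12, 24)

def packet_to_features(raw, m=M):
    """Keep the first m bytes, zero the address/port fields, scale to [0, 1]."""
    head = bytearray(raw[:m].ljust(m, b"\x00"))  # zero-padding is an assumption
    for off in MASKED_OFFSETS:
        if off < m:
            head[off] = 0
    return [b / 255.0 for b in head]

def flow_to_sequence(raw_packets):
    """Build F_packet = {D_1, ..., D_N} for one flow of N raw packets."""
    return [packet_to_features(p) for p in raw_packets]

example = bytes(range(1, 61))            # 60 dummy bytes with values 1..60
D = packet_to_features(example)
F_packet = flow_to_sequence([example, b"\xff" * 10])
```

Each flow then becomes a list of N vectors of length M, one per packet.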

Step 103: Position-encode each byte in the byte sequence, and input the encoded byte sequence into the traffic classification network model to obtain the traffic classification result output by the model; wherein the traffic classification network model is obtained by training on samples in units of flows and the traffic classification results corresponding to those samples.

In this step, it should be noted that a data packet contains an IP header, a transport layer header and a payload. Bytes at different positions often carry different meanings and influence one another. For example, the "version" field in the IP header determines whether the "source IP address" is 4 bytes (IPv4) or 16 bytes (IPv6) long. When applications exchange information through well-formed headers, positional information is therefore very important and cannot be ignored. Accordingly, when the present invention uses the traffic classification network model to analyze data packet features, it first position-encodes each byte of the data packet to improve the accuracy of subsequent traffic identification.

In this step, after the position encoding of each byte is completed, the encoded byte sequence is input into the traffic classification network model, which outputs the traffic classification result. The traffic classification network model consists of N autoencoder layers, N≥2, and can extract latent features automatically. An autoencoder is an unsupervised neural network model that learns hidden features of the input data, which is called encoding, and can reconstruct the original input data from the learned features, which is called decoding. A softmax classifier layer is connected after the last layer of the traffic classification network model to generate the traffic classification result.

The network traffic classification method provided by the present invention divides the captured pcap file into flow sequences and extracts the byte features of each traffic data packet from the flow sequence to obtain a byte sequence in units of flows. Each byte in the byte sequence is then position-encoded, and the encoded byte sequence is input into the traffic classification network model, which outputs the traffic classification result. The present invention extracts the byte features of each traffic data packet from the raw flow sequence and uses the resulting flow-level byte sequence as the input of the traffic classification network model. Compared with existing manual extraction of statistical flow features, this both reduces the scale of the model input data and fully exploits the temporal characteristics of the traffic data. In addition, the present invention position-encodes each byte in the byte sequence separately, which can effectively extract the key position information of each byte and improve the recognition accuracy of the traffic classification network model.

Based on the content of the above embodiment, in this embodiment, dividing the captured pcap file into flow sequences includes:

In this embodiment, the flow is used as the sole unit for classifying raw traffic, so that encrypted traffic is classified into specific applications. A raw flow can be represented as multiple sequences of the same length but different types (for example, a message type sequence and a packet length sequence). In the present invention, the raw traffic set P in the pcap file is divided into a set of subsets $F = \{f^1, \dots, f^i, \dots, f^m\}$, where m is the number of subsets into which the raw traffic is divided and $f^i$ denotes any one of the resulting sub-flows. The data packets in a sub-flow $f^i = (x^i, d^i, t^i)$ are arranged in chronological order, where $x^i$ is the five-tuple consisting of source IP address, source port, destination IP address, destination port and protocol number; $d^i$ is the total transmission duration of sub-flow $f^i$; and $t^i$ is the time at which the first data packet of sub-flow $f^i$ starts to be transmitted.
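The sub-flow description $f^i = (x^i, d^i, t^i)$ can be sketched in Python as below. The sketch assumes each packet carries a capture timestamp under a hypothetical `ts` key, and it takes the duration as the time between the first and last packet of the sub-flow, which is one plausible reading of "total transmission duration". The function name `build_flow_record` is illustrative only.

```python
def build_flow_record(five_tuple, packets):
    """Return (x, d, t, ordered_packets) for one sub-flow.

    x: the five-tuple, d: duration, t: start time of the first packet.
    """
    ordered = sorted(packets, key=lambda p: p["ts"])  # chronological order
    start = ordered[0]["ts"]
    duration = ordered[-1]["ts"] - start
    return five_tuple, duration, start, ordered

x = ("10.0.0.1", 5000, "10.0.0.2", 443, 6)            # hypothetical five-tuple
pkts = [{"ts": 3.5}, {"ts": 1.0}, {"ts": 2.0}]        # deliberately out of order
record = build_flow_record(x, pkts)
```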

Based on the content of the above embodiment, in this embodiment, extracting the byte features of each traffic data packet from the flow sequence to obtain a byte sequence in units of flows includes:

In this embodiment, optionally, an initial value M = 40 is selected as the number of bytes taken from each data packet (because the headers are located at the front of the data packet, and the common TCP header is longer than the UDP header, giving 40 bytes). In addition, since the commonly used random port allocation and network address translation techniques may confuse the classification result, the present invention replaces the IP addresses and port numbers with zeros to avoid this effect. The byte processing result of a single data packet is therefore $D = \{d_1, \dots, d_i, \dots, d_M\}$, where $d_i$ is the i-th (i<M) byte of the data packet with a value in [0,255], which is normalized to the range [0,1]. For each flow-level sequence sample $f_i$, the byte sequence is $F_{packet} = \{D_1, D_2, D_3, \dots, D_N\}$, where N equals the number of data packets.

It can be seen that the present invention takes the first 40 bytes of each data packet in the flow, including part of the packet header, as representative features. This both reduces the scale of the input data of the traffic classification network model and fully exploits the temporal characteristics of the traffic data.

Based on the content of the above embodiment, in this embodiment, position-encoding each byte in the byte sequence includes:

$$P(pos, 2i) = \sin\left(pos / m^{2i/d}\right)$$

$$P(pos, 2i+1) = \cos\left(pos / m^{2i/d}\right)$$

In this embodiment, optionally, trigonometric functions are used to encode the position, converting the position pos of each byte into a d-dimensional vector $P_{pos}$:

$$P(pos, 2i) = \sin\left(pos / 10000^{2i/d}\right)$$

$$P(pos, 2i+1) = \cos\left(pos / 10000^{2i/d}\right)$$

where $2i, 2i+1 \in [0, d-1]$ index the channels of the generated position code, and the constant 10000 ensures that each position corresponds to a unique position code.

It can be seen that the present invention position-encodes each byte in the byte sequence in order to fully exploit the key contextual information carried by each byte of the data packet byte sequence. The trigonometric encoding makes it easier for the traffic classification network model to learn to attend to the relative positions of bytes, while producing a unique code for each byte. Position encoding thus gives the traffic classification network model both the relative and the absolute position of each byte in the byte sequence.
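The claim about relative positions can be made concrete with a short derivation from the angle-addition identities; this is a standard property of sinusoidal encodings and is offered here as supporting reasoning rather than as part of the patent text. Writing $\omega_i = 1/m^{2i/d}$ (a shorthand introduced only for this derivation), the code at position pos + k is a fixed linear transformation of the code at position pos, and that transformation depends only on the offset k:

```latex
\begin{aligned}
P(pos+k,\,2i)   &= \sin(\omega_i\,pos)\cos(\omega_i k) + \cos(\omega_i\,pos)\sin(\omega_i k),\\
P(pos+k,\,2i+1) &= \cos(\omega_i\,pos)\cos(\omega_i k) - \sin(\omega_i\,pos)\sin(\omega_i k),
\end{aligned}
\qquad\text{i.e.}\qquad
\begin{pmatrix} P(pos+k,\,2i) \\ P(pos+k,\,2i+1) \end{pmatrix}
=
\begin{pmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{pmatrix}
\begin{pmatrix} P(pos,\,2i) \\ P(pos,\,2i+1) \end{pmatrix}.
```

Because the matrix does not depend on pos, a model that sees two position codes can in principle recover how far apart the corresponding bytes are.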

Based on the content of the above embodiment, in this embodiment, the traffic classification network model is composed of N autoencoder layers, N≥2, and the loss function of the traffic classification network model is:

In this embodiment, after the position encoding of each byte is completed, the encoded byte sequence is input into the traffic classification network model, which outputs the traffic classification result. The traffic classification network model consists of N autoencoder layers, N≥2, and can extract latent features automatically. An autoencoder is an unsupervised neural network model that learns hidden features of the input data, which is called encoding, and can reconstruct the original input data from the learned features, which is called decoding. A softmax classifier layer is connected after the last layer of the traffic classification network model to generate the traffic classification result. By stacking multiple encoder layers, the present invention extracts higher-level information and further improves classifier performance.

Specifically, for a single encoder, the input layer is $h_{i-1}$, where $d_{i-1}$ is the dimension of the input layer, and the hidden layer is $h_i$, where $d_i$ is the dimension of the hidden layer. The encoding process maps the input layer $h_{i-1}$ to the hidden layer $h_i$ according to the following formula, and the decoding process maps the hidden layer $h_i$ to the output layer, which reconstructs $h_{i-1}$:

$$h_i = f(W_{i,1} h_{i-1} + b_{i,1})$$

where $W_{i,1}$ ($d_i \times d_{i-1}$) and $W_{i,2}$ ($d_{i-1} \times d_i$) are the weight matrices of the encoder and decoder, each with its own bias vector, and the activation functions of the encoder and decoder are usually the sigmoid function.
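A single autoencoder layer of this kind can be sketched in plain Python as follows. The encoder follows the formula given above with a sigmoid activation; the decoder is assumed to have the symmetric form (decoder weights applied to the hidden layer, plus a bias, followed by a sigmoid), which matches the stated matrix shapes but is not spelled out in the text. The names `W1`, `b1`, `W2`, `b2` and the random initial weights are illustrative only.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, v):
    """Compute W v + b, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]

def encode(W1, b1, h_prev):
    """Map the input layer (dimension d_prev) to the hidden layer (dimension d_hidden)."""
    return [sigmoid(z) for z in affine(W1, b1, h_prev)]

def decode(W2, b2, h):
    """Map the hidden layer back to a reconstruction of the input layer (assumed form)."""
    return [sigmoid(z) for z in affine(W2, b2, h)]

rng = random.Random(0)
d_prev, d_hidden = 4, 2
W1 = [[rng.uniform(-1, 1) for _ in range(d_prev)] for _ in range(d_hidden)]   # d_hidden x d_prev
W2 = [[rng.uniform(-1, 1) for _ in range(d_hidden)] for _ in range(d_prev)]   # d_prev x d_hidden
b1, b2 = [0.0] * d_hidden, [0.0] * d_prev

h_prev = [0.1, 0.5, 0.9, 0.0]
h = encode(W1, b1, h_prev)
reconstruction = decode(W2, b2, h)
```

With untrained weights the reconstruction is of course poor; training would adjust the weights and biases to bring it close to `h_prev`.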

In general, the i-th autoencoder attempts to reconstruct its input $h_{i-1}$ so that the reconstruction is as similar as possible to $h_{i-1}$. The goal of the traffic classification network model is therefore to make the reconstruction error as small as possible, and the loss function is as follows:
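One common reconstruction loss that fits this description is the mean squared error over the N flow-level samples. It is given here as an assumption about the general form, not as the patent's exact expression; the hat notation for the reconstruction and the sample index n are introduced only for this sketch:

```latex
L \;=\; \frac{1}{N} \sum_{n=1}^{N} \left\lVert h_{i-1}^{(n)} - \hat{h}_{i-1}^{(n)} \right\rVert^{2},
\qquad \hat{h}_{i-1}^{(n)} \text{ denoting the reconstruction of } h_{i-1}^{(n)}.
```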

For the stacked autoencoder, samples are fed as raw data into the first encoder, and the resulting encoded features serve in turn as the input of the next encoder. The stacked encoders are trained according to the loss function and ultimately produce more abstract features. For the application identification task, a softmax classifier layer is connected after the last layer of the proposed traffic classification network model SAE (Stacked AutoEncoder) to generate the traffic classification result.

It can be seen that, to address the classification of encrypted network traffic in the modern Internet environment, the present invention provides a method for classifying network traffic packet features based on improved position encoding. A partial byte sequence is taken from each data packet of the flow sequence, the processed byte sequence is position-encoded with trigonometric functions, and the traffic classification network model SAE extracts representative features of the traffic data to improve the accuracy of the classification model. By position-encoding the byte sequence, the present invention can effectively extract the key position information of each byte and significantly improve the accuracy of the deep learning traffic classification model.

下面通过具体实施例进行说明：The following is explained through specific examples:

实施例一：Example 1:

在本实施例中，需要说明的是，准确的流量分类已成为高级网络管理任务的先决条件之一，例如提供适当的服务质量QoS(Quality of Service)、异常检测、流量定价等。同时，用户隐私和数据加密的日益增长需求极大地增加了当今互联网中的加密流量。加密程序将原始数据转换为类似伪随机的格式，目的是使其难以解密。结果导致加密数据几乎不包含任何用于识别网络流量的判别模式。因此，加密流量的准确分类已成为现代网络中的真正挑战。另外，现有的网络流量分类方法，例如有效载荷检查以及基于机器学习和基于统计的方法，都需要专家提取模式或特征，此过程容易出错、耗时且成本高昂。最后，许多互联网服务提供商由于其高带宽消耗和版权问题而阻止文件共享应用程序。为了规避这个问题，这些应用程序使用协议嵌入和混淆技术来绕过流量控制系统，因此，识别此类应用程序是网络流量分类中最具挑战性的任务之一。In this embodiment, it should be noted that accurate traffic classification has become one of the prerequisites for advanced network management tasks, such as providing appropriate QoS (Quality of Service), anomaly detection, traffic pricing, etc. At the same time, the growing need for user privacy and data encryption has greatly increased the amount of encrypted traffic in today's Internet. Encryption programs convert raw data into a pseudo-random-like format with the goal of making it difficult to decrypt. The result is that encrypted data contains almost no discriminative patterns for identifying network traffic. Therefore, accurate classification of encrypted traffic has become a real challenge in modern networks. Additionally, existing network traffic classification methods, such as payload inspection as well as machine learning-based and statistics-based methods, require experts to extract patterns or features, a process that is error-prone, time-consuming, and costly. Finally, many Internet service providers block file sharing applications due to their high bandwidth consumption and copyright issues. To circumvent this problem, these applications use protocol embedding and obfuscation techniques to bypass traffic control systems, therefore identifying such applications is one of the most challenging tasks in network traffic classification.

上述现有的网络流量分类方法缺陷如下：The above-mentioned existing network traffic classification methods have the following shortcomings:

(1)基于负载数据报文的字节流的字节分布特征进行应用程序流量分类，生成字节分布特征的过程依赖大量的专家经验，可能会产生偏差导致分类结果的差异。(1) Classify application traffic based on the byte distribution characteristics of the byte stream of load data packets. The process of generating byte distribution characteristics relies on a large amount of expert experience, which may cause deviations and lead to differences in classification results.

(2)将数据包荷载数据转化成字节序列，字节序列后续再输入到一维神经网络进行训练提取特征，并基于此进行流量分类。这种方法没有采用包头中的信息，事实上包头中包含了大量有用信息。同时现有技术是基于数据包进行分类的，分类单位的尺度较小，不适用于流级别的流量分类。(2) Convert the packet payload data into a byte sequence. The byte sequence is then input into the one-dimensional neural network for training and feature extraction, and traffic classification is performed based on this. This method does not use the information in the header, which actually contains a lot of useful information. At the same time, the existing technology is classified based on data packets, and the scale of the classification unit is small, which is not suitable for flow-level traffic classification.

随着网络技术和加密技术的飞速发展，网络安全问题越来越受到大众的关注，网络加密流量的规模不断增加，给网络流量分类带来了巨大挑战。将机器学习算法与人工设计相结合已经成为解决这一问题的主流方法，但它需要大量的人力来提取和处理特征，这在很大程度上依赖于专业经验。然而，机器学习方法的成功诉诸于手工设计的特征的质量。当网络流量环境向快节奏的移动流量演变时，这样的过程是不切实际的，因为它既不能自动化也不能实现高度专业化。近年来深度学习方法被应用在加密流量分类领域以实现高性能的分类器，深度学习流量分类方法允许通过自动提取结构化和复杂的特征表示直接从输入数据训练分类器，这相对于传统机器学习方法有极大的优势，但深度学习方法仍有一些问题，如何从原始流量数据中提取特征，用深度学习模型进行精确的应用程序分类是相关领域研究者面临的一大问题。With the rapid development of network technology and encryption technology, network security issues have attracted more and more public attention. The scale of network encrypted traffic continues to increase, which brings huge challenges to network traffic classification. Combining machine learning algorithms with human design has become a mainstream method to solve this problem, but it requires a lot of manpower to extract and process features, which relies heavily on professional experience. However, the success of machine learning methods relies on the quality of hand-designed features. As the network traffic environment evolves toward fast-paced mobile traffic, such a process is impractical because it can neither be automated nor highly specialized. In recent years, deep learning methods have been applied in the field of encrypted traffic classification to achieve high-performance classifiers. Deep learning traffic classification methods allow training classifiers directly from input data by automatically extracting structured and complex feature representations, which is compared to traditional machine learning. The method has great advantages, but the deep learning method still has some problems. How to extract features from the original traffic data and use the deep learning model to accurately classify applications is a major problem faced by researchers in related fields.

In the present invention, the raw flow is taken as the basic classification unit, the packet features of the flow are extracted, and position encoding is used to transform the packet byte sequence so that the position information of the bytes is retained while packet features are extracted. The present invention uses a stacked autoencoder (SAE) as the classification model to extract higher-level flow and packet features, which makes application classification more accurate.

As shown in Figure 2, a network traffic classification method provided by the present invention includes:

Step 201: Extract a representative byte sequence for each traffic data packet from the flow sequence;

In this step, network traffic monitoring software is first used to capture, continuously and in real time, the traffic packets generated while the application is in use and to store them in internal or external memory. The raw traffic is then converted into flow sequences by splitting the raw packet stream according to source IP address, source port, destination IP address, destination port and protocol number, and the first 40 bytes of each packet in the flow sequence are extracted and normalized.
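The splitting and extraction described in this step can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Packet` record and its field names are hypothetical, and zero-padding short packets and dividing by 255 are assumptions, since the text only says that the first 40 bytes are normalized.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple

BYTES_PER_PACKET = 40  # the text takes the first 40 bytes of each packet


class Packet(NamedTuple):
    """Hypothetical parsed-packet record; the field names are assumptions."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: int
    raw: bytes  # packet bytes, assumed to start at the IP header


FlowKey = Tuple[str, int, str, int, int]


def split_into_flows(packets: List[Packet]) -> Dict[FlowKey, List[Packet]]:
    """Group packets by five-tuple, preserving capture order within each flow."""
    flows: Dict[FlowKey, List[Packet]] = defaultdict(list)
    for pkt in packets:
        key = (pkt.src_ip, pkt.src_port, pkt.dst_ip, pkt.dst_port, pkt.protocol)
        flows[key].append(pkt)
    return dict(flows)


def packet_features(raw: bytes, n: int = BYTES_PER_PACKET) -> List[float]:
    """Take the first n bytes, zero-pad short packets, and scale to [0, 1]."""
    head = raw[:n].ljust(n, b"\x00")  # padding is an assumption
    return [b / 255.0 for b in head]  # division by 255 is an assumption


def flow_byte_sequence(flow: List[Packet]) -> List[List[float]]:
    """Return one normalized row of 40 values per packet in the flow."""
    return [packet_features(pkt.raw) for pkt in flow]
```

Keying on the exact five-tuple treats the two directions of a connection as separate flows; whether the method merges them is not stated in the text.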

Step 202: Apply position encoding to each byte in the byte sequence, and analyze the byte sequence with a stacked autoencoder (SAE);

In this step, a traffic classification network model based on position encoding and an SAE is built. Each byte of the data packet is first position-encoded, and the encoded byte sequence is then input into the SAE encoder for analysis. The SAE architecture used in the present invention consists of three fully connected layers stacked on one another, with 64, 32 and 16 neurons respectively. To prevent overfitting, dropout with a rate of 0.05 is applied after each layer.
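The layer stack described above might look like the following forward pass. Only the layer widths (64, 32, 16) and the dropout rate (0.05) come from the text; the ReLU activation, the weight initialisation, and the input dimension used in the usage note are assumptions made for illustration, and training is not shown.

```python
import math
import random
from typing import List, Tuple

LAYER_SIZES = [64, 32, 16]  # neurons per stacked layer, from the text
DROPOUT_RATE = 0.05         # dropout rate after each layer, from the text

Layer = Tuple[List[List[float]], List[float]]  # (weights, biases)


def init_layer(n_in: int, n_out: int, rng: random.Random) -> Layer:
    """Small uniform random weights and zero biases (an assumed initialisation)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense_relu(x: List[float], weights: List[List[float]], biases: List[float]) -> List[float]:
    """Fully connected layer followed by ReLU (the activation is an assumption)."""
    return [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def dropout(x: List[float], rate: float, rng: random.Random, training: bool) -> List[float]:
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    if not training:
        return list(x)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in x]


def build_encoder(input_dim: int, rng: random.Random) -> List[Layer]:
    """Stack three fully connected layers of widths 64, 32 and 16."""
    layers, n_in = [], input_dim
    for n_out in LAYER_SIZES:
        layers.append(init_layer(n_in, n_out, rng))
        n_in = n_out
    return layers


def encode(x: List[float], layers: List[Layer], rng: random.Random,
           training: bool = False) -> List[float]:
    """Run the input through every layer, applying dropout after each one."""
    for weights, biases in layers:
        x = dropout(dense_relu(x, weights, biases), DROPOUT_RATE, rng, training)
    return x
```

For example, `encode([0.5] * 40, build_encoder(40, random.Random(0)), random.Random(0))` yields a 16-value code for one 40-value input row.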

Step 203: Output the traffic application classification result from the trained SAE model.

In this step, the stacked autoencoder (SAE) is trained, and a softmax layer is attached to the end of the model to output the classification result. Some flow classification results are shown in Table 1 below.
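As a reminder of what the final softmax layer computes, the sketch below turns a vector of class scores into probabilities and picks the most likely class. The function names and the application labels in the usage note are hypothetical; the text does not list the classes.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits: Sequence[float], labels: Sequence[str]) -> str:
    """Return the label whose softmax probability is highest."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best]
```

Calling `predict([0.1, 2.0, -1.0], ["app_a", "app_b", "app_c"])` selects the second label, because it has the largest score.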

Table 1

The traffic classification device provided by the present invention is described below. The device described below and the traffic classification method described above may be read with reference to each other.

As shown in Figure 3, the network traffic classification device provided by the present invention includes:

a first processing module 1, used to split the captured pcap file into a flow sequence, where the flow sequence is composed of multiple traffic data packets;

a second processing module 2, used to extract the byte features of each traffic data packet from the flow sequence to obtain a byte sequence in units of flows;

a third processing module 3, used to perform position encoding on each byte in the byte sequence and to input the encoded byte sequence into the traffic classification network model to obtain the traffic classification result output by the model; wherein the traffic classification network model is obtained by training on samples in units of flows and the traffic classification results corresponding to those samples.

In this embodiment, the pcap file of the application is first captured on the link connection, and the packet stream in the pcap file is split by five-tuple to obtain a flow sequence composed of multiple data packets.
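Before packets can be grouped by five-tuple, the capture has to be read record by record. The sketch below walks the records of a classic pcap file held in memory, using only the standard library. It is an illustration under stated assumptions: it handles the microsecond-timestamp pcap format only (not pcapng), the function name is hypothetical, and decoding the link-layer, IP and transport headers to obtain the five-tuple is left out.

```python
import struct
from typing import Iterator, Tuple

PCAP_MAGIC = 0xA1B2C3D4   # classic pcap with microsecond timestamps
GLOBAL_HEADER_LEN = 24    # magic, version, thiszone, sigfigs, snaplen, linktype
RECORD_HEADER_LEN = 16    # ts_sec, ts_usec, incl_len, orig_len


def iter_pcap_records(data: bytes) -> Iterator[Tuple[float, bytes]]:
    """Yield (timestamp_seconds, captured_bytes) for each record in a pcap blob."""
    if len(data) < GLOBAL_HEADER_LEN:
        raise ValueError("truncated pcap global header")
    # The magic number reveals the byte order the capture was written in.
    if struct.unpack("<I", data[:4])[0] == PCAP_MAGIC:
        endian = "<"
    elif struct.unpack(">I", data[:4])[0] == PCAP_MAGIC:
        endian = ">"
    else:
        raise ValueError("not a classic microsecond pcap file")
    offset = GLOBAL_HEADER_LEN
    while offset + RECORD_HEADER_LEN <= len(data):
        ts_sec, ts_usec, incl_len, _orig_len = struct.unpack(
            endian + "IIII", data[offset:offset + RECORD_HEADER_LEN])
        offset += RECORD_HEADER_LEN
        payload = data[offset:offset + incl_len]
        if len(payload) < incl_len:
            break  # the final record is truncated; stop reading
        offset += incl_len
        yield ts_sec + ts_usec / 1e6, payload
```

In practice a packet library would normally do this parsing; the hand-written reader is shown only to make the record layout explicit.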

In this embodiment, it should be noted that a data packet contains an IP header, a transport-layer header and a payload. Bytes at different positions usually carry different meanings and influence one another. For example, the "version" field in the IP header determines whether the "source IP address" is 4 bytes (IPv4) or 16 bytes (IPv6) long. Consequently, when applications exchange information through well-formed headers, the positional information of each byte is very important and cannot be ignored. For this reason, when the present invention analyzes packet features with the traffic classification network model, it first position-encodes each byte of the data packet to improve the accuracy of subsequent traffic identification.
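A minimal sketch of such a per-byte position code is given below, assuming the sinusoidal form recited in claim 1, where even channels use a sine and odd channels a cosine of pos / m^(2i/d). The default m = 10000 and the dimension d used in the usage note are assumptions for illustration; the text only requires m to be a constant that gives every byte position a unique code. How the code is combined with the byte value is not specified here and is not shown.

```python
import math
from typing import List


def position_encoding(pos: int, d: int, m: float = 10000.0) -> List[float]:
    """Return the d-dimensional sinusoidal code for byte position `pos`."""
    vec = [0.0] * d
    for channel in range(d):
        i = channel // 2  # channels 2i and 2i+1 share one frequency
        angle = pos / (m ** (2 * i / d))
        # Even channels take the sine, odd channels the cosine.
        vec[channel] = math.sin(angle) if channel % 2 == 0 else math.cos(angle)
    return vec


def encode_packet_positions(n_bytes: int, d: int, m: float = 10000.0) -> List[List[float]]:
    """Return one position code per byte offset 0 .. n_bytes - 1."""
    return [position_encoding(pos, d, m) for pos in range(n_bytes)]
```

For instance, `encode_packet_positions(40, 8)` gives 40 codes of 8 values each, one for every byte offset of a 40-byte packet prefix.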

In this embodiment, after every byte has been position-encoded, the encoded byte sequence is input into the traffic classification network model, which outputs the traffic classification result. The traffic classification network model consists of N autoencoder layers, N ≥ 2, and can extract latent features automatically. An autoencoder is an unsupervised neural network model that learns the latent features of its input data, which is called encoding, and can reconstruct the original input from the learned features, which is called decoding. A softmax classifier layer is attached to the last layer of the traffic classification network model to generate the traffic classification result.

The network traffic classification device provided by the present invention splits the captured pcap file into flow sequences and extracts the byte features of each traffic data packet from a flow sequence to obtain a byte sequence in units of flows. Each byte in the byte sequence is then position-encoded, and the encoded byte sequence is input into the traffic classification network model, which outputs the traffic classification result. Because the byte features of each packet are extracted from the raw flow sequence and the resulting flow-level byte sequence serves as the model input, the present invention, compared with existing approaches that manually extract statistical flow features, both reduces the size of the model input and fully exploits the temporal features of the traffic data. In addition, the present invention position-encodes each byte in the byte sequence separately, which effectively captures the key position information of each byte and improves the recognition accuracy of the traffic classification network model.

Figure 4 is a schematic diagram of the physical structure of an electronic device. As shown in Figure 4, the electronic device may include a processor 410, a communications interface 420, a memory 430 and a communication bus 440, where the processor 410, the communications interface 420 and the memory 430 communicate with one another over the communication bus 440. The processor 410 can call logic instructions in the memory 430 to perform the network traffic classification method, which includes: splitting the captured pcap file into a flow sequence, where the flow sequence is composed of multiple traffic data packets; extracting the byte features of each traffic data packet from the flow sequence to obtain a byte sequence in units of flows; and performing position encoding on each byte in the byte sequence and inputting the encoded byte sequence into the traffic classification network model to obtain the traffic classification result output by the model; wherein the traffic classification network model is obtained by training on samples in units of flows and the traffic classification results corresponding to those samples.

In addition, the logic instructions in the memory 430 described above may be implemented in the form of software functional units and, when sold or used as an independent product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of the present invention. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk or an optical disc.

In another aspect, the present invention further provides a computer program product. The computer program product includes a computer program that can be stored on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, the computer can perform the network traffic classification method provided above, which includes: splitting the captured pcap file into a flow sequence, where the flow sequence is composed of multiple traffic data packets; extracting the byte features of each traffic data packet from the flow sequence to obtain a byte sequence in units of flows; and performing position encoding on each byte in the byte sequence and inputting the encoded byte sequence into the traffic classification network model to obtain the traffic classification result output by the model; wherein the traffic classification network model is obtained by training on samples in units of flows and the traffic classification results corresponding to those samples.

In yet another aspect, the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program performs the network traffic classification method provided above, which includes: splitting the captured pcap file into a flow sequence, where the flow sequence is composed of multiple traffic data packets; extracting the byte features of each traffic data packet from the flow sequence to obtain a byte sequence in units of flows; and performing position encoding on each byte in the byte sequence and inputting the encoded byte sequence into the traffic classification network model to obtain the traffic classification result output by the model; wherein the traffic classification network model is obtained by training on samples in units of flows and the traffic classification results corresponding to those samples.

The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. A person of ordinary skill in the art can understand and implement it without creative effort.

From the description of the embodiments above, those skilled in the art can clearly understand that each embodiment can be implemented by software together with the necessary general-purpose hardware platform, or alternatively by hardware. Based on this understanding, the above technical solution in essence, or the part of it that contributes to the prior art, may be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk or an optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments or in certain parts of the embodiments.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in those embodiments may still be modified, or some of their technical features replaced with equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method for classifying network traffic, comprising:

capturing a pcap file of an application program on a link connection, segmenting the data packet stream in the pcap file according to a five-tuple, and taking the original flow as the basic classification unit to obtain a flow sequence, wherein the flow sequence consists of a plurality of traffic data packets;

extracting byte features of each traffic data packet from the flow sequence to obtain a byte sequence in units of flows;

performing position encoding on each byte in the byte sequence, and inputting the encoded byte sequence into a traffic classification network model to obtain a traffic classification result output by the traffic classification network model; wherein the traffic classification network model is obtained by training on samples in units of flows and the traffic classification results corresponding to the samples;

wherein performing position encoding on each byte in the byte sequence comprises:

performing position encoding on each byte in the byte sequence based on the following formulas, so that the position of each byte in a data packet is converted into a d-dimensional feature vector $P_{pos}$:

$P_{(pos,\,2i)} = \sin\left(pos / m^{2i/d}\right)$

$P_{(pos,\,2i+1)} = \cos\left(pos / m^{2i/d}\right)$

wherein $2i, 2i+1 \in [0, d-1]$ index the channels of the generated position code, and m is a constant chosen so that the position of each byte corresponds to a unique position code.

2. The method of classifying network traffic according to claim 1, wherein,

the five-tuple comprises: source IP address, source port, destination IP address, destination port, and protocol number.

3. The network traffic classification method according to claim 1, wherein extracting byte features of each traffic data packet from the flow sequence to obtain a byte sequence in units of flows comprises:

extracting a preset number of byte features of each traffic data packet from the flow sequence based on a preset rule to obtain the byte sequence in units of flows.

4. The network traffic classification method according to claim 1, wherein the traffic classification network model is composed of N autoencoder layers, N ≥ 2, and the loss function of the traffic classification network model is:

wherein $h_{i-1}$ is the input layer of the i-th autoencoder, and N is the number of samples in units of flows.

5. A network traffic classification device, comprising:

a first processing module, used for capturing a pcap file of an application program on a link connection, segmenting the data packet stream in the pcap file according to a five-tuple, and taking the original flow as the basic classification unit to obtain a flow sequence, wherein the flow sequence consists of a plurality of traffic data packets;

a second processing module, used for extracting byte features of each traffic data packet from the flow sequence to obtain a byte sequence in units of flows;

a third processing module, used for performing position encoding on each byte in the byte sequence and inputting the encoded byte sequence into a traffic classification network model to obtain a traffic classification result output by the traffic classification network model; wherein the traffic classification network model is obtained by training on samples in units of flows and the traffic classification results corresponding to the samples; the third processing module is specifically configured to perform position encoding on each byte in the byte sequence based on the following formulas, so that the position of each byte in a data packet is converted into a d-dimensional feature vector $P_{pos}$:

$P_{(pos,\,2i)} = \sin\left(pos / m^{2i/d}\right)$

$P_{(pos,\,2i+1)} = \cos\left(pos / m^{2i/d}\right)$

6. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the network traffic classification method according to any of claims 1 to 4 when the program is executed.

7. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor, implements the steps of the network traffic classification method according to any of claims 1 to 4.