CN111723846B - Encrypted and compressed traffic identification method and device based on random characteristics - Google Patents
- Publication number
- CN111723846B (application CN202010432177.4A)
- Authority
- CN
- China
- Prior art keywords
- data
- flow
- ecf
- traffic
- machine learning
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/24323—Tree-organised classifiers
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/08—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/50—Testing arrangements
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1408—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Signal Processing (AREA)
- Artificial Intelligence (AREA)
- Computer Networks & Wireless Communication (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Computer Security & Cryptography (AREA)
- Computing Systems (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Software Systems (AREA)
- Computer Hardware Design (AREA)
- Environmental & Geological Engineering (AREA)
- Medical Informatics (AREA)
- Mathematical Physics (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
Abstract
The invention belongs to the technical field of network traffic data classification and discloses a method for identifying encrypted and compressed traffic based on randomness features. The method includes: collecting network data and parsing it to obtain traffic data; computing the randomness feature vector of the traffic data, the ECF feature vector, which includes chi-square, Renyi cross entropy, single-bit (monobit) frequency, frequency within a block, runs, longest run, Fourier transform, non-overlapping template matching, serial, and cumulative sums; and taking the ECF feature vector as input to a machine learning model for identification, where the identification results include encrypted traffic and compressed traffic. The invention also discloses a device for identifying encrypted and compressed traffic based on randomness features. By combining machine learning with an effective randomness feature vector (ECF), the invention can still identify encrypted and compressed traffic with relatively high accuracy when only part of the data is available or the amount of data is small.
Description
Technical field
The invention belongs to the technical field of network traffic data classification, and in particular relates to a method and device for identifying encrypted and compressed traffic based on randomness features.
Background
Traffic identification is a basic task of network management and has long been a research focus for network administrators. With the spread and widespread use of encryption technologies such as HTTPS, SSL, and VPN, identifying encrypted traffic has become the main task of traffic identification. In addition, compression algorithms are easy to implement, require no key exchange, produce highly random output, and yield less data than the plaintext. As a result, a large amount of malicious activity uses compressed transmission to evade traffic supervision and behavior detection and to conceal the actual content of its communication. Meanwhile, wireless communication technology has developed rapidly in recent years and wireless network traffic has grown extremely fast. With the spread of 5G networks and the large-scale deployment of Internet of Things (IoT) devices, wireless networks also carry large amounts of compressed and encrypted data, which raises the question of how to detect whether a wireless communication uses secure, confidential transmission. In summary, network management, device security inspection, and network security inspection all need effective methods for identifying encrypted and compressed data.
Current methods for identifying compressed and encrypted data mainly include signature field detection, exhaustive decompression, and statistical characteristic analysis. Signature field detection distinguishes encrypted or compressed traffic by recognizing special file signatures (Malhotra P. Detection of encrypted streams for egress monitoring [J]. Dissertations & Theses - Gradworks, 2007.). The signature of a compressed file is usually located in the file header or trailer, but compressed data sent over a network is typically split into many packets, of which often only one or two contain the signature. If only a few packets of a flow can be captured, this method cannot determine the nature of the traffic (encrypted or compressed). The second type of method identifies compressed data by exhaustively attempting decompression (Conte T M, Wolfe A. Techniques for detecting encrypted data: U.S. Patent 8,799,671 [P]. 2014-8-5.). Because of the data dependencies in compression algorithms, decompression requires the complete compressed file. For a compressed file transmitted over a network, the detector must therefore capture all of the transmitted packets and reassemble them in the correct order. In an era of rapidly growing network traffic, limits on processing speed and storage make it difficult to capture and monitor every packet on a network, so methods that depend on obtaining the complete compressed file are very hard to apply in practice. The third type of method identifies encrypted or compressed data by analyzing its overall statistical behavior. This approach needs only part of the packet payloads in a communication, so it is practical for real network traffic inspection. However, its accuracy depends heavily on the length of the captured payload, and accuracy is generally low when the data is short.
Summary of the invention
To address the low accuracy of existing methods for identifying encrypted and compressed traffic, the invention proposes a method and device for identifying encrypted and compressed traffic based on randomness features, which improves identification accuracy.
To achieve the above object, the invention adopts the following technical solutions:
A method for identifying encrypted and compressed traffic based on randomness features, including:
Step 1: Collect network data and parse it to obtain traffic data;
Step 2: Compute the randomness feature vector (ECF feature vector) of the traffic data; the ECF feature vector includes: chi-square, Renyi cross entropy, single-bit (monobit) frequency, frequency within a block, runs, longest run, Fourier transform, non-overlapping template matching, serial, and cumulative sums;
Step 3: Take the ECF feature vector as input and identify the traffic with a machine learning model; the identification results include encrypted traffic and compressed traffic.
Further, Step 1 includes:
Step 1.1: Capture packets from the external network and save them as a pcap file;
Step 1.2: Divide the captured packets into network flows by five-tuple and save them as flow files;
Step 1.3: Parse each flow file according to the TCP/IP protocol format, extract the data payloads, and concatenate them in packet capture order into one variable-length piece of traffic data.
Further, Step 2 includes:
Step 2.1: Obtain the length Len of the traffic data, counted in bytes;
Step 2.2: Compute the ECF feature vector of the traffic data.
Further, before Step 3 the method also includes:
Building a machine learning model based on the ECF feature vector; the machine learning algorithms used by the model include random forest, XGBoost, and MLP.
Further, Step 3 includes:
Step 3.1: Select an already trained machine learning model according to the length Len of the traffic data and the expected test accuracy;
Step 3.2: Input the ECF feature vector of the traffic data into the machine learning model to obtain the identification result; the identification results include encrypted traffic and compressed traffic.
A device for identifying encrypted and compressed traffic based on randomness features, including:
A collection and parsing module, used to collect network data and parse it to obtain traffic data;
A feature extraction module, used to compute the randomness feature vector (ECF feature vector) of the traffic data; the ECF feature vector includes: chi-square, Renyi cross entropy, single-bit (monobit) frequency, frequency within a block, runs, longest run, Fourier transform, non-overlapping template matching, serial, and cumulative sums;
A traffic identification module, used to take the ECF feature vector as input and identify the traffic with a machine learning model; the identification results include encrypted traffic and compressed traffic.
Further, the collection and parsing module includes:
A packet capture submodule, used to capture packets from the external network and save them as a pcap file;
A flow division submodule, used to divide the captured packets into network flows by five-tuple and save them as flow files;
A parsing submodule, used to parse each flow file according to the TCP/IP protocol format, extract the data payloads, and concatenate them in packet capture order into one variable-length piece of traffic data.
Further, the feature extraction module includes:
A traffic data length submodule, used to obtain the length Len of the traffic data, counted in bytes;
A feature extraction submodule, used to compute the ECF feature vector of the traffic data.
Further, the device also includes:
A traffic identification model building module, used to build a machine learning model based on the ECF feature vector; the machine learning algorithms used by the model include random forest, XGBoost, and MLP.
Further, the traffic identification module includes:
A model selection submodule, used to select an already trained machine learning model according to the length Len of the traffic data and the expected test accuracy;
A traffic identification submodule, used to input the ECF feature vector of the traffic data into the machine learning model to obtain the identification result; the identification results include encrypted traffic and compressed traffic.
Compared with the prior art, the invention has the following beneficial effects:
Using the randomness feature ECF proposed by the invention, an ECF feature vector is extracted from the traffic data and input into a pre-trained machine learning model, which identifies the type of the traffic, including encrypted traffic and compressed traffic. Traditional traffic identification methods based on statistical features select few features, rely on a single decision method, and identify encrypted and compressed traffic with low accuracy. The method provided by the invention exploits the ability of machine learning algorithms to learn from large amounts of data and designs an effective machine learning identification model. It overcomes the above shortcomings, improves the identification accuracy for encrypted and compressed traffic, and provides technical support for fine-grained identification of network traffic in network management. In particular, it can still identify encrypted and compressed traffic with relatively high accuracy when only part of the data is available or the amount of data is small.
Description of the drawings
Figure 1 is the first flow chart of a method for identifying encrypted and compressed traffic based on randomness features according to an embodiment of the invention;
Figure 2 is a graph comparing the accuracy of the invention with existing related algorithms;
Figure 3 shows the generalization performance of the machine learning model of the invention across different data sets with a data length of 1KB;
Figure 4 shows the generalization performance of the model of the invention when the training data set is D_1KB;
Figure 5 shows the generalization performance of the model of the invention when the training data set is D_64KB;
Figure 6 shows the effect of the amount of training data on accuracy for the D_1KB data set;
Figure 7 is the second flow chart of a method for identifying encrypted and compressed traffic based on randomness features according to an embodiment of the invention;
Figure 8 is a schematic diagram of the architecture of a device for identifying encrypted and compressed traffic based on randomness features according to an embodiment of the invention.
Detailed description
The invention is further explained below with reference to the drawings and specific embodiments:
As shown in Figure 1, a method for identifying encrypted and compressed traffic based on randomness features includes:
Step S101: Collect network data and parse it to obtain traffic data;
Step S102: Compute the randomness feature vector of the traffic data, the ECF (Encrypted and Compressed data Feature) feature vector; the ECF feature vector includes: chi-square, Renyi cross entropy, single-bit (monobit) frequency, frequency within a block, runs, longest run, Fourier transform, non-overlapping template matching, serial, and cumulative sums;
Step S103: Take the ECF feature vector as input and identify the traffic with a machine learning model; the identification results include encrypted traffic and compressed traffic.
Further, step S101 includes:
Step S101.1: Capture packets from the external network and save them as a pcap file;
Step S101.2: Divide the captured packets into network flows by five-tuple (source IP address, destination IP address, source port, destination port, protocol number) and save them as flow files;
Step S101.3: Parse each flow file according to the TCP/IP protocol format, extract the data payloads, and concatenate them in packet capture order into one variable-length piece of traffic data.
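Steps S101.2 and S101.3 can be sketched as follows. This is a minimal illustration, not the patented implementation: the packet record layout (a plain tuple) and the function name `split_into_flows` are hypothetical, packets are assumed to be already parsed from the pcap file and listed in capture order, and each direction of a connection is treated as its own flow.

```python
from collections import defaultdict

# Hypothetical packet record, in capture order:
# (src_ip, dst_ip, src_port, dst_port, proto, payload)
def split_into_flows(packets):
    """Group packets by five-tuple and concatenate payloads in capture order."""
    flows = defaultdict(bytearray)
    for src_ip, dst_ip, src_port, dst_port, proto, payload in packets:
        key = (src_ip, dst_ip, src_port, dst_port, proto)
        flows[key] += payload  # append in arrival order
    return {key: bytes(data) for key, data in flows.items()}

packets = [
    ("10.0.0.1", "10.0.0.2", 40000, 443, 6, b"\x16\x03"),
    ("10.0.0.3", "10.0.0.2", 40001, 80, 6, b"GET "),
    ("10.0.0.1", "10.0.0.2", 40000, 443, 6, b"\x01\x02"),
]
flows = split_into_flows(packets)
flow_data = flows[("10.0.0.1", "10.0.0.2", 40000, 443, 6)]
length = len(flow_data)  # plays the role of Len, counted in bytes
```

The resulting `flow_data` is the variable-length traffic data from which features are later computed.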
Further, step S102 includes:
Step S102.1: Obtain the length Len of the traffic data, counted in bytes;
Step S102.2: Compute the ECF feature vector of the traffic data; the ECF feature vector includes: chi-square, Renyi cross entropy, single-bit (monobit) frequency, frequency within a block, runs, longest run, Fourier transform, non-overlapping template matching, serial, and cumulative sums.
Specifically, with the byte as the statistical element, the element set is {0x00, 0x01, ..., 0xFF}. Let the input data be data with length Len, and let F_i (i = 0, ..., 255) be the number of times byte value i occurs in data. In statistics, the chi-square value reflects well how far the observed values of a sample deviate from the theoretically expected values. Renyi cross entropy (α-Renyi entropy), as a generalization of Shannon entropy, Hartley entropy, and min-entropy, quantifies well the uncertainty and randomness of the data. The invention therefore selects these two statistics to analyze the randomness of compressed and encrypted data, where chi-square takes the uniform distribution as the theoretical expectation and the parameter α of the Renyi cross entropy is set to 0.5. Formula (1) and formula (2) give the chi-square and Renyi cross entropy calculations used in this embodiment, respectively.
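As a rough illustration of these two statistics, the sketch below uses the textbook Pearson chi-square against a uniform expectation of Len/256 per byte value and the textbook Renyi entropy of order α. Formulas (1) and (2) are not reproduced in this text, so the exact form used in the embodiment (in particular the "cross entropy" variant) may differ; the function names are hypothetical and the input is assumed to be non-empty.

```python
import math
from collections import Counter

def chi_square_uniform(data: bytes) -> float:
    """Pearson chi-square of byte counts F_i against the expectation Len/256."""
    expected = len(data) / 256.0
    counts = Counter(data)
    return sum((counts.get(i, 0) - expected) ** 2 / expected for i in range(256))

def renyi_entropy(data: bytes, alpha: float = 0.5) -> float:
    """Renyi entropy of order alpha (alpha != 1) of the byte distribution, in bits."""
    n = len(data)
    probs = [count / n for count in Counter(data).values()]
    return math.log2(sum(p ** alpha for p in probs)) / (1.0 - alpha)

uniform = bytes(range(256)) * 4   # every byte value occurs exactly 4 times
constant = b"\x00" * 1024         # one byte value repeated 1024 times
```

For perfectly uniform bytes the chi-square value is 0 and the entropy reaches its maximum of 8 bits; for constant data the chi-square value is very large and the entropy is 0, which is the contrast these features are meant to capture.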
The remaining features are obtained by treating data as a binary string, feeding it to the corresponding tests in the NIST SP 800-22 random number test suite, and taking the resulting p_value of each test.
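To show what such a p_value looks like, here is a minimal sketch of two of the listed tests, the frequency (monobit) test and the runs test, written from the formulas in NIST SP 800-22. It is not the C++ implementation used in the embodiment. The bit order in `to_bits` (most significant bit first) is an assumption, and the input is assumed to be non-empty.

```python
import math

def to_bits(data: bytes):
    """Expand bytes into a list of bits, most significant bit first (assumed order)."""
    return [(byte >> shift) & 1 for byte in data for shift in range(7, -1, -1)]

def monobit_p_value(bits) -> float:
    """Frequency (monobit) test: p = erfc(|S_n| / sqrt(n) / sqrt(2))."""
    n = len(bits)
    s_n = sum(1 if bit else -1 for bit in bits)
    return math.erfc(abs(s_n) / math.sqrt(n) / math.sqrt(2))

def runs_p_value(bits) -> float:
    """Runs test: compares the number of runs V_n with its expectation 2*n*pi*(1-pi)."""
    n = len(bits)
    pi = sum(bits) / n
    # Prerequisite check from the standard: skip the test if the ones ratio is too skewed.
    if pi in (0.0, 1.0) or abs(pi - 0.5) >= 2 / math.sqrt(n):
        return 0.0
    v_n = 1 + sum(1 for a, b in zip(bits, bits[1:]) if a != b)
    numerator = abs(v_n - 2 * n * pi * (1 - pi))
    denominator = 2 * math.sqrt(2 * n) * pi * (1 - pi)
    return math.erfc(numerator / denominator)

example_bits = to_bits(b"\xa5")  # 0xA5 -> 1 0 1 0 0 1 0 1
```

Each test contributes its p_value as an element of the feature vector, so a payload is summarized by a handful of floating point numbers regardless of its length.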
The ECF feature vector is shown in Table 1. Its elements are single-precision floating point numbers and its dimension is 12:
Table 1: ECF feature vector elements
Further, before step S103 the method also includes:
Building a machine learning model based on the ECF feature vector; the machine learning algorithms used by the model include random forest, XGBoost, and MLP.
Specifically, building a machine learning model based on the ECF feature vector includes:
To simulate a real network traffic inspection environment as closely as possible, in one possible implementation six types of files were selected from public sources on the Internet to build a basic data set, and a research data set generation algorithm was designed using the encryption and compression algorithms most widely used in current Internet communication. Grouped by data length, seven fixed-size research data sets were constructed with data lengths from 1KB to 64KB. In what follows they are denoted D_1KB, D_2KB, D_4KB, D_8KB, D_16KB, D_32KB, and D_64KB.
(1) Basic data set selection
Table 2 lists the sources of the basic data set. Note that the original text, image, and video data sets are large, so this embodiment adds only part of each public data set to the basic data set. The audio files downloaded from the public Internet are mainly Chinese and English songs, and the mixed documents are likewise mainly Chinese and English documents with a small number of Japanese documents.
Table 2: Basic data set
(2) Encryption and compression algorithm selection
Table 3 lists the selected encryption and compression algorithms. The encryption algorithms use CBC mode, the WinRAR version is 5.71 (64-bit), the zip and Gzip algorithms come from open source code, and all three encryption algorithms are implemented with the cryptoPP-820 code library.
Table 3: Selected encryption and compression algorithms
To make the encryption random while keeping it easy to verify that encryption and decryption are correct, the encryption key, key length, and initial IV (base key) are randomized. The parameter initialization algorithm pseudo-randomizes the initial parameters of the encryption algorithm using the file name as the seed. Algorithm 1 gives the specific procedure:
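Algorithm 1 itself is not reproduced in this text. As a purely hypothetical illustration of the idea (file name as seed, pseudo-random key length, key, and IV), one might write something like the sketch below. The allowed key lengths and the block size are assumptions, not values taken from Table 3, and Python's `random` module is used only because reproducibility is the goal here; it is not suitable for real cryptographic keys.

```python
import random

def init_cipher_params(file_name: str, key_lengths=(16, 24, 32), block_size=16):
    """Derive (key_len, key, iv) deterministically from the file name.

    key_lengths and block_size are illustrative assumptions.
    """
    rng = random.Random(file_name)  # a string seed gives a reproducible sequence
    key_len = rng.choice(key_lengths)
    key = bytes(rng.getrandbits(8) for _ in range(key_len))
    iv = bytes(rng.getrandbits(8) for _ in range(block_size))
    return key_len, key, iv

first = init_cipher_params("sample.txt")
second = init_cipher_params("sample.txt")
```

Because the parameters depend only on the file name, the same file can be decrypted later without storing the key, which matches the stated goal of making correctness easy to verify.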
(3) Target research data set generation algorithm
Algorithm 2 outlines the generation of the research data sets. All encryption and compression algorithms used are those listed in Table 3. Each file in the basic data set yields 6 compressed or encrypted files, and each of those is then cut into pieces of different lengths to form the 7 research data sets.
Because encryption and compression behave differently, a file keeps its size after encryption but is generally smaller after compression. To keep the sample distribution balanced, so that the constructed data set contains equal totals of encrypted and compressed data and equal totals per original data source, and thus to provide a more credible research data set for the subsequent experiments, the smallest number of pieces over all processed data is taken as the baseline and the same number of pieces is drawn at random from each processed file to form the research data sets (D_1KB, D_2KB, ..., D_64KB). For example, D_1KB is constructed as follows. Table 4 shows the size of the basic data set after compression and encryption and the corresponding number of pieces after cutting at a length of 1KB. The smallest count is 227169 (text compressed with RAR). Taking 227169 as the baseline, 227169 pieces (each 1KB long) are drawn at random from each of the 36 generated compressed or encrypted files to produce D_1KB, which therefore contains 8178084 pieces in total. With this selection, the amounts of encrypted and of compressed data in D_1KB (4089042 pieces each), the amounts of encrypted and of compressed data for each type of original file (681507 pieces each), and the amount of data for each encryption or compression algorithm (1363014 pieces) are all equal.
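The counts quoted for D_1KB follow directly from the baseline of 227169 pieces, 6 file types, and 3 encryption plus 3 compression algorithms. The short sketch below reproduces that arithmetic and adds a hypothetical `cut_into_pieces` helper; dropping the trailing partial piece is an assumption, since Algorithm 2 is not reproduced in this text.

```python
FILE_TYPES = 6          # types of source files in the basic data set
ENCRYPTION_ALGOS = 3
COMPRESSION_ALGOS = 3
BASELINE = 227169       # smallest number of 1KB pieces (text compressed with RAR)

processed_files = FILE_TYPES * (ENCRYPTION_ALGOS + COMPRESSION_ALGOS)
total_pieces = processed_files * BASELINE
encrypted_pieces = FILE_TYPES * ENCRYPTION_ALGOS * BASELINE
compressed_pieces = FILE_TYPES * COMPRESSION_ALGOS * BASELINE
per_type_encrypted = ENCRYPTION_ALGOS * BASELINE   # per original file type
per_algorithm = FILE_TYPES * BASELINE              # per encryption or compression algorithm

def cut_into_pieces(data: bytes, piece_len: int):
    """Cut data into fixed-length pieces, discarding a trailing partial piece (assumed)."""
    return [data[i:i + piece_len] for i in range(0, len(data) - piece_len + 1, piece_len)]

pieces = cut_into_pieces(b"abcdefg", 3)
```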
Table 4: Basic data set after compression and encryption
(4) Feature vector extraction
ECF feature vectors are extracted from each research data set.
(5) Select a suitable classification algorithm that takes the ECF feature vector as input and outputs encrypted or compressed traffic (corresponding to data labels 0 and 1) to complete the training of the machine learning model. In one possible implementation, this embodiment selects two ensemble learning algorithms, random forest and XGBoost, and the MLP deep learning algorithm as the classification algorithms of the machine learning model.
Further, step S103 includes:
Step S103.1: Select an already trained machine learning model according to the length Len of the traffic data and the expected test accuracy;
Step S103.2: Input the ECF feature vector of the traffic data into the machine learning model to obtain the identification result; the identification results include encrypted traffic and compressed traffic.
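The text does not spell out the selection rule in step S103.1. One plausible reading, sketched below under that assumption, is to keep a registry of trained models, each tagged with the data length it was trained on and its recorded test accuracy, and to pick the model trained on the longest data that still fits within Len and meets the expected accuracy. The registry layout, the function name, and the numbers in the example are placeholders, not results from the patent.

```python
def select_model(registry, length, expected_accuracy):
    """Pick a trained model for a payload of `length` bytes.

    registry: list of (trained_length_bytes, test_accuracy, model) tuples.
    Returns None when no model satisfies both constraints.
    """
    candidates = [
        entry for entry in registry
        if entry[0] <= length and entry[1] >= expected_accuracy
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda entry: entry[0])[2]

# Placeholder registry: the accuracies are dummy values for illustration only.
registry = [(1024, 0.70, "model_1KB"), (8192, 0.90, "model_8KB")]
chosen = select_model(registry, length=4096, expected_accuracy=0.60)
```

In use, the payload would presumably be truncated to the selected model's training length before the ECF feature vector is computed, but that detail is also an assumption.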
To verify the effectiveness of the invention, the following experiments were carried out:
The classification performance, generalization performance, and computational complexity of the invention were tested on the 7 constructed research data sets and on the public data set (detect) provided in Document 1 (Hahn D, Apthorpe N, Feamster N. Detecting Compressed Cleartext Traffic from Consumer Internet of Things Devices [J]. arXiv preprint arXiv:1805.02722, 2018.). ECF feature vector extraction was implemented in C++ on the VS2017 platform. The machine learning classification algorithms were implemented in Python 2.7 on the PyCharm platform. The test machine ran Windows 10 (x64) with an Intel Core i7-7700 CPU @ 3.60 GHz and 8 GB of memory.
(a) Classification performance
Following the method of the invention, ECF features were first extracted from all data in the 7 research data sets, so that each piece of data corresponds to one 12-dimensional ECF feature vector. Then 50% of the data was selected at random as the training set and the rest was used as the test set, and each of the three classification algorithms selected for the detection model was trained and tested 20 times. To further verify the effectiveness of the detection model, the machine learning model of the invention was also trained and tested on the public data set (detect) provided in Document 1. Because the detect data set is small (16,796 pieces), 3-fold cross-validation was used, and each of the three classification algorithms was trained and tested 30 times.
The present invention uses test accuracy as the evaluation metric for classification performance; the test accuracy is the mean of the overall accuracy over multiple tests. The overall accuracy of each test = (TP+FN)/(TP+TN+FP+FN), where TP is the number of encrypted data items in the set classified as encrypted, FP is the number of compressed data items in the set classified as encrypted, TN is the number of encrypted data items in the set classified as compressed, and FN is the number of compressed data items in the set classified as compressed. Under these definitions TP and FN both count correctly classified items, so the numerator is the total number of correct classifications. Table 5 lists the main parameters and test accuracy of the detection model. Apart from the n_estimators parameter of random forest and Xgboost, which differs slightly on datasets with lengths below 4KB, the remaining parameters of the three classification algorithms are essentially the same across datasets. This indicates that the three algorithms are reasonably stable on data of different lengths, and that the ECF feature vector can characterize encrypted and compressed data of different lengths. Test accuracy rises as data length increases, which matches intuition. In particular, once the data reaches 8KB the classification accuracy exceeds 92%, showing that under the machine learning model proposed by the present invention 8KB of data is enough for the ECF feature vector to express the difference between encrypted and compressed data.
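The accuracy metric above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names and the sample counts are hypothetical, and the argument names follow the document's own (non-standard) use of TP, FP, TN, and FN.

```python
def overall_accuracy(tp, fp, tn, fn):
    """Overall accuracy of one test, using the document's naming:
    tp: encrypted items classified as encrypted (correct)
    fp: compressed items classified as encrypted (wrong)
    tn: encrypted items classified as compressed (wrong)
    fn: compressed items classified as compressed (correct)
    """
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no samples")
    return (tp + fn) / total


def mean_test_accuracy(per_run_counts):
    """Test accuracy: mean of the overall accuracy over repeated runs."""
    accuracies = [overall_accuracy(*counts) for counts in per_run_counts]
    return sum(accuracies) / len(accuracies)


# Hypothetical (tp, fp, tn, fn) counts for two runs, for illustration only.
runs = [(80, 30, 20, 70), (90, 40, 10, 60)]
print(mean_test_accuracy(runs))
```

Keeping the per-run counts separate and averaging afterwards mirrors the description, where the reported figure is the mean over 20 or 30 repeated tests.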
Table 5 Main parameters and performance of the machine learning models
Figure 2 compares the test accuracy of the method of the present invention with the detection methods proposed in Document 1 and Document 2 (Casino F, Choo K K R, Patsakis C. HEDGE: Efficient Traffic Classification of Encrypted and Compressed Packets[J]. IEEE Transactions on Information Forensics and Security, 2019: 2916-2926.). The method of the present invention achieved better detection results on every dataset tested. It reached a classification accuracy of 99.51% on 64KB data, above 92% on data of 8KB or more, and 71.42% on 1KB data. Compared with the HEDGE method of Document 2, test accuracy improved by 2.74% to 7.22% across the seven datasets with data lengths from 1KB to 64KB. On the public dataset detect (all data 1KB long) the method was also stable: 7% higher than the method of Document 1 (Daniel Hahn) and 3.3% higher than the HEDGE method of Document 2.
The experiments show that improving classification accuracy remains difficult when the data length is below 2KB. For example, take the byte values {0x00, 0x01, ..., 0xFF} as the statistical elements. Under a uniform distribution, each byte value has a frequency of 1/256 in 1KB of data and occurs 4 times; in 64KB of data the frequency is also 1/256 and the count is 256. If, in actual data, some byte value occurs once more than under the uniform distribution, then in 1KB of data its count becomes 5 and its frequency 5/1024, while in 64KB of data its count becomes 257 and its frequency 257/(64*1024). For the same fluctuation of one extra occurrence, the frequency therefore changes by 25% on 1KB data and by 0.39% on 64KB data. Statistically, the smaller the data, the more strongly byte frequencies react to fluctuations in counts, so the different characteristics of encryption and compression algorithms are not fully expressed. The work in Document 2 shows the same effect, so a 3.3% accuracy gain on 1KB data is harder to achieve than a 4.8% gain on 64KB data. In practical detection, small data items are encountered more often, so accuracy gains on small datasets are more helpful to actual detection work.
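The arithmetic in this example can be checked with a short Python sketch. The function name and its parameters are hypothetical; exact fractions are used so that the 25% and 0.39% figures come out without rounding noise.

```python
from fractions import Fraction


def relative_frequency_change(length_bytes, extra=1, alphabet=256):
    """Relative change in one byte value's frequency when it occurs
    `extra` times more often than under a uniform distribution."""
    expected_count = Fraction(length_bytes, alphabet)  # 4 for 1KB, 256 for 64KB
    base_freq = expected_count / length_bytes          # always 1/256
    new_freq = (expected_count + extra) / length_bytes
    return (new_freq - base_freq) / base_freq


for kb in (1, 64):
    change = relative_frequency_change(kb * 1024)
    print(f"{kb}KB: {float(change):.2%}")
```

The relative change reduces to `extra / expected_count`, which is why the same single extra occurrence weighs 64 times more heavily on 1KB data than on 64KB data.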
In addition, the randomness feature selected by the present invention, the ECF feature vector, characterizes the randomness of data mainly from a statistical perspective. Encrypted data is comparatively more random than compressed data, so for short data the recall of encrypted data is high and the recall of compressed data is low. As a result, in the classification output the precision of encrypted data is low and the precision of compressed data is high. Table 6 gives the detailed report of a single Xgboost test on the D_1KB dataset, where this effect is clearly visible. Finding a more complete feature set that raises the recall of compressed data, and thereby the precision of encrypted data and the overall classification accuracy, remains one of the research directions and challenges for future work of this kind.
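The link between recall and precision described here can be illustrated with a small Python sketch. The function name and the counts are hypothetical and are not taken from Table 6; they are chosen only to reproduce the pattern of high recall for encrypted data and low recall for compressed data.

```python
def per_class_report(enc_as_enc, enc_as_cmp, cmp_as_enc, cmp_as_cmp):
    """Precision and recall for the encrypted and compressed classes.

    enc_as_cmp means: truly encrypted, classified as compressed, and so on.
    """
    return {
        "encrypted": {
            "precision": enc_as_enc / (enc_as_enc + cmp_as_enc),
            "recall": enc_as_enc / (enc_as_enc + enc_as_cmp),
        },
        "compressed": {
            "precision": cmp_as_cmp / (cmp_as_cmp + enc_as_cmp),
            "recall": cmp_as_cmp / (cmp_as_cmp + cmp_as_enc),
        },
    }


# Hypothetical counts: most encrypted items are found (high recall), but
# many compressed items are also labelled encrypted, which lowers the
# precision of the encrypted class.
report = per_class_report(enc_as_enc=90, enc_as_cmp=10, cmp_as_enc=50, cmp_as_cmp=50)
for name, scores in report.items():
    print(name, scores)
```

With these counts the encrypted class has recall 0.9 but precision of about 0.64, while the compressed class has recall 0.5 and precision of about 0.83, the same direction of imbalance the paragraph describes.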
Table 6 Test accuracy report of the Xgboost algorithm on the 1KB dataset
(b) Generalization performance test
(b.1) Generalization between datasets of the same length
This test uses D_1KB and detect, two datasets whose data items are all 1KB long, as training sets to train two separate machine learning models, and then uses the same two datasets as test sets to evaluate how each trained model generalizes. The results in Figure 3 show that the classification model trained on detect reached a best test accuracy of 70.92% (MLP algorithm) on D_1KB, and the classification model trained on D_1KB reached a best test accuracy of 72.98% (random forest algorithm) on detect. The three classification algorithms selected for the machine learning model of the present invention therefore still achieve good test accuracy when the training and test data have the same length but come from different sources.
(b.2) Generalization between datasets of different lengths
This test was carried out on the seven research datasets constructed in this embodiment and covers two generalization cases. In the first case, D_1KB is the training dataset and all research datasets are test datasets. In the second case, D_64KB is the training dataset and all research datasets are test datasets. Figures 4 and 5 give the respective results. When the training dataset is D_1KB, the trained model also achieves good classification accuracy on the other six research datasets; in particular, test accuracy exceeds 90% once the data length reaches 8KB. When the training dataset is D_64KB, the trained model performs well on D_32KB but its test accuracy on datasets shorter than 4KB is low. On D_1KB in particular, the Xgboost test accuracy is 10% lower than when both the training and test data come from D_1KB, which is a very large loss for classifying 1KB data. These results indicate that a model trained on a dataset of short data can be used to detect longer data, but not the reverse.
(c) Time complexity test
To assess the training time of the three classification algorithms in the model, the average time over 30 training runs of each algorithm was taken as the metric of its time complexity, giving a simple evaluation of the three algorithms. Table 7 gives the time required for one training run on the D_1KB dataset with 900,000 training items. With the same training data and hardware configuration, MLP needs the shortest training time and Xgboost the longest (almost 20 times that of MLP).
Table 7 Training time comparison of the different learning models
The experiments also showed that when the training dataset is small, the choice of classification algorithm and the tuning of its parameters strongly affect test accuracy. For example, on the detect dataset (16,796 items), random forest achieves higher test accuracy than MLP (see Figure 3 for the test accuracy). When the training dataset is large enough, the three classification algorithms reach nearly equal test accuracy, and test accuracy becomes insensitive to parameter tuning. Figure 6 shows, for the different classification algorithms on the D_1KB dataset with the model parameters (Table 7) held fixed, how the amount of training data affects test accuracy (the mean overall accuracy over 15 training runs). Below 25,000 items, the test accuracy of the different algorithms differs considerably, and the gap can be narrowed by tuning the model parameters. At 100,000 items, the test accuracy of the three algorithms becomes nearly equal and stable. Above 300,000 items, there is almost no difference between the algorithms. Model construction should therefore adapt to the amount of training data. With a large amount of data, a model with a shorter training time (MLP) can be chosen, and coarse-grained parameter tuning is enough to achieve good classification. With a small amount of data, a more accurate model (random forest) should be chosen and its parameters carefully optimized to achieve good classification.
It should be noted that, to verify the effect of the present invention quickly, seven research datasets (consisting of encrypted traffic and compressed traffic) were constructed, and the above experiments were all carried out with the machine learning model of this embodiment as a binary classification model. Those skilled in the art will appreciate that, on the basis of the above embodiment, the present invention can equally construct research datasets containing encrypted traffic, compressed traffic, and other traffic (neither encrypted nor compressed), and thus build a three-class machine learning model, as shown in Figure 7, to run the corresponding experiments and obtain results similar to those of the binary classification model.
Based on the difference between the randomness characteristics of encrypted data and compressed data, the present invention constructs an effective randomness feature, ECF, designs a suitable machine learning model for traffic identification, and classifies encrypted and compressed traffic using mainstream machine learning algorithms. In addition, the present invention constructs research datasets with a balanced distribution and experimentally validates the feature set and the identification model. The results show that the proposed method can effectively distinguish encrypted data from compressed data. The detection and generalization results show that, in actual network traffic testing, the proposed machine learning model relies only on the payload data length of the packets and is independent of other factors such as packet order and network protocol, so it can be migrated across platforms relatively easily. Because the randomness characteristics of all public encryption algorithms are essentially similar, the method generalizes well. On other datasets, the model trained on the current datasets can be used directly and achieves good classification even without retraining. If further training is done on the new dataset, accuracy can improve somewhat.
According to the randomness feature ECF proposed by the present invention, an ECF feature vector is extracted from the traffic data and input into a pre-trained machine learning model, which identifies the type of the traffic, including encrypted traffic and compressed traffic. Traditional traffic identification methods based on statistical features select few statistical features, use a single discrimination method, and identify encrypted and compressed traffic with low accuracy. The method provided by the present invention exploits the strength of machine learning algorithms in learning from large amounts of data to design an effective machine learning identification model. It overcomes the above shortcomings, improves the identification accuracy for encrypted and compressed traffic, and provides technical support for fine-grained identification of network traffic in network management. In particular, it can still identify encrypted and compressed traffic with relatively high accuracy when only part of the data is captured or the amount of data is small.
On the basis of the above embodiment, as shown in Figure 8, the present invention further discloses a device for identifying encrypted and compressed traffic based on randomness features, comprising:
a collection and parsing module, configured to capture network data and parse it to obtain traffic data;
a feature extraction module, configured to compute the randomness feature ECF feature vector of the traffic data; the ECF feature vector comprises: chi-square, Renyi cross entropy, single-bit frequency, frequency within a block, runs, longest run, Fourier transform, non-overlapping template matching, serial, and cumulative sums;
a traffic identification module, configured to take the ECF feature vector as input and perform identification through a machine learning model, the identification result including encrypted traffic and compressed traffic.
Further, the collection and parsing module comprises:
a packet acquisition submodule, configured to capture packets from the external network and save them as a pcap file;
a network flow division submodule, configured to divide the captured packets into network flows by five-tuple and save them as flow files;
a parsing submodule, configured to parse each flow file according to the TCP/IP protocol format, obtain the data payload, and concatenate the payloads in packet capture order into one variable-length traffic data item.
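The flow division and payload concatenation steps can be sketched as follows. This is a minimal illustration, assuming packets have already been parsed into simple records: the `Packet` type, its field names, and `split_into_flows` are hypothetical, reading pcap files is out of scope, and each direction of a connection is treated as its own flow, which the description does not specify.

```python
from typing import Dict, Iterable, NamedTuple, Tuple


class Packet(NamedTuple):
    """Hypothetical record for one parsed packet."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str
    payload: bytes


FlowKey = Tuple[str, int, str, int, str]


def split_into_flows(packets: Iterable[Packet]) -> Dict[FlowKey, bytes]:
    """Group packets by five-tuple and concatenate payloads in capture order."""
    buffers: Dict[FlowKey, bytearray] = {}
    for pkt in packets:
        key = (pkt.src_ip, pkt.src_port, pkt.dst_ip, pkt.dst_port, pkt.protocol)
        buffers.setdefault(key, bytearray()).extend(pkt.payload)
    return {key: bytes(buf) for key, buf in buffers.items()}


# Hypothetical capture: two flows, with packets of the first interleaved.
packets = [
    Packet("10.0.0.1", 5000, "10.0.0.2", 443, "TCP", b"ab"),
    Packet("10.0.0.3", 6000, "10.0.0.2", 80, "TCP", b"xy"),
    Packet("10.0.0.1", 5000, "10.0.0.2", 443, "TCP", b"cd"),
]
flows = split_into_flows(packets)
for key, data in flows.items():
    print(key, len(data))
```

Each value in `flows` is one variable-length traffic data item, and its byte length is the `Len` used by the later modules.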
Further, the feature extraction module comprises:
a traffic data length acquisition submodule, configured to obtain the length Len of the traffic data, counted in bytes;
a feature extraction submodule, configured to compute the ECF feature vector of the traffic data.
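As an illustration of what the feature extraction submodule computes, the sketch below derives `Len` and two of the listed features in Python. It assumes the textbook forms of the statistics (a chi-square statistic over the 256 byte values, and the single-bit frequency test in the style of the NIST statistical test suite); the patent's exact formulas may differ, the function names are hypothetical, and the remaining ECF components are omitted.

```python
import math
from collections import Counter


def chi_square_bytes(data: bytes) -> float:
    """Chi-square statistic of the byte histogram against a uniform distribution."""
    if not data:
        raise ValueError("empty traffic data")
    expected = len(data) / 256
    counts = Counter(data)
    return sum((counts.get(v, 0) - expected) ** 2 / expected for v in range(256))


def monobit_p_value(data: bytes) -> float:
    """Single-bit frequency test: p-value for the balance of ones and zeros."""
    if not data:
        raise ValueError("empty traffic data")
    n_bits = len(data) * 8
    ones = sum(bin(b).count("1") for b in data)
    s_obs = abs(2 * ones - n_bits) / math.sqrt(n_bits)
    return math.erfc(s_obs / math.sqrt(2))


def partial_features(data: bytes) -> dict:
    """Len plus two ECF-style features (the other components are not shown)."""
    return {
        "Len": len(data),
        "chi_square": chi_square_bytes(data),
        "monobit_p": monobit_p_value(data),
    }


# A perfectly uniform 1KB buffer versus an all-zero 1KB buffer.
print(partial_features(bytes(range(256)) * 4))
print(partial_features(bytes(1024)))
```

A perfectly uniform buffer gives a chi-square of 0 and a single-bit p-value of 1, while highly structured data pushes chi-square up and the p-value toward 0; encrypted and compressed payloads are expected to fall between these extremes at different points.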
Further, the device also comprises:
an identification model construction module, configured to build the machine learning model based on the ECF feature vector; the machine learning algorithms included in the machine learning model are random forest, Xgboost, and MLP.
Further, the traffic identification module comprises:
a model selection submodule, configured to select an already trained machine learning model for identification according to the length Len of the traffic data and the desired test accuracy;
a traffic identification submodule, configured to input the ECF feature vector of the traffic data into the machine learning model to obtain the identification result, the identification result including encrypted traffic and compressed traffic.
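One possible reading of the model selection submodule is sketched below in Python. The rule used here is an assumption, not the patent's stated procedure: it picks the model trained on the longest data that does not exceed `Len`, motivated by the generalization result that models trained on short data work on longer data but not the reverse. The registry contents and function name are hypothetical, and the desired-accuracy criterion is left out because the description does not say how it is applied.

```python
import bisect


def select_model(models_by_train_len, len_bytes):
    """Pick the model trained on the longest data not exceeding len_bytes.

    Falls back to the model trained on the shortest data when the traffic
    data is shorter than every training length.
    """
    lengths = sorted(models_by_train_len)
    idx = bisect.bisect_right(lengths, len_bytes) - 1
    return models_by_train_len[lengths[max(idx, 0)]]


# Hypothetical registry: training data length in bytes -> trained model handle.
registry = {
    1 * 1024: "model_D_1KB",
    8 * 1024: "model_D_8KB",
    64 * 1024: "model_D_64KB",
}
print(select_model(registry, 5000))
print(select_model(registry, 70000))
```

In a real deployment the registry values would be trained classifier objects rather than strings, and the selected model would then receive the ECF feature vector of the traffic data.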
The above is only a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art can make various improvements and refinements without departing from the principles of the present invention, and such improvements and refinements shall also be regarded as falling within the scope of protection of the present invention.
Claims (4)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010432177.4A CN111723846B (en) | 2020-05-20 | 2020-05-20 | Encrypted and compressed traffic identification method and device based on random characteristics |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111723846A CN111723846A (en) | 2020-09-29 |
CN111723846B true CN111723846B (en) | 2024-01-26 |
Family
ID=72564730
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010432177.4A Active CN111723846B (en) | 2020-05-20 | 2020-05-20 | Encrypted and compressed traffic identification method and device based on random characteristics |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111723846B (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113938437A (en) * | 2021-11-30 | 2022-01-14 | 湖北天融信网络安全技术有限公司 | A traffic identification method, device, electronic device and storage medium |
CN114301850B (en) * | 2021-12-03 | 2024-03-15 | 成都中科微信息技术研究院有限公司 | Military communication encryption flow identification method based on generation of countermeasure network and model compression |
CN114244779B (en) * | 2021-12-14 | 2024-08-13 | 湖北天融信网络安全技术有限公司 | Traffic identification method and device and storage medium |
CN114329119A (en) * | 2021-12-29 | 2022-04-12 | 厦门安胜网络科技有限公司 | Deep learning-based important information ordering method and device for traffic analysis and storage medium |
CN114866485B (en) * | 2022-03-11 | 2023-09-29 | 南京华飞数据技术有限公司 | Network traffic classification method and classification system based on aggregation entropy |
CN115174170B (en) * | 2022-06-23 | 2023-05-09 | 东北电力大学 | A VPN Encrypted Traffic Identification Method Based on Ensemble Learning |
CN117955744B (en) * | 2024-03-26 | 2024-06-07 | 江苏大道云隐科技有限公司 | Cross-platform information security transmission method and system |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6760845B1 (en) * | 2002-02-08 | 2004-07-06 | Networks Associates Technology, Inc. | Capture file format system and method for a network analyzer |
CN105871619A (en) * | 2016-04-18 | 2016-08-17 | 中国科学院信息工程研究所 | Method for n-gram-based multi-feature flow load type detection |
CN110012029A (en) * | 2019-04-22 | 2019-07-12 | 中国科学院声学研究所 | A method and system for distinguishing between encrypted and non-encrypted compressed traffic |
WO2019144521A1 (en) * | 2018-01-23 | 2019-08-01 | 杭州电子科技大学 | Deep learning-based malicious attack detection method in traffic cyber physical system |
Non-Patent Citations (1)
Title |
---|
丁杰 ; 黄亮 ; 庹宇鹏 ; 桑亚飞 ; 张永铮 ; .基于n-gram多特征的流量载荷类型分类方法.计算机应用与软件.2017,(02),全文. * |
Also Published As
Publication number | Publication date |
---|---|
CN111723846A (en) | 2020-09-29 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CP03 | Change of name, title or address | ||
Address after: 450000 Science Avenue 62, Zhengzhou High-tech Zone, Henan Province Patentee after: Information Engineering University of the Chinese People's Liberation Army Cyberspace Force Country or region after: China Address before: No. 62 Science Avenue, High tech Zone, Zhengzhou City, Henan Province Patentee before: Information Engineering University of Strategic Support Force,PLA Country or region before: China |