CN112217822B - Detection method for intrusion data
- Publication number
- CN112217822B CN112217822B CN202011088479.0A CN202011088479A CN112217822B CN 112217822 B CN112217822 B CN 112217822B CN 202011088479 A CN202011088479 A CN 202011088479A CN 112217822 B CN112217822 B CN 112217822B
- Authority
- CN
- China
- Prior art keywords
- sample
- classifier
- samples
- data
- rate
- Prior art date
- Legal status
- Active
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1408—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
- H04L63/1416—Event detection, e.g. attack signature detection
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/14—Network analysis or design
- H04L41/142—Network analysis or design using statistical or mathematical methods
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/20—Network architectures or network communication protocols for network security for managing network security; network security policies in general
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Computer Security & Cryptography (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Signal Processing (AREA)
- Computer Networks & Wireless Communication (AREA)
- General Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Computer Hardware Design (AREA)
- Computing Systems (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Algebra (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Mathematical Physics (AREA)
- Probability & Statistics with Applications (AREA)
- Pure & Applied Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
Technical Field
The present invention relates to the technical field of intrusion detection for information security, and more particularly to a detection method for intrusion data.
Background Art
With the development of the Internet industry, preventing information from being compromised has become an important topic, and the imbalanced distribution of data is one of the problems that keeps information protection from improving effectively. An imbalanced data distribution means that the number of samples of one or a few classes in a data set far exceeds the number of samples of the other classes. Classes that account for a small share of the data are called minority classes, and classes that account for a large share are called majority classes. Network attacks come in many varieties: some attack types are very common, such as DDoS, brute-force cracking and ARP spoofing, while others appear far less often, such as unauthorized local superuser privilege access and unauthorized access from a remote host. The number of samples of these rare attack types is therefore much smaller than the number of samples of the common types. A DDoS attack can damage the entire network, degrade service performance and block terminal services, while unauthorized access to a remote host can allow the host to be taken over for illegal and criminal activities. Existing classification methods achieve a high recognition rate on majority-class sample points but misclassify minority-class samples, which causes serious problems and weakens protection against minority-class attacks. It is therefore important to handle imbalanced data sets and improve the generalization performance of data intrusion detection models.
Summary of the Invention
The present invention overcomes the deficiencies of the prior art and provides a detection method for intrusion data that improves data protection and detection generalization performance.
To solve the above technical problems, the technical solution of the present invention is as follows:
A detection method for intrusion data, comprising the following steps:
1) Balanced data set acquisition step: a coarse clustering method is used to compute, by Euclidean distance, the distance from each sample point in the training data to the cluster centers, and the data are divided into multiple cluster subsets; cluster subsets that contain few sample points and lie far away are treated as noise points, and this noise data is deleted. The different classes of intrusion data are then randomly sampled and dimensionality reduction is performed to lessen overfitting of the models for the different intrusion classes, and the training set is balanced within classes by an oversampling method that increases the number of samples of some classes, yielding a balanced data set;
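For orientation only, the following Python sketch shows one way these three preprocessing stages might be composed. The helper names canopy_denoise, undersample_majority and smote_oversample, and all parameter values, are assumptions for illustration rather than part of the claimed method; sketches of the Canopy and SMOTE helpers follow the detailed steps later in this description.

```python
import numpy as np

rng = np.random.default_rng(0)

def undersample_majority(X, y, majority_classes, keep_ratio=0.3):
    """Randomly keep a fraction of each majority class (simple undersampling)."""
    keep = np.ones(len(y), dtype=bool)
    for cls in majority_classes:
        idx = np.flatnonzero(y == cls)
        drop = rng.choice(idx, size=int(len(idx) * (1 - keep_ratio)), replace=False)
        keep[drop] = False
    return X[keep], y[keep]

def build_balanced_dataset(X, y, majority_classes, minority_classes,
                           t1=5.0, t2=2.0, k=5, n_new=4):
    # canopy_denoise and smote_oversample are assumed helpers; sketches of
    # both appear after the corresponding detailed steps below.
    X, y = canopy_denoise(X, y, t1=t1, t2=t2)           # drop noise points
    X, y = undersample_majority(X, y, majority_classes)  # shrink majority classes
    for cls in minority_classes:                         # synthesize minority samples
        X_new = smote_oversample(X[y == cls], k=k, n_new=n_new)
        X = np.vstack([X, X_new])
        y = np.concatenate([y, np.full(len(X_new), cls)])
    return X, y
```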
2) Data classification step: the balanced data set is classified by a classifier that adjusts the weights of misclassified samples to improve the generalization performance of the classification model. The weak classifiers inside the classifier are trained over multiple iterations with the AdaBoost M1 method, and each trained weak classifier participates in the next training iteration. Specifically, based on the result of the previous iteration, the weights in the training set of sample points misclassified into the majority class are increased and the weights of correctly classified sample points are decreased before entering the next iteration, improving the classification performance of the classifier; the classifier produced in the next iteration pays more attention to the samples that the previous iteration's classifier got wrong, thereby increasing the rate of correct sample classification. Finally, the classifiers produced in all iterations vote to decide the classification result;
3) Classifier evaluation step: the classifier is evaluated with the confusion matrix, the false negative rate, the accuracy and the ROC curve. The confusion matrix compares the classification results with the actual results and visually depicts the classifier's performance;
The false negative rate and accuracy are computed with the following formulas:
False negative rate = TP/(FP+TN)
Accuracy = (TP+TN)/(TP+TN+FN+FP)
where FP is a sample judged to be positive that is actually negative; FN is a sample judged to be negative that is actually positive; TN is a sample judged to be negative that is actually negative; and TP is a sample judged to be positive that is actually positive;
The ROC curve is commonly used to represent the performance of a classification model. Ideally the curve hugs the upper-left corner, which indicates a high true positive rate at a low false positive rate. The horizontal axis of the ROC curve is the false positive rate FPR and the vertical axis is the true positive rate TPR;
TPR = TP/(TP+FN)
The true positive rate is the ratio of the number of normal samples the model predicts as normal to the total number of samples predicted as normal;
FPR = FP/(FP+TN)
The false positive rate is the ratio of the number of normal samples the model predicts as an attack type to the total number of samples predicted as an attack type.
Further, each classifier produced in an iteration is weighted in the final strong classifier according to its classification error rate; the lower the classification error rate, the higher the weight.
Further, the coarse clustering method uses the Euclidean distance to compute the distance from each sample point to the centroids, compares it with the preset distance thresholds T1 and T2, and finally screens out interference points in the data set according to the number of sample points in each class and their distances to each centroid, deleting the noise sample points. The specific steps are as follows:
1.1.1) Randomly arrange the original sample set into a sample list L = {x1, x2, …, xn}, and set the initial distance thresholds T1 and T2 (T1 > T2) by cross-validated parameter tuning;
1.1.2) Randomly select a sample point xi, i ∈ (1, n), from the list L as the centroid of the first Canopy cluster, and remove xi from the list;
1.1.3) Randomly select a sample point xp, p ∈ (1, n), p ≠ i, from the list L, compute the distance from xp to all centroids, and take the minimum distance Dmin;
If T2 ≤ Dmin ≤ T1, give xp a weak label indicating that it belongs to the canopy cluster achieving Dmin, and add it to that cluster; if Dmin ≤ T2, give xp a strong label indicating that it belongs to that canopy cluster and is close to its centroid, and remove xp from the list; if Dmin > T1, let xp form a new cluster and remove xp from the list;
1.1.4) Repeat step 1.1.3) until the number of elements in the list reaches zero; delete the canopy clusters that contain few sample points, and delete as noise points those minority-class sample points whose neighborhood contains more than twice as many majority-class sample points.
Further, the oversampling method randomly selects a minority-class sample point, picks a point among its k nearest neighbors, and repeatedly interpolates to form multiple new minority-class samples, which are added to the data set. The specific steps are as follows:
1.2.1) Select a minority-class sample i from the data set, with feature vector xi, i ∈ {1, ..., T};
1.2.2) Find the k nearest neighbors of sample xi among all T samples of the minority class, denoted xi(near), near ∈ {1, …, k};
1.2.3) Randomly select one sample xi(nn) from these k nearest neighbors and generate a random number λ1 in (0, 1) to synthesize a new sample xi1:
xi1 = xi + λ1·(xi(nn) − xi)   (1)
1.2.4) Repeat step 1.2.3) N times to synthesize N new samples: xinew, new ∈ {1, ..., N}.
Compared with the prior art, the present invention has the following advantages:
The present invention combines data preprocessing with ensemble learning to build the classifier, proposing an intrusion detection classification method based on a coarse clustering method, an oversampling method and AdaBoost M1. Canopy clusters are used for coarse clustering to identify and remove noise points, and a downsampling method reduces the number of majority-class samples to lessen model overfitting; minority-class sample points are then linearly synthesized, increasing the number of minority-class samples, reducing the imbalance between classes and forming a balanced data set. This balanced sample set compensates for the shortage of minority-class training samples and avoids the loss of important information caused by random sampling. Combined with the AdaBoost M1 classifier, a random forest is used as the base classifier, whose random selection of feature subsets reduces the influence of data dimensionality on classification; a locally optimal weak classifier is obtained in each iteration and the sample weights are then updated. Although training time increases, compared with the results of the original imbalanced data set on the AdaBoost M1 classifier, the method effectively improves the accuracy on minority classes and lowers the average false negative rate.
Brief Description of the Drawings
Fig. 1 is a flow chart of the balanced data set construction of the present invention;
Fig. 2 is a flow chart of the AdaBoost M1 framework;
Fig. 3 is a comparison of the ROC curves of U2R in an example of the present invention;
Fig. 4 is a comparison of the ROC curves of R2L in an example of the present invention.
Detailed Description of the Embodiments
Specific embodiments are given below to further illustrate the present invention.
As shown in Figs. 1 to 4, a detection method for intrusion data specifically comprises the following steps:
1) Balanced data set acquisition step: a coarse clustering method is used to compute, by Euclidean distance, the distance from each sample point in the training data to the cluster centers, and the data are divided into multiple cluster subsets; cluster subsets that contain few sample points and lie far away are treated as noise points, and this noise data is deleted. The different classes of intrusion data are then randomly sampled and dimensionality reduction is performed to lessen overfitting of the models for the different intrusion classes, and the training set is balanced within classes by an oversampling method that increases the number of samples of some classes, yielding a balanced data set.
The coarse clustering method uses the Euclidean distance to compute the distance from each sample point to the centroids, compares it with the preset distance thresholds T1 and T2, and finally screens out interference points in the data set according to the number of sample points in each class and their distances to each centroid, deleting the noise sample points. The specific steps are as follows:
1.1.1) Randomly arrange the original sample set into a sample list L = {x1, x2, …, xn}, and set the initial distance thresholds T1 and T2 (T1 > T2) by cross-validated parameter tuning;
1.1.2) Randomly select a sample point xi, i ∈ (1, n), from the list L as the centroid of the first Canopy cluster, and remove xi from the list;
1.1.3) Randomly select a sample point xp, p ∈ (1, n), p ≠ i, from the list L, compute the distance from xp to all centroids, and take the minimum distance Dmin;
If T2 ≤ Dmin ≤ T1, give xp a weak label indicating that it belongs to the canopy cluster achieving Dmin, and add it to that cluster; if Dmin ≤ T2, give xp a strong label indicating that it belongs to that canopy cluster and is close to its centroid, and remove xp from the list; if Dmin > T1, let xp form a new cluster and remove xp from the list;
1.1.4) Repeat step 1.1.3) until the number of elements in the list reaches zero; delete the canopy clusters that contain few sample points, and delete as noise points those minority-class sample points whose neighborhood contains more than twice as many majority-class sample points.
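A minimal Python sketch of this Canopy-style coarse clustering is given below, assuming the function name canopy_denoise and a min_cluster_size cutoff for dropping small clusters. For brevity it removes every point from the list once it is assigned (whereas the steps above keep weakly labelled points in the list) and it omits the additional rule for minority-class points surrounded by majority-class points.

```python
import numpy as np

def canopy_denoise(X, y, t1, t2, min_cluster_size=10, seed=0):
    """Canopy-style coarse clustering (steps 1.1.1-1.1.4) used to drop noise.

    t1 > t2 are the loose/tight distance thresholds; canopies that end up with
    fewer than min_cluster_size members are treated as noise and removed.
    """
    assert t1 > t2
    rng = np.random.default_rng(seed)
    remaining = list(rng.permutation(len(X)))   # 1.1.1) randomly ordered sample list
    first = remaining.pop(0)                    # 1.1.2) first canopy centroid
    centroids = [X[first]]
    members = [[first]]                         # one member-index list per canopy
    while remaining:                            # 1.1.3) assign each point by D_min
        p = remaining.pop(0)
        dists = np.linalg.norm(np.asarray(centroids) - X[p], axis=1)
        j, d_min = int(dists.argmin()), float(dists.min())
        if d_min > t1:                          # far from every canopy: start a new one
            centroids.append(X[p])
            members.append([p])
        else:                                   # weak (t2 <= d <= t1) or strong (d <= t2) label
            members[j].append(p)
    # 1.1.4) keep only the points that belong to sufficiently large canopies
    keep = sorted(i for m in members if len(m) >= min_cluster_size for i in m)
    keep = np.asarray(keep)
    return X[keep], y[keep]
```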
The oversampling method randomly selects a minority-class sample point, picks a point among its k nearest neighbors, and repeatedly interpolates to form multiple new minority-class samples, which are added to the data set. The specific steps are as follows:
1.2.1) Select a minority-class sample i from the data set, with feature vector xi, i ∈ {1, ..., T};
1.2.2) Find the k nearest neighbors of sample xi among all T samples of the minority class, denoted xi(near), near ∈ {1, …, k};
1.2.3) Randomly select one sample xi(nn) from these k nearest neighbors and generate a random number λ1 in (0, 1) to synthesize a new sample xi1:
xi1 = xi + λ1·(xi(nn) − xi)   (1)
1.2.4) Repeat step 1.2.3) N times to synthesize N new samples: xinew, new ∈ {1, ..., N}.
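Below is a minimal Python sketch of this SMOTE-style interpolation (steps 1.2.1-1.2.4); the function name smote_oversample and its default parameters are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_oversample(X_min, k=5, n_new=4, seed=0):
    """Synthesize n_new samples per minority sample by interpolating toward
    one of its k nearest neighbours within the minority class (equation (1))."""
    rng = np.random.default_rng(seed)
    # k + 1 because each point is returned as its own nearest neighbour
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    synthetic = []
    for i, x_i in enumerate(X_min):
        for _ in range(n_new):
            j = rng.choice(idx[i][1:])       # random neighbour x_i(nn), excluding self
            lam = rng.random()               # lambda_1 in (0, 1)
            synthetic.append(x_i + lam * (X_min[j] - x_i))  # x_i1 = x_i + lam*(x_i(nn) - x_i)
    return np.asarray(synthetic)
```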
2) Data classification step: the balanced data set is classified by a classifier that adjusts the weights of misclassified samples to improve the generalization performance of the classification model. The weak classifiers inside the classifier are trained over multiple iterations with the AdaBoost M1 method, and each trained weak classifier participates in the next training iteration. Specifically, based on the result of the previous iteration, the weights in the training set of sample points misclassified into the majority class are increased and the weights of correctly classified sample points are decreased before entering the next iteration, improving the classification performance of the classifier; the classifier produced in the next iteration pays more attention to the samples that the previous iteration's classifier got wrong, thereby increasing the rate of correct sample classification. Finally, the classifiers produced in all iterations vote to decide the classification result;
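One way to realize this training loop is scikit-learn's AdaBoostClassifier with a RandomForestClassifier base estimator and the SAMME algorithm, as sketched below; the parameter values are illustrative assumptions rather than values specified here, and older scikit-learn versions name the estimator argument base_estimator.

```python
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier

def train_boosted_classifier(X_train, y_train, n_rounds=10):
    """AdaBoost M1-style training: each round reweights misclassified samples,
    and the weak learners vote, weighted by their error rates."""
    base = RandomForestClassifier(n_estimators=50, random_state=0)
    clf = AdaBoostClassifier(
        estimator=base,          # random forest as the weak/base classifier
        n_estimators=n_rounds,   # number of boosting iterations
        algorithm="SAMME",       # multi-class AdaBoost, in the spirit of AdaBoost M1
        random_state=0,
    )
    clf.fit(X_train, y_train)
    return clf
```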
3) Classifier evaluation step: the classifier is evaluated with the confusion matrix, the false negative rate, the accuracy and the ROC curve. The confusion matrix compares the classification results with the actual results and visually depicts the classifier's performance;
The false negative rate and accuracy are computed with the following formulas:
False negative rate = TP/(FP+TN)
Accuracy = (TP+TN)/(TP+TN+FN+FP)
where FP is a sample judged to be positive that is actually negative; FN is a sample judged to be negative that is actually positive; TN is a sample judged to be negative that is actually negative; and TP is a sample judged to be positive that is actually positive;
The ROC curve is commonly used to represent the performance of a classification model. Ideally the curve hugs the upper-left corner, which indicates a high true positive rate at a low false positive rate. The horizontal axis of the ROC curve is the false positive rate FPR and the vertical axis is the true positive rate TPR;
TPR = TP/(TP+FN)
The true positive rate is the ratio of the number of normal samples the model predicts as normal to the total number of samples predicted as normal;
FPR = FP/(FP+TN)
The false positive rate is the ratio of the number of normal samples the model predicts as an attack type to the total number of samples predicted as an attack type.
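The following Python sketch computes these evaluation quantities with scikit-learn for a binary normal-versus-attack view of the predictions; the variable names and the probability threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import auc, confusion_matrix, roc_curve

def evaluate_binary(y_true, y_score, threshold=0.5):
    """Confusion matrix, accuracy, TPR/FPR and ROC/AUC for a binary
    labelling where 1 is the positive class and y_score is a predicted score."""
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    accuracy = (tp + tn) / (tp + tn + fp + fn)   # accuracy formula above
    tpr_point = tp / (tp + fn)                   # TPR = TP/(TP+FN)
    fpr_point = fp / (fp + tn)                   # FPR = FP/(FP+TN)
    fpr, tpr, _ = roc_curve(y_true, y_score)     # ROC curve and its AUC
    return {
        "confusion_matrix": np.array([[tn, fp], [fn, tp]]),
        "accuracy": accuracy,
        "tpr": tpr_point,
        "fpr": fpr_point,
        "auc": auc(fpr, tpr),
    }
```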
Each classifier produced in an iteration is weighted in the final strong classifier according to its classification error rate; the lower the classification error rate, the higher the weight.
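As a concrete reference point, standard AdaBoost M1 derives this weight from the weighted error rate εt of the t-th weak classifier and lets the final strong classifier vote accordingly; the usual form, given here as an assumed instantiation of the rule just described rather than a formula stated in this text, is:
αt = ln((1 − εt)/εt)
H(x) = argmax over y of Σt αt·[ht(x) = y]
Since αt shrinks as εt approaches 0.5, a weak classifier with a lower classification error rate receives a higher voting weight, consistent with the statement above.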
Specifically, the experiments use the KDD CUP 99 data set. This data set contains a large amount of network traffic data, roughly more than 5,000,000 network connection records, together with about 2,000,000 test records. To keep the data volume manageable, the data set is randomly sampled at a 10% ratio, the sampled result is used as the training set for learning, and 10% of the test data is used as the test set; this effectively reduces model-building time with little effect on accuracy. The training data set used in this experiment contains 49,399 training records. The data set has 41 features and 4 attack types: denial-of-service attacks (DOS), privilege-acquisition attacks originating from a remote host (Remote to Local, R2L), port-monitoring/scanning attacks (PROBE), and privilege-escalation attacks (User to Root, U2R). The distribution of attack types in the data set is shown in Table 1 below:
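A sketch of this sampling step with pandas is shown below; the file name kddcup.data_10_percent and the 41-feature-plus-label column layout correspond to the commonly distributed form of KDD CUP 99 and are assumptions here, since the text does not specify how the data are loaded.

```python
import pandas as pd

# 41 feature columns plus the label column (assumed standard KDD CUP 99 layout).
COLUMNS = [f"f{i}" for i in range(41)] + ["label"]

def load_and_sample(path="kddcup.data_10_percent", frac=0.10, seed=0):
    """Load KDD CUP 99 records and draw a random 10% sample as the training set."""
    df = pd.read_csv(path, names=COLUMNS)
    return df.sample(frac=frac, random_state=seed)
```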
Table 1
Because the samples in the original data set are imbalanced and the number of U2R records is far smaller than that of DOS and Normal records, the results and the generalization performance of the model would be affected. Canopy clustering is therefore used to remove noise points, the coarse clustering scheme is used to raise the number of the minority U2R and R2L samples, and the DOS and Normal types, which contain many records, are downsampled; the synthetic U2R and R2L records, the downsampled data, and the Probe-type data are then mixed into a new balanced data set. The distribution of the balanced data set produced by this scheme is shown in Table 2 below:
Table 2
Specifically, the imbalanced training set obtained by random sampling is first evaluated with 10-fold cross-validation: the data are split into 10 equal parts, one part is used for validation and the remaining nine for training, iterating 10 times in turn. The average result of the 10 models is taken as the result of the whole model, the test set is used for testing, and the model trained on this training set is named AdaboostM1. The balanced data set is likewise evaluated with 10-fold cross-validation, with a random forest as the base classifier; random forests handle high-dimensional data without feature selection and can balance the error on imbalanced data sets, so they combine well with AdaboostM1. The model trained on the balanced data set is named SAdaboostM1, and the model trained on the original data set is named AdaboostM1.
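A sketch of this comparison using stratified 10-fold cross-validation in scikit-learn is given below; it assumes X_orig/y_orig and X_bal/y_bal have already been prepared as described above, and the model configuration mirrors the earlier training sketch (illustrative parameters only).

```python
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

def make_model():
    # Same configuration as the training sketch above (illustrative parameters).
    return AdaBoostClassifier(
        estimator=RandomForestClassifier(n_estimators=50, random_state=0),
        n_estimators=10, algorithm="SAMME", random_state=0,
    )

def compare_models(X_orig, y_orig, X_bal, y_bal):
    """10-fold CV of AdaboostM1 (original data) vs. SAdaboostM1 (balanced data)."""
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    acc_orig = cross_val_score(make_model(), X_orig, y_orig, cv=cv).mean()
    acc_bal = cross_val_score(make_model(), X_bal, y_bal, cv=cv).mean()
    return {"AdaboostM1": acc_orig, "SAdaboostM1": acc_bal}
```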
The two models are then tested on the test set. The confusion matrix obtained with the SAdaboostM1 model is shown in Table 3 below, and the confusion matrix obtained with the AdaboostM1 model is shown in Table 4. Table 5 gives the per-class false negative rates obtained by testing the AdaboostM1 and SAdaboostM1 models on the test set, and Table 6 gives the corresponding per-class accuracy results.
Table 3
Table 4
Table 5
Table 6
From Tables 5 and 6 it can be seen that, after the samples are processed with this scheme, noise is reduced and the minority-class errors caused by sample imbalance are resolved. Without changing the original overall accuracy, the accuracy for U2R and R2L is greatly improved and the false negative rate is reduced.
The ROC curves of U2R and R2L on the two models are compared below. The ROC curve of the minority class U2R on the SAdaboostM1 model is shown on the left of Fig. 3, with AUC = 0.9779, while its ROC curve on the AdaboostM1 model is shown on the right of Fig. 3. The ROC curve of the minority class R2L on the SAdaboostM1 model is shown on the left of Fig. 4, with AUC = 0.7091; its ROC curve on the AdaboostM1 model is shown on the right of Fig. 4, with AUC = 0.6486.
In summary, attack behaviors in a network environment are diverse and the numbers of collected samples of the different attack types are imbalanced, which makes minority-class attacks hard to identify. This scheme therefore uses Canopy clustering to remove noise points, reducing the error introduced when synthesizing minority-class sample points; it synthesizes data for the attack classes with few records (R2L and U2R) to raise their share of the data while reducing the number of samples of the classes with many records (DOS and Normal); and it then trains the AdaboostM1 classifier on the balanced data set and compares it with the model trained on the AdaboostM1 classifier with the original data set. The experiments show that, without reducing the accuracy on the overall data set, the accuracy on the minority-class U2R attacks improves by 29% and on R2L attacks by 15%, while the average false negative rate drops by 28%. The scheme effectively improves minority-class accuracy and reduces the average false negative rate without lowering the classification accuracy of the majority classes, effectively solving the minority-class misclassification problem in network intrusion detection.
The above are only preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art may make several improvements and refinements without departing from the concept of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention.
Claims (4)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011088479.0A CN112217822B (en) | 2020-10-13 | 2020-10-13 | Detection method for intrusion data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011088479.0A CN112217822B (en) | 2020-10-13 | 2020-10-13 | Detection method for intrusion data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112217822A CN112217822A (en) | 2021-01-12 |
CN112217822B true CN112217822B (en) | 2022-05-27 |
Family
ID=74053817
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011088479.0A Active CN112217822B (en) | 2020-10-13 | 2020-10-13 | Detection method for intrusion data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112217822B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103716204A (en) * | 2013-12-20 | 2014-04-09 | 中国科学院信息工程研究所 | Abnormal intrusion detection ensemble learning method and apparatus based on Wiener process |
WO2016172262A1 (en) * | 2015-04-21 | 2016-10-27 | Placemeter, Inc. | Systems and methods for processing video data for activity monitoring |
CN107220732A (en) * | 2017-05-31 | 2017-09-29 | 福州大学 | A kind of power failure complaint risk Forecasting Methodology based on gradient boosted tree |
CN110674846A (en) * | 2019-08-29 | 2020-01-10 | 南京理工大学 | Oversampling method for imbalanced dataset based on genetic algorithm and k-means clustering |
CN111626336A (en) * | 2020-04-29 | 2020-09-04 | 南京理工大学 | Subway fault data classification method based on unbalanced data set |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8180873B2 (en) * | 2006-11-14 | 2012-05-15 | Fmr Llc | Detecting fraudulent activity |
-
2020
- 2020-10-13 CN CN202011088479.0A patent/CN112217822B/en active Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103716204A (en) * | 2013-12-20 | 2014-04-09 | 中国科学院信息工程研究所 | Abnormal intrusion detection ensemble learning method and apparatus based on Wiener process |
WO2016172262A1 (en) * | 2015-04-21 | 2016-10-27 | Placemeter, Inc. | Systems and methods for processing video data for activity monitoring |
CN107220732A (en) * | 2017-05-31 | 2017-09-29 | 福州大学 | A kind of power failure complaint risk Forecasting Methodology based on gradient boosted tree |
CN110674846A (en) * | 2019-08-29 | 2020-01-10 | 南京理工大学 | Oversampling method for imbalanced dataset based on genetic algorithm and k-means clustering |
CN111626336A (en) * | 2020-04-29 | 2020-09-04 | 南京理工大学 | Subway fault data classification method based on unbalanced data set |
Non-Patent Citations (5)
Title |
---|
Application of Network Intrusion Detection Based on Fuzzy C-Means Clustering Algorithm; Wuling Ren et al.; ISIITA; 2009-12-31; full text *
On Optimizing Load Balancing of Intrusion Detection and Prevention Systems; Anh Le et al.; IEEE; 2008-12-31; full text *
Research on an intrusion detection method based on Fisher-PCA and deep learning; 张鑫杰 et al.; JDAP; 2020-09-30; full text *
Network defense strategy based on attack behavior prediction; 任午令 et al.; Journal of Zhejiang University (Engineering Science); 2014-12-31; full text *
An improved SMOTE algorithm fusing Canopy and K-means for imbalanced data sets; 郭朝有; Science Technology and Engineering; 2020-08-08; full text *
Also Published As
Publication number | Publication date |
---|---|
CN112217822A (en) | 2021-01-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111062425B (en) | Unbalanced data set processing method based on C-K-SMOTE algorithm | |
CN110213222A (en) | Network inbreak detection method based on machine learning | |
CN110135167B (en) | A random forest edge computing terminal security level assessment method | |
CN108874927A (en) | Intrusion detection method based on hypergraph and random forest | |
CN108764366A (en) | Feature selection and clustering sampling integration two-classification method for unbalanced data | |
CN110266672B (en) | Network intrusion detection method based on information entropy and confidence degree downsampling | |
CN114091661B (en) | Oversampling method for improving intrusion detection performance based on generation countermeasure network and k-nearest neighbor algorithm | |
CN112633337A (en) | Unbalanced data processing method based on clustering and boundary points | |
CN111507385B (en) | Extensible network attack behavior classification method | |
CN111695597A (en) | Credit fraud group recognition method and system based on improved isolated forest algorithm | |
CN109150830B (en) | Hierarchical intrusion detection method based on support vector machine and probabilistic neural network | |
CN115048988A (en) | Unbalanced data set classification fusion method based on Gaussian mixture model | |
CN112422546A (en) | A Network Anomaly Detection Method Based on Variable Neighborhood Algorithm and Fuzzy Clustering | |
CN111833175A (en) | Internet financial platform application fraud behavior detection method based on KNN algorithm | |
CN110324178B (en) | Network intrusion detection method based on multi-experience nuclear learning | |
CN111753299A (en) | An unbalanced malware detection method based on group integration | |
CN114692781A (en) | A fault imbalance classification method for smart meters based on MSL-XGBoost model | |
CN116647844A (en) | A Vehicle Network Intrusion Detection Method Based on Stacking Integration Algorithm | |
CN112217822B (en) | Detection method for intrusion data | |
CN113392908A (en) | Unbalanced data oversampling algorithm based on boundary density | |
CN115348063B (en) | A method for identifying power system network traffic based on DNN and K-means | |
CN116647409A (en) | Invasion detection method based on WK-1DCNN-GRU hybrid model | |
CN116318877A (en) | Anti-sample Defense Method for Intrusion Detection System Using Multiple Feature Manifold Vectors | |
CN108846424A (en) | A kind of fuzzy multi-core classifier of cost-sensitive | |
CN110348481B (en) | A Network Intrusion Detection Method Based on Gravity of Nearest Neighbor Samples |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right |
Effective date of registration: 20250123 Address after: Room 2004A, Building B, No. 4760 Jiangnan Avenue, Puyan Street, Binjiang District, Hangzhou City, Zhejiang Province 310053 Patentee after: SuZi Information Technology (Hangzhou) Co.,Ltd. Country or region after: China Address before: 310018, No. 18 Jiao Tong Street, Xiasha Higher Education Park, Hangzhou, Zhejiang Patentee before: ZHEJIANG GONGSHANG University Country or region before: China |
|
TR01 | Transfer of patent right |