CN118282707A

CN118282707A - An Intrusion Detection Method Based on Incremental Training

Info

Publication number: CN118282707A
Application number: CN202410184475.4A
Authority: CN
Inventors: 丁熠; 李云杰; 秦志光; 刘瑶; 曹明生; 周尔强; 邓伏虎; 赵洋; 秦臻
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2024-02-19
Filing date: 2024-02-19
Publication date: 2024-07-02

Abstract

The invention discloses an intrusion detection method based on incremental training, which relates to the technical field of network security and comprises the following steps: extracting key feature data from target network traffic data; performing binary classification detection on the key feature data using the most recently updated integrated isolation forest detection model to obtain target abnormal traffic data and target normal traffic data; performing multi-class detection on the target abnormal traffic data using the most recently updated adaptive random forest model, and labeling the target abnormal traffic data with its attack type; constructing a first incremental training data set from the target normal traffic data and historical network traffic data, and progressively updating the integrated isolation forest detection model; and constructing a second incremental training data set from the target abnormal traffic data, and updating the adaptive random forest model. The invention improves the comprehensive detection capability of the intrusion detection system.

Description

An Intrusion Detection Method Based on Incremental Training

Technical Field

The present invention relates to the field of network security technology, and in particular to an intrusion detection method based on incremental training.

Background

With the rapid development of technologies such as cloud computing, the Internet of Things, and artificial intelligence, the scale and complexity of networks keep increasing, giving attackers more targets and opportunities. Attack methods are also becoming more diverse and covert, and include malware, zero-day exploits, social engineering, phishing emails, and ransomware.

Intrusion detection plays an important role in network security. It monitors and analyzes network traffic, system logs, user behavior, and other data in real time to find abnormal activity that deviates from normal behavior patterns. Although network intrusion detection has achieved great success and offers high efficiency and accuracy, current intrusion detection systems do not adapt to shifts in network traffic data, so their detection accuracy keeps declining over time. It is therefore necessary to propose an intrusion detection method based on incremental training that learns from shifts in network traffic data and so maintains the accuracy of the intrusion detection system.

Summary of the Invention

The present invention provides an intrusion detection method based on incremental training that detects and analyzes network traffic data efficiently and accurately. If the data is abnormal attack data, the method also identifies its attack type, providing a basis for subsequent targeted defense. In addition, the invention performs incremental learning on the detection results to learn from shifts in network traffic data, thereby maintaining the accuracy of the intrusion detection system.

The technical solution adopted by the present invention is as follows:

The present invention provides an intrusion detection method based on incremental training, comprising the following steps:

S1. Preprocess the collected target network traffic data and extract key feature data;

S2. Use the most recently updated integrated isolation forest detection model to perform binary classification detection on the key feature data to obtain target abnormal traffic data and target normal traffic data, wherein the integrated isolation forest detection model includes multiple sub-isolation forest detection models;

S3. Use the most recently updated adaptive random forest model to perform multi-class detection on the target abnormal traffic data, label the target abnormal traffic data with its attack type, and take the corresponding network security defense measures according to that attack type;

S4. Randomly sample the target normal traffic data and merge it with historical network traffic data to construct a first incremental training data set; randomly sample the target abnormal traffic data, collecting samples of each abnormal type, to construct a second incremental training data set;

S5. Use the first incremental training data set to train new sub-isolation forest detection models, and use them to replace the low-accuracy sub-isolation forest detection models in the integrated isolation forest detection model, completing the progressive incremental update of the integrated isolation forest detection model;

S6. Use the second incremental training data set to train and update the adaptive random forest model.
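As an illustration of how the two incremental training data sets in step S4 might be assembled, the following is a minimal sketch. The function name `build_incremental_sets`, the sample sizes `n_normal` and `n_per_type`, and the data layout (lists of feature vectors, with abnormal records paired with their attack type) are all assumptions made for the example; the text above does not specify them.

```python
import random
from collections import defaultdict


def build_incremental_sets(normal, history, abnormal, n_normal, n_per_type, seed=0):
    """Sketch of step S4 (hypothetical helper, not from the source).

    normal, history: lists of feature vectors.
    abnormal: list of (feature_vector, attack_type) pairs labeled in step S3.
    """
    rng = random.Random(seed)

    # First set: a random sample of newly detected normal traffic,
    # merged with the historical traffic data.
    first_set = rng.sample(normal, min(n_normal, len(normal))) + list(history)

    # Second set: abnormal traffic grouped by attack type, then randomly
    # sampled within each type so that every type is collected.
    by_type = defaultdict(list)
    for features, attack_type in abnormal:
        by_type[attack_type].append(features)
    second_set = {
        attack_type: rng.sample(rows, min(n_per_type, len(rows)))
        for attack_type, rows in by_type.items()
    }
    return first_set, second_set
```

Sampling within each attack type is one way to read "collecting samples of each abnormal type"; a cap per type is an assumption of this sketch.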

In a preferred embodiment of the present invention, step S1 specifically includes the following steps:

S101. Calculate the correlation coefficients between the features of the collected network traffic data to obtain inter-feature correlation coefficients;

S102. Sort the inter-feature correlation coefficients and extract the feature pairs whose correlation coefficient is greater than a set threshold to obtain a feature pair set;

S103. For each feature pair in the feature pair set, randomly delete one of the two features; the retained features form the key feature data.
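Steps S101 to S103 can be sketched as follows. The text does not name the correlation coefficient, so the Pearson coefficient used here is an assumption, as are the function names, the default threshold of 0.9, and the choice to skip a pair once one of its features has already been deleted.

```python
import random
from itertools import combinations
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length numeric lists (assumed metric)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # constant column: treated as uncorrelated (assumption)
    return cov / sqrt(var_a * var_b)


def select_key_features(columns, threshold=0.9, seed=0):
    """columns: dict mapping feature name -> list of numeric values."""
    rng = random.Random(seed)
    # S101: correlation coefficient for every pair of features.
    corr = {(f, g): pearson(columns[f], columns[g])
            for f, g in combinations(columns, 2)}
    # S102: sort the coefficients and keep the pairs above the threshold.
    ranked = sorted(corr.items(), key=lambda item: -item[1])
    pairs = [pair for pair, r in ranked if r > threshold]
    # S103: for each remaining pair, randomly delete one of its features.
    dropped = set()
    for f, g in pairs:
        if f in dropped or g in dropped:
            continue  # pair already broken by an earlier deletion (assumption)
        dropped.add(rng.choice((f, g)))
    return [f for f in columns if f not in dropped]
```

The sketch compares the signed coefficient with the threshold, as the text states; comparing the absolute value instead would also remove strongly negatively correlated features.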

In a preferred embodiment of the present invention, the initial integrated isolation forest detection model is constructed as follows:

S201. Randomly sample historical network traffic data to construct a total training data set Φ;

S202. Determine the maximum number L of sub-isolation forest detection models in the integrated isolation forest detection model, and a storage set T for storing the integrated isolation forest detection model;

S203. Randomly sample from the total training data set Φ to form a data subset ψ;

S204. Use the data subset ψ to train a sub-isolation forest detection model, and put the trained sub-isolation forest detection model into the storage set T;

S205. If the number of sub-isolation forest detection models in the storage set T reaches the maximum number L, construction of the integrated isolation forest detection model is complete; otherwise return to step S203.
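The loop formed by steps S202 to S205 might look like the sketch below. The trainer is passed in as a callable named `train_sub_forest`, and `subset_size` controls the size of ψ; both names are hypothetical, since the text does not fix how a sub-model is trained or how large each subset is.

```python
import random


def build_ensemble(total_data, max_models, subset_size, train_sub_forest, seed=0):
    """Build the initial ensemble from the total training data set.

    total_data: the data set Φ (a list of records).
    max_models: the maximum number L of sub-models.
    train_sub_forest: hypothetical callable that trains one sub-model on a subset.
    """
    rng = random.Random(seed)
    storage = []  # S202: the storage set T
    # S205: keep looping until T holds L sub-models.
    while len(storage) < max_models:
        # S203: draw a random data subset ψ from Φ.
        subset = rng.sample(total_data, min(subset_size, len(total_data)))
        # S204: train a sub-model on ψ and store it in T.
        storage.append(train_sub_forest(subset))
    return storage
```

Because each sub-model sees a different random subset, the sub-models disagree on some records, which is what the later voting and replacement steps rely on.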

In a preferred embodiment of the present invention, the process of performing binary classification detection on the key feature data using the integrated isolation forest detection model includes:

For each sub-isolation forest detection model in the integrated isolation forest detection model, use it to perform binary classification detection on the key feature data to obtain a detection result;

Aggregate all the detection results by hard voting, and take the voting outcome as the detection result of the integrated isolation forest (the result of the integrated isolation forest is obtained by counting the results of all the sub-isolation forests and following the majority, so it may differ from the results of some sub-isolation forests).

In a preferred embodiment of the present invention, the integrated isolation forest detection model has a total-detection counter that records how many network traffic records have been detected in total; each sub-isolation forest detection model in the integrated isolation forest detection model has an abnormal-count counter that records the number of network traffic records it detected incorrectly;

After each detection, the abnormal-count counter of every sub-isolation forest detection model whose result differs from the result of the integrated isolation forest is incremented by one.
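The hard voting and the two kinds of counters described above can be sketched together. The class name `VotingEnsemble` and the representation of a sub-model as a callable returning "normal" or "abnormal" are assumptions for illustration. The text does not say how ties are broken; here a tie goes to whichever label was voted first, so an odd number of sub-models is the simpler choice.

```python
from collections import Counter


class VotingEnsemble:
    """Hard-voting wrapper with the counters described in the text (sketch)."""

    def __init__(self, sub_models):
        # Each sub-model is assumed to be a callable: features -> label.
        self.sub_models = list(sub_models)
        self.total_count = 0                              # total-detection counter
        self.error_counts = [0] * len(self.sub_models)    # one counter per sub-model

    def detect(self, features):
        votes = [model(features) for model in self.sub_models]
        # Majority vote; ties fall to the first-seen label (assumption).
        result = Counter(votes).most_common(1)[0][0]
        self.total_count += 1
        # A sub-model that disagrees with the ensemble is counted as wrong.
        for index, vote in enumerate(votes):
            if vote != result:
                self.error_counts[index] += 1
        return result
```

Note that "wrong" here means "different from the ensemble result", as in the text, so the counters need no ground-truth labels.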

In a preferred embodiment of the present invention, each sub-isolation forest detection model performs binary classification detection on the key feature data as follows:

For the key feature data, each data point x_i traverses every isolation tree (iTree) inside the sub-isolation forest detection model; the average height of the data point in the forest is calculated, the average heights of all data points are normalized, and the anomaly score S of each data point is computed. The harmonic number that appears in the score is approximated as:

H(i) ≈ ln(i) + 0.5772156649

Here, x denotes the x-th isolation tree; n is the total number of isolation trees; h(x) is the number of edges the current data point passes through from the root node of the current isolation tree to the corresponding leaf node, that is, the path length of the data point in the current isolation tree; H(i) is the harmonic number, estimated using Euler's constant; and E(h(x)) is the average of the data point's path lengths over all isolation trees. The shorter the average path length of a data point, the closer its anomaly score S is to 1 and the more likely the data point is an anomaly. The detection result that a sub-isolation forest detection model produces for the key feature data is a data label, either normal or abnormal, which is derived from the anomaly score S.
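The score formula itself does not appear in the text above. The sketch below therefore uses the standard isolation forest score, S = 2^(-E(h(x)) / c(n)) with c(n) = 2H(n-1) - 2(n-1)/n, which is consistent with the definitions of h(x), H(i), and E(h(x)) given here; treating it as the intended formula is an assumption. In the standard formulation n is the number of samples used to build each tree, which is how the parameter `n` is used below.

```python
from math import log

EULER_GAMMA = 0.5772156649


def harmonic(i):
    """H(i) ≈ ln(i) + 0.5772156649, as given in the text."""
    return log(i) + EULER_GAMMA


def c(n):
    """Normalizing term of the standard isolation forest score (assumed)."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * harmonic(n - 1) - 2.0 * (n - 1) / n


def anomaly_score(path_lengths, n):
    """path_lengths: h(x) of one data point in each isolation tree."""
    if n < 2 or not path_lengths:
        raise ValueError("need n >= 2 and at least one path length")
    mean_path = sum(path_lengths) / len(path_lengths)  # E(h(x))
    return 2.0 ** (-mean_path / c(n))
```

A point whose average path length equals c(n) scores exactly 0.5, and shorter paths push the score toward 1, matching the behavior described above.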

In a preferred embodiment of the present invention, the initial adaptive random forest algorithm model is constructed as follows:

S301. Collect historical abnormal traffic data generated by network attacks to construct an attack traffic data set;

S302. Use the attack traffic data set to train an adaptive random forest algorithm model.

In a preferred embodiment of the present invention, step S5 comprises the following steps:

S501. For each sub-isolation forest detection model in the integrated isolation forest detection model, calculate its detection accuracy from the number of incorrectly detected network traffic records recorded by its abnormal-count counter and the total number of detected network traffic records recorded by the total-detection counter of the integrated isolation forest detection model; if its detection accuracy is lower than the accuracy threshold, mark it as a model to be replaced;

S502. Count the number k of models to be replaced;

S503. Randomly sample the first incremental training data set, use the samples to train k new sub-isolation forest detection models, and use the new models to replace the k models to be replaced in the integrated isolation forest detection model, completing the progressive incremental update of the integrated isolation forest detection model.
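Steps S501 to S503 can be sketched as below. Computing accuracy as 1 - errors / total is an assumption (the text only says the accuracy is calculated from the two counters), as are the function and parameter names and the choice to reset a replaced model's counter.

```python
import random


def progressive_update(sub_models, error_counts, total_count, first_set,
                       accuracy_threshold, subset_size, train_sub_forest, seed=0):
    """Replace low-accuracy sub-models in place; returns k (sketch)."""
    rng = random.Random(seed)
    # S501: accuracy from the counters (assumed as 1 - errors / total).
    to_replace = [
        index for index, errors in enumerate(error_counts)
        if total_count > 0 and 1 - errors / total_count < accuracy_threshold
    ]
    # S502: k is the number of models marked for replacement.
    k = len(to_replace)
    # S503: train k new sub-models on random samples of the first
    # incremental training data set and swap them in.
    for index in to_replace:
        subset = rng.sample(first_set, min(subset_size, len(first_set)))
        sub_models[index] = train_sub_forest(subset)
        error_counts[index] = 0  # fresh counter for the new model (assumption)
    return k
```

Only the flagged sub-models are retrained, so the rest of the ensemble keeps what it learned from earlier data.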

In a preferred embodiment of the present invention, in step S6 the adaptive random forest model is updated through incremental learning and the generation of new decision trees, so that it adapts to the distribution and abnormal patterns of the new data.

In a preferred embodiment of the present invention, the integrated isolation forest detection model and the adaptive random forest model can also be updated by feeding new data directly into the models.

Compared with the prior art, the present invention has the following beneficial effects:

The present invention first performs binary classification detection on the collected network traffic data, then performs multi-class detection on the abnormal data to determine the attack type, and keeps the detection models up to date through incremental training;

Compared with existing intrusion detection methods, the present invention combines anomaly-detection-based intrusion detection with feature-recognition-based intrusion detection, thereby improving the comprehensive detection capability of the intrusion detection system;

The incremental-training-based intrusion detection method proposed in the present invention can learn the patterns of normal behavior and detect abnormal behavior that deviates from them, and can also quickly identify known attacks by comparing abnormal data against known intrusion features;

In addition, because the present invention is based on incremental training, new samples can be learned on top of the existing model, which preserves historical knowledge. This matters for models and data accumulated over a long period, since it avoids the knowledge loss caused by retraining. The system can respond flexibly to dynamically changing data and environments: the model is adjusted and updated locally according to the characteristics of the new samples without retraining the entire model, which lets it adapt to changing needs and conditions.

To make the above objects, features, and advantages of the present invention more apparent, embodiments of the present invention are described in detail below with reference to the accompanying drawings.

Brief Description of the Drawings

To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly introduced below. It should be understood that the following drawings show only certain embodiments of the present invention and should therefore not be regarded as limiting its scope. A person of ordinary skill in the art can obtain other related drawings from these drawings without creative effort.

FIG. 1 is a flowchart of the intrusion detection method based on incremental training of the present invention;

FIG. 2 is a flowchart of the extraction of key feature data in the present invention;

FIG. 3 is a flowchart of the construction of the initial integrated isolation forest detection model of the present invention;

FIG. 4 is a flowchart of the progressive incremental update of the integrated isolation forest detection model of the present invention;

FIG. 5 shows the effect of incremental training in experiments with the method of the present invention.

Detailed Description

To make the purpose, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are some, but not all, of the embodiments of the present invention.

The intrusion detection method based on incremental training of the present invention proposes, following the idea of ensemble learning, an isolation forest detection algorithm with progressive incremental training. This algorithm performs anomaly detection on the collected network traffic data; the detected abnormal traffic data then undergoes multi-class detection by an adaptive random forest algorithm to determine the attack type to which it belongs, providing reference and support for defensive response operations.

In addition, the present invention can perform incremental training on new data to cope with drift in network traffic data. It can respond flexibly to dynamically changing data and abnormal patterns over time and capture abnormal behavior more effectively.

The specific implementation of the method of the present invention is described below with reference to FIGS. 1 to 4 and includes the following steps:

S1. Preprocess the collected target network traffic data and extract key feature data, as shown in FIG. 2, following steps S101 to S103 described above.

S2. Use the most recently updated integrated isolation forest detection model to perform binary classification detection on the key feature data to obtain target abnormal traffic data and target normal traffic data, where the integrated isolation forest detection model includes multiple sub-isolation forest detection models.

As shown in FIG. 3, the initial integrated isolation forest detection model is constructed as follows:

S201. Randomly sample historical network traffic data to construct a total training data set Φ.

S202. Determine the maximum number L of sub-isolation forest detection models in the integrated isolation forest detection model, and a storage set T for storing the integrated isolation forest detection model.

S203. Randomly sample from the total training data set Φ to form a data subset ψ.

S204. Use the data subset ψ to train a sub-isolation forest detection model, as follows:

A sub-isolation forest detection model contains several isolation trees (iTrees), each of which is a binary tree. For each tree, sample points are randomly selected from the data subset ψ to form a sample subset, which is placed in the root node of the tree;

Randomly choose a dimension (feature) and randomly generate a cut point p within the data of the current node (the cut point lies between the maximum and minimum values of the chosen dimension in the current node's data);

The cut point p defines a hyperplane that divides the data space of the current node into two subspaces: data whose value in the chosen dimension is less than p goes to the left child of the current node, and data whose value is greater than or equal to p goes to the right child.

Repeat the above steps recursively in the child nodes, constructing new child nodes until a child node contains only one data point (and cannot be cut further) or the child node has reached the height limit. This completes the current iTree. Repeat the iTree creation steps to form the sub-isolation forest detection model, and put the trained sub-isolation forest detection model into the storage set T.
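The iTree construction described above can be sketched in a few lines. The dictionary-based node layout and the names `build_itree` and `path_length` are choices made for this example, not part of the source; stopping when the chosen dimension has no spread is also an assumption, since the text does not cover that case.

```python
import random


def build_itree(points, height_limit, rng, depth=0):
    """Recursively build one isolation tree from a list of feature vectors."""
    # Stop when the node holds at most one point or the height limit is reached.
    if len(points) <= 1 or depth >= height_limit:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))          # random dimension (feature)
    low = min(p[dim] for p in points)
    high = max(p[dim] for p in points)
    if low == high:
        return {"size": len(points)}             # no cut possible here (assumption)
    cut = rng.uniform(low, high)                 # cut point p between min and max
    left = [p for p in points if p[dim] < cut]   # values below p go left
    right = [p for p in points if p[dim] >= cut] # values at or above p go right
    return {
        "dim": dim,
        "cut": cut,
        "left": build_itree(left, height_limit, rng, depth + 1),
        "right": build_itree(right, height_limit, rng, depth + 1),
    }


def path_length(tree, point):
    """Number of edges from the root to the leaf that the point falls into."""
    edges = 0
    while "dim" in tree:
        tree = tree["left"] if point[tree["dim"]] < tree["cut"] else tree["right"]
        edges += 1
    return edges
```

A sub-isolation forest would then be a list of such trees, each built from its own random sample of ψ, with `path_length` supplying h(x) for the anomaly score.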

The process of performing binary classification detection on the key feature data using the integrated isolation forest detection model includes:

For each sub-isolation forest detection model in the integrated isolation forest detection model, use it to perform binary classification detection on the key feature data to obtain a detection result;

Aggregate all the detection results by hard voting, and take the voting outcome as the detection result of the integrated isolation forest, which yields the target abnormal traffic data and the target normal traffic data.

In the present invention, the integrated isolation forest detection model has a total-detection counter that records how many network traffic records have been detected in total; each sub-isolation forest detection model in the integrated isolation forest detection model has an abnormal-count counter that records the number of network traffic records it detected incorrectly;

Each sub-isolation forest detection model performs binary classification detection on the key feature data by computing the anomaly score S of each data point as described above, with the harmonic number approximated as:

H(i) ≈ ln(i) + 0.5772156649

S3. Use the most recently updated adaptive random forest model to perform multi-class detection on the target abnormal traffic data, label the target abnormal traffic data with its attack type, and take the corresponding network security defense measures according to that attack type.

The initial adaptive random forest algorithm model is constructed following steps S301 and S302 described above.

S4. Randomly sample the target normal traffic data and merge it with historical network traffic data to construct a first incremental training data set. Randomly sample the target abnormal traffic data, collecting samples of each abnormal type, to construct a second incremental training data set.

S5. Use the first incremental training data set to complete the progressive incremental update of the integrated isolation forest detection model, as shown in FIG. 4, following steps S501 to S503 described above.

S6. Use the second incremental training data set to train and update the adaptive random forest model. The adaptive random forest model is updated through incremental learning and the generation of new decision trees, so that it adapts to the distribution and abnormal patterns of the new data.

FIG. 5 shows the effect of incremental training in experiments with the method of the present invention. The figure mainly presents the evaluation metrics used to measure the performance of the classification models, which are introduced below:

First, the basis of all the evaluation metrics is the confusion matrix. A confusion matrix is a table used to evaluate the performance of a classification model. For a binary classification problem, it involves four concepts: positive, negative, true, and false. Their usual meanings are as follows:

A prediction of the normal class is called positive, and a prediction of the abnormal class is called negative.

A correct prediction is called true, and an incorrect prediction is called false.

Combining these concepts gives four key counts: true positive (TP), a normal sample predicted as normal; false positive (FP), an abnormal sample predicted as normal; false negative (FN), a normal sample predicted as abnormal; and true negative (TN), an abnormal sample predicted as abnormal. Together they form the confusion matrix. In the multi-class case, the problem is usually converted into a binary one: the class being evaluated is treated as the normal class and all other classes are treated as abnormal classes, and the confusion matrix is built on that basis.

(1) Accuracy: the proportion of samples the model classifies correctly. It represents the model's ability to classify both normal and abnormal samples correctly; the higher the accuracy, the better the model performs. Accuracy is the ratio of the sum of the true positive (TP) and true negative (TN) counts to the total number of samples:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

(2) Precision: of all the samples the model predicts as positive, the proportion that are actually positive. It measures how likely the model is to be correct when it labels a sample as positive; a high precision means that a positive prediction is more likely to be correct:

Precision = TP / (TP + FP)

(3) Recall: of all the samples that are actually positive, the proportion that are successfully predicted as positive. It measures the model's ability to find the positive samples; a high recall means the model captures most of the actual positive samples:

Recall = TP / (TP + FN)

(4) F1 score: the harmonic mean of precision and recall, used to evaluate model performance as a whole. The F1 score ranges from 0 to 1, and the closer it is to 1, the better the model performs. In some tasks precision and recall are equally important; by observing the F1 score while adjusting the classification threshold or the model parameters, a balance between precision and recall can be found, improving the performance of the model:

F1 = 2 × Precision × Recall / (Precision + Recall)
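The four metrics can be computed directly from the confusion-matrix counts. The sketch below follows the convention stated above, in which the normal class is the positive class; the function names and the choice to return 0.0 when a denominator is zero are assumptions for illustration.

```python
def confusion_counts(y_true, y_pred, positive="normal"):
    """Count TP, FP, FN, TN with the normal class treated as positive."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1   # normal predicted as normal
            else:
                fp += 1   # abnormal predicted as normal
        else:
            if truth == positive:
                fn += 1   # normal predicted as abnormal
            else:
                tn += 1   # abnormal predicted as abnormal
    return tp, fp, fn, tn


def metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, f1); 0.0 on empty denominators."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1
```

For a multi-class report, the same functions can be applied once per class by passing that class as `positive`, which mirrors the one-versus-rest conversion described earlier.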

FIG. 5 also includes two further entries, "macro avg" and "weighted avg". "macro avg" is the macro average, that is, the arithmetic mean of the corresponding metric over all classes. "weighted avg" is the weighted average: each class is weighted by the proportion of samples belonging to it, and the corresponding metric is averaged over the classes with these weights.
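The difference between the two averages is easy to see in code. In this sketch, `per_class_scores` maps each class to one metric value and `support` maps each class to its number of samples; both names and the example numbers in the usage note are illustrative and are not taken from FIG. 5.

```python
def macro_and_weighted(per_class_scores, support):
    """Return (macro average, weighted average) of one per-class metric."""
    classes = list(per_class_scores)
    # Macro average: plain arithmetic mean over the classes.
    macro = sum(per_class_scores[c] for c in classes) / len(classes)
    # Weighted average: each class weighted by its share of the samples.
    total = sum(support[c] for c in classes)
    weighted = sum(per_class_scores[c] * support[c] for c in classes) / total
    return macro, weighted
```

With hypothetical scores of 0.9 for a class with 90 samples and 0.5 for a class with 10 samples, the macro average is 0.7 while the weighted average is 0.86, so rare classes with poor scores pull the macro average down far more than the weighted one.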

在图5中可以看出网络攻击异常检测部分在增量训练后，其准确率、精确率、召回率都有了一定的提升，说明了本发明的渐进式增量训练的孤立森林检测算法模型在攻击检测中能够通过增量训练来对数据进行进一步学习分析，从而提高模型的性能。As can be seen in Figure 5, after incremental training, the accuracy, precision and recall of the network attack anomaly detection part have been improved to a certain extent, indicating that the isolation forest detection algorithm model of the progressive incremental training of the present invention can further learn and analyze the data through incremental training in attack detection, thereby improving the performance of the model.

然后是网络攻击多分类检测部分的增量训练结果。其中分类报告中的“normal”，“dos”，“probe”，“r2l”，“u2r”代表不同的数据类型信息，其分别代表着“正常”，“dos攻击”，“Probe侦察”，“R2L远程到本地攻击”和“U2R用户到根攻击”五种数据类别。通过分析图5可以看出，在增量训练之前，对于“probe”，“r2l”，“u2r”三种攻击类型的检测的精确率、召回率和F1分数都较低，这是由于增量训练之前相关攻击类型数据较少导致模型对其学习分析不足导致的。而在增量训练之后，由于增加了“probe”，“r2l”，“u2r”三种攻击类型数据，使得模型逐渐学习了这些攻击类型的攻击特征信息，使得在测试中这些攻击类型的精确率、召回率和F1分数都有了明显提升，从而说明了本发明的自适应随机森林模型在攻击分类上能够做到持续学习，从而对各种攻击类型的攻击特征进行充分学习，提升自身的攻击分类能力。Then there are the incremental training results of the multi-classification detection part of network attacks. Among them, "normal", "dos", "probe", "r2l", and "u2r" in the classification report represent different data type information, which respectively represent five data categories of "normal", "dos attack", "Probe reconnaissance", "R2L remote to local attack" and "U2R user to root attack". By analyzing Figure 5, it can be seen that before incremental training, the precision, recall rate and F1 score of the detection of the three attack types of "probe", "r2l" and "u2r" are all low. This is because the model is insufficient to learn and analyze the three attack types due to the lack of relevant attack type data before incremental training. After incremental training, due to the addition of the three attack type data of "probe", "r2l" and "u2r", the model gradually learns the attack feature information of these attack types, so that the precision, recall rate and F1 score of these attack types in the test have been significantly improved, which shows that the adaptive random forest model of the present invention can achieve continuous learning in attack classification, so as to fully learn the attack features of various attack types and improve its own attack classification ability.

In summary, the two models proposed in the present invention can perform incremental learning continuously while maintaining high performance metrics, so that they learn from new data and thereby preserve their performance.

The above is only a preferred embodiment of the present invention and is not intended to limit the present invention. For those skilled in the art, various modifications and variations of the present invention are possible. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. An intrusion detection method based on incremental training, characterized in that it comprises the following steps:

S1. Preprocess the collected target network traffic data and extract key feature data;

S2. Use the recently updated integrated isolation forest detection model to perform binary classification detection on the key feature data to obtain target abnormal traffic data and target normal traffic data, wherein the integrated isolation forest detection model includes multiple sub-isolation forest detection models;

S3. Use the recently updated adaptive random forest model to perform multi-classification detection on the target abnormal traffic data, label the target abnormal traffic data, indicate its attack type, and take corresponding network security defense measures according to the attack type to which the target abnormal traffic data belongs;

S4. Randomly sample the target normal traffic data and merge the sample with historical network traffic data to construct a first incremental training data set; randomly sample the target abnormal traffic data, collecting each abnormal type, to construct a second incremental training data set;

S5. Use the first incremental training data set to train a new sub-isolation forest detection model, and use the new sub-isolation forest detection model to replace a sub-isolation forest detection model with low accuracy in the integrated isolation forest detection model, thereby completing the progressive incremental update of the integrated isolation forest detection model;

S6. Use the second incremental training data set to train and update the adaptive random forest model.
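
The data flow of steps S2 to S4 can be sketched as follows. This is only a minimal illustration under stated assumptions: `is_abnormal` and `classify_attack` are hypothetical stand-ins for the integrated isolation forest detection model and the adaptive random forest model, and `sample_size` and the per-attack-type sampling are choices the claim does not fix.

```python
import random
from collections import defaultdict


def detection_round(features, is_abnormal, classify_attack,
                    history, sample_size, rng=None):
    """Run S2-S4 once and return (labelled, first_set, second_set)."""
    rng = rng or random.Random()

    # S2: binary detection splits the traffic into abnormal and normal records.
    abnormal, normal = [], []
    for record in features:
        (abnormal if is_abnormal(record) else normal).append(record)

    # S3: every abnormal record is labelled with its attack type.
    labelled = [(record, classify_attack(record)) for record in abnormal]

    # S4 (first set): sampled normal traffic merged with historical traffic.
    sampled_normal = rng.sample(normal, min(sample_size, len(normal)))
    first_set = list(history) + sampled_normal

    # S4 (second set): abnormal traffic sampled separately per attack type
    # (per-type sampling is an assumption about "collecting each abnormal type").
    by_type = defaultdict(list)
    for record, attack in labelled:
        by_type[attack].append(record)
    second_set = {
        attack: rng.sample(rows, min(sample_size, len(rows)))
        for attack, rows in by_type.items()
    }
    return labelled, first_set, second_set
```

The two returned data sets would then feed steps S5 and S6 respectively; the defense measures of S3 are outside the scope of this sketch.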

2. The intrusion detection method based on incremental training according to claim 1, characterized in that step S1 specifically comprises the following steps:

S101. Calculate the correlation coefficients between the features of the collected network traffic data;

S102. Sort the correlation coefficients between features, and extract the feature pairs whose correlation coefficient is greater than a set threshold to obtain a feature pair set;

S103. For each feature pair in the feature pair set, randomly delete one of its two features; the features that remain form the key feature data.
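
Steps S101 to S103 can be illustrated with a short sketch. It assumes the Pearson correlation coefficient and compares its absolute value with the threshold; the claim names neither the coefficient nor the treatment of negative correlations, so both are assumptions, as are the function names and the default threshold of 0.9.

```python
import math
import random
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: treated as uncorrelated (assumption)
    return cov / math.sqrt(var_x * var_y)


def select_key_features(columns, threshold=0.9, rng=None):
    """columns maps feature name -> list of values; returns the retained names."""
    rng = rng or random.Random()

    # S101: correlation coefficient for every pair of features.
    scored = [(abs(pearson(columns[a], columns[b])), a, b)
              for a, b in combinations(columns, 2)]

    # S102: sort the coefficients and keep the pairs above the threshold.
    pairs = [(a, b) for r, a, b in sorted(scored, reverse=True) if r > threshold]

    # S103: randomly delete one feature from each remaining pair.
    dropped = set()
    for a, b in pairs:
        if a in dropped or b in dropped:
            continue  # one member is already gone, so the pair is resolved
        dropped.add(rng.choice((a, b)))
    return [name for name in columns if name not in dropped]
```

Skipping a pair once one of its members has been deleted avoids removing more features than necessary; the claim does not say how overlapping pairs are handled.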

3. The intrusion detection method based on incremental training according to claim 1 is characterized in that, for the initial integrated isolation forest detection model, its construction process includes:

S201. Collect historical network traffic data by random sampling to construct a total training data set Φ;

S202. Determine the maximum number L of sub-isolation forest detection models in the integrated isolation forest detection model, and a storage set T for storing the integrated isolation forest detection model;

S203. Randomly sample from the total training data set Φ to form a data subset ψ;

S204. Use the data subset ψ to train and generate a sub-isolation forest detection model, and put the trained sub-isolation forest detection model into the storage set T;

S205. If the number of sub-isolation forest detection models in the storage set T reaches the maximum number L, construction of the integrated isolation forest detection model is complete; otherwise, return to step S203.
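
The loop formed by steps S201 to S205 can be sketched as below. The trainer is injected as a hypothetical callable `train_sub_model`, and `total_size` and `subset_size` are assumed parameters; the claim fixes only the maximum number L of sub-models.

```python
import random


def build_ensemble(history, total_size, max_models, subset_size,
                   train_sub_model, rng=None):
    """Return the storage set T holding max_models trained sub-models."""
    rng = rng or random.Random()

    # S201: draw the total training data set (Phi) from historical traffic.
    phi = rng.sample(history, min(total_size, len(history)))

    # S202: the maximum number L is max_models; the storage set T starts empty.
    storage = []

    # S203-S205: sample a subset (psi), train on it, and repeat until
    # the storage set holds L sub-models.
    while len(storage) < max_models:
        psi = rng.sample(phi, min(subset_size, len(phi)))
        storage.append(train_sub_model(psi))
    return storage
```

Any trainer that accepts a data subset and returns a fitted model can be passed in; where scikit-learn is available, `lambda psi: IsolationForest().fit(psi)` would be one option.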

4. The intrusion detection method based on incremental training according to claim 3 is characterized in that the process of using the integrated isolation forest detection model to perform binary classification detection on key feature data includes:

For each sub-isolation forest detection model in the integrated isolation forest detection model, use it to perform binary classification detection on the key feature data to obtain the detection result;

Hard voting is performed over all the detection results, and the detection result of the integrated isolation forest is determined by the outcome of the vote.
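
As a minimal sketch, hard voting over the sub-model results amounts to a majority count. The label strings and the tie-breaking rule are assumptions; the claim does not say how a tie is resolved.

```python
from collections import Counter


def hard_vote(sub_results):
    """Majority label among 'normal'/'abnormal' votes; ties go to 'abnormal'."""
    counts = Counter(sub_results)
    # Breaking ties toward "abnormal" is an assumption, not part of the claim.
    return "abnormal" if counts["abnormal"] >= counts["normal"] else "normal"
```

Using an odd number of sub-models avoids ties altogether.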

5. The intrusion detection method based on incremental training according to claim 4, characterized in that the integrated isolation forest detection model is provided with a total detection number counter that records the total number of network traffic data items detected, and each sub-isolation forest detection model in the integrated isolation forest detection model is provided with an abnormal number counter that records the number of network traffic data items it has detected incorrectly;

After each detection, for every sub-isolation forest detection model whose detection result differs from that of the integrated isolation forest, the abnormal number counter of that sub-model is incremented by one.
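
The two kinds of counters can be sketched as a small class. The class and method names are hypothetical, and the accuracy formula (one minus errors over total detections) is an assumption about how the counters are later combined.

```python
class DetectionCounters:
    """Total detection counter plus one abnormal number counter per sub-model."""

    def __init__(self, num_sub_models):
        self.total_detected = 0
        self.errors = [0] * num_sub_models

    def record(self, sub_results, ensemble_result):
        """Update the counters after one traffic record has been detected."""
        self.total_detected += 1
        for index, result in enumerate(sub_results):
            # A sub-model that disagrees with the ensemble is counted as wrong.
            if result != ensemble_result:
                self.errors[index] += 1

    def accuracy(self, index):
        """Share of detections on which sub-model `index` matched the ensemble."""
        if self.total_detected == 0:
            return 1.0  # nothing detected yet: no evidence of error (assumption)
        return 1.0 - self.errors[index] / self.total_detected
```

Note that "error" here means disagreement with the ensemble vote, not with a ground-truth label, which is what allows the counters to be maintained on unlabelled traffic.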

6. The intrusion detection method based on incremental training according to claim 4 is characterized in that, for each sub-isolation forest detection model, the method of using it to perform binary classification detection on key feature data includes:

For the key feature data, each data point x_i traverses every isolation tree (iTree) in the sub-isolation forest detection model; the average height of the data point in the forest is calculated, and the average heights of all data points are normalized. The outlier score S of a data point is calculated as:

S = 2^(-E(h(x))/c(n))

The function formulas are as follows:

c(n) = 2H(n-1) - 2(n-1)/n

H(i)≈ln(i)+0.5772156649

Where x represents the xth isolation tree; n is the total number of isolation trees; h(x) represents the number of edges that the current data passes through from the root node of the current isolation tree to the corresponding leaf node, that is, the path length of the data in the current isolation tree; c(n) is the normalization term for the average path length; H(i) is the harmonic number, which is estimated using Euler's constant; E(h(x)) is the average path length of the data over all isolation trees; the smaller the average path length of a data point, the closer its outlier score S is to 1 and the greater the probability that the data point is an anomaly;

The detection result obtained by the sub-isolation forest detection model through binary classification detection of key feature data refers to the data label obtained by the detection, which is normal or abnormal. This data label depends on the outlier score S.
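
The score computation can be illustrated numerically. This sketch assumes the normalization term c(n) = 2H(n-1) - 2(n-1)/n of the standard isolation forest formulation, takes n as a value supplied by the caller, and uses a hypothetical cut-off of 0.5 to turn the score into a label; the claim states only that the label depends on S.

```python
import math

EULER_GAMMA = 0.5772156649  # constant used in the claim's estimate of H(i)


def harmonic(i):
    """H(i) ~ ln(i) + Euler's constant, as given in the claim."""
    return math.log(i) + EULER_GAMMA


def normalizer(n):
    """c(n) = 2H(n-1) - 2(n-1)/n (standard isolation forest form, assumed)."""
    if n < 2:
        raise ValueError("n must be at least 2")
    return 2.0 * harmonic(n - 1) - 2.0 * (n - 1) / n


def outlier_score(path_lengths, n):
    """S = 2 ** (-E(h(x)) / c(n)); path_lengths holds h(x) for each tree."""
    mean_path = sum(path_lengths) / len(path_lengths)
    return 2.0 ** (-mean_path / normalizer(n))


def label(score, threshold=0.5):
    """Map S to a data label; the 0.5 cut-off is a hypothetical choice."""
    return "abnormal" if score > threshold else "normal"
```

A data point whose average path length equals c(n) scores exactly 0.5, shorter paths push the score toward 1, and longer paths push it toward 0, which matches the behaviour described in the claim.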

7. The intrusion detection method based on incremental training according to claim 4 is characterized in that, for the initial adaptive random forest algorithm model, its construction process includes:

S301. Collect historical abnormal traffic data generated by network attacks to construct an attack traffic data set;

S302. Use the attack traffic data set to train and generate an adaptive random forest algorithm model.

8. The intrusion detection method based on incremental training according to claim 4, characterized in that step S5 comprises the following steps:

S501. For each sub-isolation forest detection model of the integrated isolation forest detection model, calculate its detection accuracy from the number of incorrectly detected network traffic data items recorded by its abnormal number counter and the total number of detected network traffic data items recorded by the total detection number counter of the integrated isolation forest detection model; if its detection accuracy is lower than the accuracy threshold, mark it as a model to be replaced;

S502. Count the number k of models to be replaced;

S503. Randomly sample the first incremental training data set, use the sampling results to train k new sub-isolation forest detection models, and use the new sub-isolation forest detection models to replace the k models to be replaced in the integrated isolation forest detection model, thereby completing the progressive incremental update of the integrated isolation forest detection model.
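
Steps S501 to S503 can be sketched as follows. Accuracy is assumed to be one minus the ratio of recorded errors to total detections, and `train_sub_model` and `subset_size` are hypothetical parameters introduced for illustration.

```python
import random


def progressive_update(models, error_counts, total_detected, incremental_data,
                       accuracy_threshold, subset_size, train_sub_model,
                       rng=None):
    """Replace low-accuracy sub-models in place; return the number replaced."""
    rng = rng or random.Random()
    if total_detected == 0:
        return 0  # no detections recorded, so no sub-model can be judged

    # S501: accuracy per sub-model, assumed here to be 1 - errors / total.
    to_replace = [
        index for index, errors in enumerate(error_counts)
        if 1.0 - errors / total_detected < accuracy_threshold
    ]

    # S502: k is the number of models marked for replacement.
    k = len(to_replace)

    # S503: train k new sub-models on random samples of the first
    # incremental training data set and swap them in.
    for index in to_replace:
        sample = rng.sample(incremental_data,
                            min(subset_size, len(incremental_data)))
        models[index] = train_sub_model(sample)
    return k
```

Only the flagged sub-models are retrained, so the rest of the ensemble keeps what it has already learned. How the counters are reset after a replacement is not specified in the claim and is left to the caller here.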

9. The intrusion detection method based on incremental training according to claim 4 is characterized in that in step S6, the adaptive random forest model is updated through incremental learning and generation of new decision trees to adapt to the distribution and abnormal patterns of new data.

10. The intrusion detection method based on incremental training according to claim 1, characterized in that the method of updating the integrated isolation forest detection model and the adaptive random forest model also includes directly inputting new data into the model.