CN107682319B

CN107682319B - A method for data flow anomaly detection and multiple verification based on enhanced angle anomaly factor

Info

Publication number: CN107682319B
Application number: CN201710823063.0A
Authority: CN
Inventors: 首照宇; �田�浩; 邹风波; 张彤; 程夏威; 文辉; 赵晖; 莫建文; 汪延国; 曾情; 李希成
Original assignee: Guilin Yuhui Information Technology Co ltd; Guilin University of Electronic Technology
Current assignee: Guilin Yuhui Information Technology Co ltd; Guilin University of Electronic Technology
Priority date: 2017-09-13
Filing date: 2017-09-13
Publication date: 2020-07-03
Anticipated expiration: 2037-09-13
Also published as: CN107682319A

Abstract

The method is characterized by comprising the following steps of 1) processing a real-time data stream, 2) setting a data set S in a sliding window, 3) initializing parameters k, r and ξ, 4) obtaining a distance matrix dist, 5) obtaining a r neighborhood point set, and 6) obtaining an angle factor of the r neighborhood point set

And local density

7) Obtaining dissimilarity degrees; 8) acquiring a cluster center factor of each data point; 9) acquiring an attribution matrix; 10) determining a cluster center and clustering; 11) respectively carrying out anomaly detection on each clustered cluster; 12) and (5) performing multiple verification. The method applies sliding window and basic window technologies, constructs an efficient data stream processing model, reduces the occupancy rate of the memory, and has good real-time performance, high accuracy of abnormal detection and low time complexity.

Description

A Data Stream Anomaly Detection and Multiple Verification Based on Enhanced Angle Anomaly Factor Methods

技术领域technical field

本发明涉及数据流异常检测和数据聚类，尤其涉及一种基于增强型角度异常因子的数据流异常检测及多重验证的方法。The invention relates to data flow anomaly detection and data clustering, in particular to a method for data flow anomaly detection and multiple verification based on an enhanced angle anomaly factor.

背景技术Background technique

网络技术的飞速发展和社会信息化的不断提高，引发了信息量的爆炸式增长，使得各行各业产生了海量、高速、动态的流数据，如网络入侵监测、商业交易管理和分析、视频监控、传感网络监控等。由于动态数据流的实时无限等特点，传统的静态数据异常检测方法已不能准确有效的分析和处理如此大规模动态增长的流数据，因此构建一种适用于数据流的实时有效异常检测方法变得尤其重要。The rapid development of network technology and the continuous improvement of social informatization have led to the explosive growth of the amount of information, resulting in the generation of massive, high-speed and dynamic streaming data in all walks of life, such as network intrusion monitoring, business transaction management and analysis, and video surveillance. , sensor network monitoring, etc. Due to the infinite real-time characteristics of dynamic data streams, traditional static data anomaly detection methods can no longer accurately and effectively analyze and process such large-scale and dynamically growing streaming data. Therefore, building a real-time effective anomaly detection method suitable for data streams becomes Especially important.

针对不同阶段面临的实际问题，科技工作者提出了不同的数据流异常检测方法。现有的数据流异常检测方法大致可分为基于密度的数据流异常检测方法、基于角度的数据流异常检测方法、基于聚类的数据流异常检测方法。基于密度的数据流异常检测方法运用密度作为最基本的异常度量方式，并构造出能动态更新的用于度量数据异常程度的异常因子，Pokrajac等人将静态数据异常检测方法LOF引用到数据流中，研究出一种能运用到动态数据流中的增量式局部异常检测方法INCLOF，随着新数据的插入，INCLOF删除历史数据并动态更新各数据点的异常因子；Ke Gao等人对INCLOF方法进行了改进，引入滑动窗口的思想，提出了n-INCLOF方法，n-INCLOF方法仅更新当前时刻滑动窗口内各数据对象的异常因子；在有些情况下，有的数据点在某一时刻表现为异常，但在下一时刻并不表现异常，基于这个问题，Karimian S H等人提出了I-IncLOF方法，I-IncLOF方法引入了多重验证的思想，I-IncLOF方法将那些在窗口的整个滑动过程中始终表现为异常的数据对象判定为确定异常点，I-IncLOF方法大大降低了误判率，但I-IncLOF方法在多维情况下有效性较差；XinjieLu等人提出了INCLOCI方法，INCLOCI引入多粒度异常因子MDEF，INCLOCI方法不仅能检测出分散的异常点，而且能检测出异常簇。为了解决距离、密度等相似度度量方式在高维数据空间中有效性降低的问题，一些科研工作者提出了角度的度量方式，角度的相似度度量的基本思想就是异常点与其它点所成的角度普遍偏小且波动范围小、常规点与其它点所成的角度有大有小且波动范围大，HP Kriegel等人提出了基于角度的异常检测方法ABOD，ABOD方法将角度的方差作为度量数据点异常程度的异常因子ABOF，ABOD方法在高维空间中仍然有很高的检测准确率；Ye H提出了基于角度的数据流异常检测方法DSABOD，随着数据流数据点不断流入内存，DSABOD方法动态更新每个数据点相对于其邻域点的异常因子，DSABOD方法为高维数据流中的异常检测提出了一种新的思路，但传统的基于角度的数据流异常检测方法存在异常检测率较低的问题。基于聚类的数据流异常检测方法分为对数据点进行聚类和对每个簇中数据点进行异常检测这两个阶段，Elahi M等人提出了一种基于聚类的数据流异常检测方法，将K-Means和LOF相结合的方法，该方法分区域定义异常因子，提高了方法对异常检测的准确率；Thakran Y等人又提出了将DBSCAN方法与W-K-Means方法相结合的方法，该方法对当前时刻的数据块运用DBSCAN方法进行聚类并得到候选异常点和初始簇，该方法结合前一时刻得到的待多重验证的候选异常点,运用W-K-Means方法再次进行聚类，得到当前时刻的候选异常点和常规点簇，同时该方法对候选异常点采用多重验证删除误判的异常点释放内存，该方法在整个过程中动态调整DBSCAN方法所需要的参量MinPts、Epsilon、W-K-Means方法的属性权重，该方法对异常检测的准确度较高，但需要的人为设定参量过多，人工干预严重，方法的复杂度较高，且该方法在多维空间中的有效性较差。In view of the practical problems faced at different stages, scientific and technological workers have proposed different data flow anomaly detection methods. Existing data stream anomaly detection methods can be roughly divided into density-based data stream anomaly detection methods, angle-based data stream anomaly detection methods, and cluster-based data stream anomaly detection methods. The density-based data stream anomaly detection method uses density as the most basic anomaly measurement method, and constructs an anomaly factor that can be dynamically updated to measure the degree of data anomaly. Pokrajac et al. introduced the static data anomaly detection method LOF to the data stream. , developed an incremental local anomaly detection method INCLOF that can be applied to dynamic data streams. With the insertion of new data, INCLOF deletes historical data and dynamically updates the anomaly factor of each data point; Ke Gao et al. Improvements are made, the idea of sliding window is introduced, and the n-INCLOF method is proposed. The n-INCLOF method only updates the abnormal factor of each data object in the sliding window at the current moment; in some cases, some data points appear as Abnormal, but does not behave abnormally at the next moment. Based on this problem, Karimian S H et al. proposed the I-IncLOF method. The I-IncLOF method introduced the idea of multiple verifications. The I-IncLOF method will be in the whole sliding process of the window The data objects that always appear abnormal are determined to be abnormal points. The I-IncLOF method greatly reduces the false positive rate, but the I-IncLOF method is less effective in multi-dimensional situations; XinjieLu et al. proposed the INCLOCI method, and INCLOCI introduced multi-granularity The anomaly factor MDEF, INCLOCI method can not only detect scattered outliers, but also detect outlier clusters. In order to solve the problem that the similarity measurement methods such as distance and density are less effective in high-dimensional data space, some researchers have proposed angle measurement methods. The basic idea of angle similarity measurement is that abnormal points and other points are formed. The angle is generally small and the fluctuation range is small. The angle formed by the conventional point and other points is large or small and the fluctuation range is large. HP Kriegel et al. proposed an angle-based anomaly detection method ABOD. The ABOD method uses the variance of the angle as the measurement data. The anomaly factor ABOF of the point anomaly degree, the ABOD method still has a high detection accuracy in high-dimensional space; Ye H proposed an angle-based data flow anomaly detection method DSABOD, as the data flow data points continue to flow into the memory, the DSABOD method Dynamically update the anomaly factor of each data point relative to its neighbor points. The DSABOD method proposes a new idea for anomaly detection in high-dimensional data streams, but the traditional angle-based data stream anomaly detection methods have anomaly detection rates. lower problem. Clustering-based data flow anomaly detection method is divided into two stages: clustering data points and anomaly detection of data points in each cluster. Elahi M et al. proposed a clustering-based data flow anomaly detection method. , the method of combining K-Means and LOF, this method defines anomaly factors in different regions, which improves the accuracy of the method for anomaly detection; Thakran Y et al. proposed a method that combines the DBSCAN method with the W-K-Means method, This method uses the DBSCAN method to cluster the data blocks at the current moment and obtains candidate outliers and initial clusters. The method combines the candidate outliers obtained at the previous moment to be multi-validated, and uses the W-K-Means method to cluster again, and obtains The candidate outliers and regular point clusters at the current moment, at the same time, the method uses multiple verifications for the candidate outliers to delete the misjudged outliers to release the memory, and the method dynamically adjusts the parameters MinPts, Epsilon, W-K- The attribute weight of the Means method. This method has high accuracy for anomaly detection, but it requires too many artificial parameters, serious manual intervention, high complexity of the method, and poor effectiveness in multi-dimensional space. .

数据流异常检测是现今数据挖掘领域中的研究热点和难点，主要目标是从动态变化的复杂数据环境中实时准确的检测出与常规模式不相符的信息。Data stream anomaly detection is a research hotspot and a difficult point in the field of data mining.

发明内容SUMMARY OF THE INVENTION

针对传统方法的时间复杂度高、内存占用大且使用效率低、人为参数干预过多以及多维数据环境下有效性低等问题，本发明提供了一种基于增强型角度异常因子的数据流异常检测及多重验证的方法。这种方法能降低内存的占用率，实时性好、异常检测准确率高、时间复杂度低。Aiming at the problems of high time complexity, large memory occupation, low use efficiency, excessive human parameter intervention, and low effectiveness in multi-dimensional data environment of traditional methods, the present invention provides a data flow anomaly detection based on an enhanced angle anomaly factor and multiple authentication methods. This method can reduce the occupancy rate of memory, has good real-time performance, high anomaly detection accuracy, and low time complexity.

实现本发明目的的技术方案是：The technical scheme that realizes the object of the present invention is:

一种基于增强型角度异常因子的数据流异常检测及多重验证的方法，包括如下步骤：A method for data flow anomaly detection and multiple verification based on an enhanced angle anomaly factor, comprising the following steps:

1)对实时数据流进行处理：对数据采集终端采集的各式各样的实时数据流，按照构成数据的最小单位和采集时的时间顺序，将一个时间段获得的数据组成数据块，将几个数据块构成滑动窗口处理的数据集S，为步骤2)处理做准备；1) Process the real-time data stream: For various real-time data streams collected by the data collection terminal, according to the smallest unit of data and the time sequence of collection, the data obtained in a period of time is formed into data blocks, and several The data blocks constitute the data set S processed by the sliding window, and prepare for the processing in step 2);

2)对所述步骤1)处理后得到的结果进行设置得到当前滑动窗口中数据集S，设S＝{X₁,X₂,...,X_n}，其中包含n个数据点，每个数据点根据其属性表示为

其中

表示数据点x_i有d个属性，用于后续聚类和异常检测；2) Set the result obtained after the processing in step 1) to obtain the data set S in the current sliding window, set S={X ₁ , X ₂ ,..., X _n }, which contains n data points, each data points are represented according to their properties as

in

Indicates that the data point x _i has d attributes for subsequent clustering and anomaly detection;

3)初始化参数k、r、ξ：设定初始化参数，其中k表示数据点的k个最近邻点的个数，r为数据点的空间邻域半径，ξ为异常判决阈值调整系数，异常判决阈值θ＝μ+ξ·δ，其中μ和δ对应所有数据点增强型角度异常因子的均值及标准差；3) Initialization parameters k, r, ξ: set the initialization parameters, where k represents the number of k nearest neighbors of the data point, r is the radius of the spatial neighborhood of the data point, ξ is the adjustment coefficient of abnormal judgment threshold, abnormal judgment Threshold θ=μ+ξ·δ, where μ and δ correspond to the mean and standard deviation of the enhanced angle anomaly factor of all data points;

4)获取距离矩阵dist：结合步骤2)中的数据集S，计算所有数据点之间的距离，得到n×n的距离矩阵dist，dist＝[d_ij]_n×n，计算公式为公式(1)：4) Obtain the distance matrix dist: Combine the data set S in step 2), calculate the distance between all data points, and obtain the distance matrix dist of n×n, dist=[d _ij ] _n×n , the calculation formula is the formula ( 1):

其中X_i和Y_j都是集合S中的数据点，并且x_ik表示数据点x_i有k个属性，y_jk表示数据点y_j有k个属性；where X _i and Y _j are both data points in the set S, and x _ik indicates that the data point x _i has k attributes, and y _jk indicates that the data point y _j has k attributes;

5)得到r邻域点集合：根据空间邻域半径r，得到每个数据点的r邻域点集合即该点处以邻域半径r为半径，圈住的所有数据点的集合；5) Obtain the r neighborhood point set: according to the spatial neighborhood radius r, obtain the r neighborhood point set of each data point, that is, the set of all data points circled at the point with the neighborhood radius r as the radius;

6)得到r邻域点集合的角度因子

和局部密度

结合距离矩阵dist得到r邻域点集合的角度因子

和r邻域点集合的局部密度

其中N_r数据点x_i的r邻域；6) Get the angle factor of the r neighborhood point set

and local density

Combine the distance matrix dist to get the angle factor of the r neighborhood point set

and r the local density of the set of neighbor points

where N _r the r neighborhood of data point x _i ;

7)获取相异度δ(x_i)：根据步骤6)中得到的r邻域点集合的局部密度

进行排序后，计算对应的相异度δ(x_i)；7) Obtain the dissimilarity δ(x _i ): according to the local density of the r neighborhood point set obtained in step 6)

After sorting, calculate the corresponding dissimilarity δ(x _i );

8)获取各数据点的簇心因子τ(x_i)：结合步骤6)和步骤7)获取簇心因子τ(x_i),计算公式为公式(5)：

簇心因子τ(x_i)用来度量数据点处于簇心的程度；8) Obtain the cluster center factor τ(x _i ) of each data point: Combine step 6) and step 7) to obtain the cluster center factor τ(x _i ), and the calculation formula is formula (5):

The cluster center factor τ(x _i ) is used to measure the degree to which the data points are located in the cluster center;

9)获取归属矩阵：对步骤8)中得到的所有数据点簇心因子，进行降序排序得τ(p₁)≥τ(p₂)≥…≥τ(p_n)，其中p_n表示对应数据点的原始序号，从而得到用于聚类的归属矩阵F＝[f₁,f₂,...,f_n]；9) Obtain the attribution matrix: sort all the data point cluster center factors obtained in step 8) in descending order to obtain τ(p ₁ )≥τ(p ₂ )≥...≥τ(p _n ), where _pn represents the corresponding data The original sequence number of the point, so as to obtain the attribution matrix F=[f ₁ , f ₂ ,..., f _n ] for clustering;

10)确定簇心并聚类：利用簇心因子和归属矩阵对数据集S进行簇心确定及聚类，将所有类标号相同的数据点组成集合，即簇，得到m＝C_{center_id}个簇C₁,C₂,...,C_m，其中C_{center_id}为簇心序号，完成对数据集S的聚类；10) Determine the cluster center and cluster: use the cluster center factor and the attribution matrix to determine and cluster the data set S, and form a set of all data points with the same class label, that is, a cluster, and obtain m=C _{center_id} clusters C ₁ ,C ₂ ,...,C _m , where C _{center_id} is the cluster center number, and the clustering of the data set S is completed;

11)对聚类后的各簇分别进行异常检测：步骤10)中得到各个簇C_i(i＝1,2,…,m)，先对聚类后的数据集S中的各个簇C₁,C₂,...,C_m，分别进行异常检测，得到各簇异常点集合O_i，最终得到数据集S中所有异常点集合O＝{O₁,...,O_m}，异常检测所涉及的公式有：簇内角度因子

为公式(7)：11) Perform anomaly detection on each clustered cluster: in step 10), each cluster C _i (i=1, 2, ..., m) is obtained, first, each cluster C ₁ in the clustered data set S is ,C ₂ ,...,C _m , carry out anomaly detection respectively, obtain each cluster outlier set O _i , and finally obtain all outlier sets in the data set S O={O ₁ ,...,O _m }, anomaly The formulas involved in the detection are: In-cluster angle factor

is formula (7):

其中A、B、C是数据集中的数据点，

阿表示第个i簇且簇中每个数据点均有d维属性，

是数据点q邻域内的数据点个数，where A, B, C are the data points in the dataset,

A represents the i-th cluster and each data point in the cluster has a d-dimensional attribute,

is the number of data points in the neighborhood of data point q,

局部增量值H(X_j)为公式(8)：

The local increment value H(X _j ) is formula (8):

k个最近邻点的距离和L(X_j)为公式(9)：The distance sum L(X _j ) of the k nearest neighbors is formula (9):

其中，

表示数据点X_j在其所属簇中的k个最近邻点组成的k邻域；in,

Represents the k neighborhood of the k nearest neighbors of the data point X _j in the cluster to which it belongs;

增强型角度异常因子EAOF(X_j)为公式(10)：The enhanced angular anomaly factor EAOF(X _j ) is formula (10):

其中，o是数据点X_j所属簇的簇心，dist(o,X_j)为数据点X_j与其簇心的距离，

表示簇C_i(i＝1,2,…,m)内每个数据点相对于该簇的角度因子，H(X_j)是局部增量值；Among them, o is the cluster center of the cluster to which the data point X _j belongs, dist(o, X _j ) is the distance between the data point X _j and its cluster center,

represents the angle factor of each data point in the cluster C _i (i=1,2,...,m) relative to the cluster, H(X _j ) is the local increment value;

12)多重验证：对所有候选异常点进行多次验证，并将经过有限次验证仍表现为异常的候选异常点判决为确定异常点并输出保存，若在验证过程中表现为正常点，则将该异常点直接丢弃。12) Multiple verification: Perform multiple verifications on all candidate abnormal points, and judge the candidate abnormal points that are still abnormal after limited verification as the determined abnormal points and output and save them. If they are normal points during the verification process, they will be The outlier is directly discarded.

步骤1)中所述的处理是指数据采集终端采集的数据以流的形式缓存，并将缓存的数据划分成E₀,E₁,E₂,......数据块，每一个数据块代表一个基础窗口，每个滑动窗口W包含ε＝2个基本窗口，采用基础窗口和滑动窗口相结合，实现数据的插入和删除，基础窗口和滑动窗口相结合的过程为：在T_i时刻过渡到T_i+1时刻，滑动窗口由W_i滑到W_i+1，伴随着新基础窗口E_i+1的并入和历史基础窗口E_i-1的移除，同时，将T_i时刻W_i检测的候选异常点并入到W_i+1中进行多重验证。The processing described in step 1) means that the data collected by the data collection terminal is buffered in the form of a stream, and the buffered data is divided into E ₀ , E ₁ , E ₂ , . . . data blocks, each data block The block represents a basic window, and each sliding window W contains ε=2 basic windows. The combination of the basic window and the sliding window is used to realize the insertion and deletion of data. The process of combining the basic window and the sliding window is: At time T _i Transitioning to time T _i+1 , the sliding window is slid from Wi to Wi ₊₁ , along with the merging of the new base window E _i ₊₁ and the removal of the historical base window E _i-1 , at the same time, the time T _i is Candidate _outliers detected by Wi are incorporated into Wi ₊₁ for multiple validation.

步骤6)中所述的r邻域点集合的角度因子计算公式为公式(2)：The calculation formula of the angle factor of the r neighborhood point set described in step 6) is formula (2):

步骤6)中所述的r邻域点集合的局部密度计算公式为公式(3)：The formula for calculating the local density of the r neighborhood point set described in step 6) is formula (3):

其中N_r(p)是数据点p的r邻域，q是数据点p的r邻域集合中的任意一个数据点，局部密度与邻域数据点数以及所处的位置有关，邻域数据点数越多、越处于数据集的中心，局部密度就越大。where N _r (p) is the r neighborhood of the data point p, q is any data point in the r neighborhood set of the data point p, the local density is related to the number of neighborhood data points and their location, the number of neighborhood data points The more and the more in the center of the dataset, the greater the local density.

步骤7)中所述的相异度δ(x_i)是对所有数据点的局部密度进行降序排序后得到，相异度δ(x_i)的计算公式为公式(4)：The dissimilarity δ(x _i ) described in step 7) is obtained by sorting the local densities of all data points in descending order, and the calculation formula of the dissimilarity δ(x _i ) is formula (4):

其中p_i和p_j是对应数据点的序号，当i＝1时，j≥2；当i≥2时，j＜i。Where p _i and p _j are the serial numbers of the corresponding data points, when i=1, j≥2; when i≥2, j<i.

步骤9)中所述的归属矩阵F＝[f₁,f₂,...,f_n]，是用于记录数据点之间归属关系，各元素的表示公式为公式(6)：The attribution matrix F=[f ₁ , _f ₂ , .

其中，{p_i}表示簇心因子τ(x_i)降序排序后的原始下标序号。Among them, {pi _} represents the original index number after the cluster center factor τ(x _i ) is sorted in descending order.

这种数据流异常检测方法，分为2个过程，即数据流处理过程和数据流异常检测过程。数据流处理过程是将动态的数据流转化为静态数据块，方便后续异常检测，同时又要保证整个检测的实时性和高效性；数据流异常检测过程是对经数据流处理过程处理好之后的静态数据集进行异常检测，为了提高异常检测的准确率，采用了先聚类后异常检测的方法。在本技术方案中，滑动窗口和基础窗口结合的实时数据流处理方法，是数据流处理过程的核心，其在减少内存占用率的同时，提高了后续异常检测的质量，簇心因子和归属矩阵是本技术方案中新引进用于确定簇心和聚类的两个参数，能快速有效的确定多维数据空间的簇心，并根据确定的簇心准确聚类；增强型角度异常因子是本技术方案中另一个重要的参数，其弥补了传统异常因子的部分缺陷，保留了角度度量方式在多维空间中的有效性，是异常检测部分的核心。This data stream anomaly detection method is divided into two processes, namely, a data stream processing process and a data stream anomaly detection process. The data stream processing process is to convert the dynamic data stream into static data blocks, which is convenient for subsequent abnormal detection, and at the same time ensures the real-time and high efficiency of the whole detection; the data stream abnormal detection process is processed after the data stream processing process. Anomaly detection is performed on static data sets. In order to improve the accuracy of anomaly detection, a method of clustering first and then anomaly detection is adopted. In this technical solution, the real-time data stream processing method combining the sliding window and the basic window is the core of the data stream processing process, which not only reduces the memory occupancy rate, but also improves the quality of subsequent anomaly detection, cluster center factor and attribution matrix. It is two parameters newly introduced in this technical solution for determining the cluster center and clustering, which can quickly and effectively determine the cluster center of the multi-dimensional data space, and accurately cluster according to the determined cluster center; the enhanced angle anomaly factor is the technology of this technology. Another important parameter in the scheme, which makes up for some of the defects of the traditional anomaly factor, retains the validity of the angle measurement method in the multi-dimensional space, and is the core of the anomaly detection part.

这种方法运用了滑动窗口和基本窗口技术，构造了高效的数据流处理模型，降低了内存的占用率、实时性好、异常检测准确率高、时间复杂度低。This method uses sliding window and basic window technology to construct an efficient data stream processing model, which reduces the memory usage rate, has good real-time performance, high anomaly detection accuracy, and low time complexity.

附图说明Description of drawings

图1为实施例中方法流程示意图；1 is a schematic flowchart of a method in an embodiment;

图2为实施例中t₁时刻滑动窗口中数据点分布图的示意图；Fig. 2 is the schematic diagram of the distribution map of data points in the sliding window at time _t1 in the embodiment;

图3为实施例中t₂时刻滑动窗口中数据点分布图的示意图；Fig. 3 is the schematic diagram of the distribution map of data points in the sliding window at time t ₂ in the embodiment;

图4为实施例中滑动窗口和基础窗口结合处理实时数据流及多重验证过程示意图；4 is a schematic diagram of a sliding window and a base window combined to process real-time data streams and multiple verification processes in the embodiment;

图5为实施例中数据点角度度量示意图；Fig. 5 is a schematic diagram of data point angle measurement in an embodiment;

图6为实施例中U型簇数据基于传统角度度量方法误判数据点分布示意图；6 is a schematic diagram of the misjudged data point distribution of U-shaped cluster data based on a traditional angle measurement method in an embodiment;

图7为实施例中多簇数据基于传统角度度量方法误判数据点分布示意图；7 is a schematic diagram of the distribution of misjudged data points based on traditional angle measurement methods for multi-cluster data in an embodiment;

图8为实施例中数据集原始坐标分布示意图；8 is a schematic diagram of the original coordinate distribution of the dataset in the embodiment;

图9为实施例中局部密度-相异度分布示意图；9 is a schematic diagram of local density-dissimilarity distribution in an embodiment;

图10为实施例中簇心因子排序分布示意图；10 is a schematic diagram of the cluster center factor ranking distribution in the embodiment;

图11a为实施例中数据集1分布示意图；11a is a schematic diagram of the distribution of dataset 1 in the embodiment;

图11b为实施例中数据集1异常点分布示意图；11b is a schematic diagram of the distribution of abnormal points in dataset 1 in the embodiment;

图11c为实施例中数据集1异常检测检测出的异常点标识示意图；11c is a schematic diagram of anomalous point identification detected by anomaly detection of dataset 1 in the embodiment;

图11d为实施例中数据集1异常检测将正常点误检为异常点标识示意图；11d is a schematic diagram of anomaly detection of data set 1 in an embodiment where a normal point is mistakenly detected as an abnormal point identification;

图12a为实施例中数据集2分布示意图；Figure 12a is a schematic diagram of the distribution of dataset 2 in the embodiment;

图12b为实施例中数据集2实际异常分布示意图；Figure 12b is a schematic diagram of the actual abnormal distribution of data set 2 in the embodiment;

图12c为实施例中数据集2异常检测检测出的异常点标识示意图；12c is a schematic diagram of anomalous point identification detected by anomaly detection of dataset 2 in the embodiment;

图12d为实施例中数据集2异常检测将异常点检测为正常点标识示意图。FIG. 12d is a schematic diagram showing the identification of abnormal points detected as normal points by abnormality detection of dataset 2 in the embodiment.

具体实施方式Detailed ways

下面结合附图和实施例对本发明内容作进一步的阐述，但不是对本发明的限定。The content of the present invention will be further described below with reference to the accompanying drawings and embodiments, but it is not intended to limit the present invention.

参照图1，一种基于增强型角度异常因子的数据流异常检测及多重验证的方法，包括如下步骤：Referring to FIG. 1, a method for abnormality detection and multiple verification of data flow based on an enhanced angle abnormality factor, comprising the following steps:

1)对实时数据流进行处理：对数据采集终端采集的各式各样的实时数据流进行处理，实时数据流具有动态多变特征，有的数据对象在当前滑动窗口中表现为异常，但在下一时刻的滑动窗口中却表现为正常点，如图2和图3所示，图2为t₁时刻滑动窗口数据点的分布图，此时的P′点表现为异常，但随着数据点持续地流入，P′点周围积累了越来越多的数据点，图3为t₂时刻滑动窗口数据点的分布图，可以看出此时P′点表现为正常；1) Process real-time data streams: Process various real-time data streams collected by the data acquisition terminal. The real-time data streams have dynamic and changeable characteristics. Some data objects appear abnormal in the current sliding window, but in the following The sliding window at a moment appears to be a normal point, as shown in Figure 2 and Figure 3, Figure 2 is the distribution diagram of the sliding window data points at time _t1 , the P' point at this time is abnormal, but with the data points Continuous inflow, more and more data points are accumulated around the P' point. Figure 3 is the distribution diagram of the sliding window data points at time t _2. It can be seen that the P' point is normal at this time;

2)设置滑动窗口中数据集S：步骤1)处理得到当前滑动窗口中数据集S：设S＝{X₁,X₂,...,X_n}，其中包含n个数据点，每个数据点根据其属性表示为

用于后续聚类和异常检测；2) Set the data set S in the sliding window: Step 1) Process to obtain the data set S in the current sliding window: set S = {X ₁ , X ₂ ,..., X _n }, which contains n data points, each Data points are represented according to their properties as

For subsequent clustering and anomaly detection;

3)初始化参数k、r、ξ：设定初始化参数，其中k表示数据点的k个最近邻个数，r为数据点的空间邻域半径，ξ为异常判决阈值调整系数，异常判决阈值θ＝μ+ξ·δ，其中μ和δ对应所有数据点增强型角度异常因子的均值及标准差；3) Initialization parameters k, r, ξ: set the initialization parameters, where k represents the number of k nearest neighbors of the data point, r is the radius of the spatial neighborhood of the data point, ξ is the adjustment coefficient of the abnormal judgment threshold, and the abnormal judgment threshold θ =μ+ξ·δ, where μ and δ correspond to the mean and standard deviation of the enhanced angle anomaly factor of all data points;

其中X_i和Y_j都是集合S中的数据点，并且x_ik表示数据点x_i有k个属性，y_jk表示数据点y_j有k个属性；

where X _i and Y _j are both data points in the set S, and x _ik indicates that the data point x _i has k attributes, and y _jk indicates that the data point y _j has k attributes;

6)得到r邻域点集合的角度因子

和局部密度

结合距离矩阵dist得到r邻域点集合的角度因子

和r邻域点集合的局部密度

其中N_r数据点x_i的r邻域；如图5所示，是基于角度的度量思想，计算该数据点与其它每一对数据点所成的角度，再取方差，对于核心区域点A₁，它与其它点对所成的角度变化范围很大，故方差大；对于异常点A₃，它与其它点对所成的角度变化范围很小，故方差小；而对于边界点A₂，它与其它点对所成的角度变化范围介于A₁和A₃变化范围之间，故方差也介于核心区域点与异常点之间，但这样存在一些缺陷，如图6和图7所示，图6中的异常点B₁位于U形簇的中心，它与周围点对所成的角度变化范围大，即方差较大，而边缘点B₂与其它点对所成的角度变化范围小，即方差较小；同样的，图7中的异常点D₁位于两个簇的中间，该点与两簇间的点对所成的角度变化范围大，而边缘点D₂与其它点对所成的角度变化范围较小；这将使得到的结果与实际结果刚好相反，出现漏、误判；6) Get the angle factor of the r neighborhood point set

and local density

and r the local density of the set of neighbor points

Among them, N _r is the r neighborhood of the data point _xi ; as shown in Figure 5, it is based on the angle measurement idea, calculate the angle formed by the data point and each other pair of data points, and then take the variance, for the core area point A ₁ , the angle formed by it and other point pairs has a large variation range, so the variance is large; for the abnormal point A ₃ , the angle formed by it and other point pairs has a small variation range, so the variance is small; and for the boundary point A ₂ , the variation range of the angle formed by it and other point pairs is between the variation range of A ₁ and A ₃ , so the variance is also between the core area point and the abnormal point, but there are some defects, as shown in Figure 6 and Figure 7 As shown, the abnormal point B1 in Figure ₆ is located in the center of the U-shaped cluster, and the angle formed by it and the surrounding point pairs has a large variation range, that is, the variance is large, while the angle formed by the edge point _B2 and other point pairs varies The range is small, that is, the variance is small; similarly, the abnormal point D ₁ in Figure 7 is located in the middle of the two clusters, and the angle formed by this point and the point pair between the two clusters has a large variation range, while the edge point D ₂ and other The variation range of the angle formed by the point pair is small; this will make the obtained result just opposite to the actual result, and there will be leakage and misjudgment;

After sorting, calculate the corresponding dissimilarity δ(x _i );

8)获取各数据点的簇心因子τ(x_i)：结合步骤6)和步骤7)获取簇心因子τ(x_i)，计算公式为公式(5)：

簇心因子τ(x_i)用来度量数据点处于簇心的程度，簇心因子是实施例方法中，用于快速有效确定多维数据空间簇心的改进参数因子，是聚类中至关重要的一步，实现过程如图8、图9、图10所示，可以看出该数据集由两个簇构成，其中点13和点25分别为两个簇的簇心；图9为通过公式(3)和公式(4)所得到的该数据集中各点的ρ-δ(局部密度-相异度)分布示意图，可以看出点13和点25的局部密度和相异度都较大；图10为通过公式(5)所得到的各点的簇心因子降序排序后的分布示意图，可以看出点13和点25的簇心因子最大，因此最有可能为簇心；8) Obtain the cluster center factor τ(x _i ) of each data point: Combine step 6) and step 7) to obtain the cluster center factor τ(x _i ), and the calculation formula is formula (5):

The cluster center factor τ(x _i ) is used to measure the degree to which the data points are located in the cluster center. The cluster center factor is an improved parameter factor used to quickly and effectively determine the cluster center in the multidimensional data space in the embodiment method, and is an important factor in clustering. The implementation process is shown in Figure 8, Figure 9, and Figure 10. It can be seen that the data set is composed of two clusters, in which point 13 and point 25 are the cluster centers of the two clusters; 3) The schematic diagram of the ρ-δ (local density-dissimilarity) distribution of each point in the data set obtained by formula (4), it can be seen that the local density and dissimilarity of point 13 and point 25 are large; Fig. 10 is a schematic diagram of the distribution of the cluster centroid factors of each point obtained by formula (5) in descending order, it can be seen that the cluster centroid factor of point 13 and point 25 is the largest, so it is most likely to be the cluster centroid;

10)确定簇心并聚类：利用簇心因子和归属矩阵对数据集S进行簇心确定及聚类，将所有类标号相同的数据点组成集合，即簇，得到m(m＝C_{center_id})个簇C₁,C₂,...,C_m，完成对数据集S的聚类；10) Determine the cluster center and cluster: use the cluster center factor and the attribution matrix to determine and cluster the data set S, and form a set of all data points with the same class label, that is, a cluster, to obtain m (m=C _{center_id} ) cluster C ₁ , C ₂ ,...,C _m to complete the clustering of the dataset S;

is formula (7):

其中A、B、C是数据集中的数据点，

阿表示第个i簇且簇中每个数据点均有d维属性，

是数据点q邻域内的数据点个数，局部增量值H(X_j)为公式(8)：

where A, B, C are the data points in the dataset,

is the number of data points in the neighborhood of data point q, and the local increment value H(X _j ) is formula (8):

其中，

表示数据点X_j在其所属簇中的k个最近邻点组成的k邻域；in,

表示C_i(i＝1,2,…,m)簇内每个数据点相对于该簇的角度因子，H(X_j)是局部增量值；Among them, o is the cluster center of the cluster to which the data point X _j belongs, dist(o, X _j ) is the distance between the data point X _j and its cluster center,

Represents the angle factor of each data point in the cluster C _i (i=1,2,...,m) relative to the cluster, and H(X _j ) is the local increment value;

12)多重验证：对所有候选异常点进行多次验证，并将经过有限次验证仍表现为异常的候选异常点判决为确定异常点并输出保存，若在验证过程中表现为正常点，则将该异常点直接丢弃，这样做能增加异常检测准确率的效果。12) Multiple verification: Perform multiple verifications on all candidate abnormal points, and judge the candidate abnormal points that are still abnormal after limited verification as the determined abnormal points and output and save them. If they are normal points during the verification process, they will be The anomaly point is directly discarded, which can increase the effect of anomaly detection accuracy.

步骤1)中所述的处理是指数据采集终端采集的数据以流的形式缓存，并将缓存的数据划分成E₀,E₁,E₂,......数据块，每一个数据块代表一个基础窗口，每个滑动窗口W包含ε＝2个基本窗口，采用基础窗口和滑动窗口相结合，实现数据的插入和删除，基础窗口和滑动窗口相结合的过程如图4所示：在T_i时刻过渡到T_i+1时刻，滑动窗口由W_i滑到W_i+1，伴随着新基础窗口E_i+1的并入和历史基础窗口E_i-1的移除，同时，将T_i时刻W_i检测的候选异常点并入到W_i+1中进行多重验证。The processing described in step 1) means that the data collected by the data collection terminal is buffered in the form of a stream, and the buffered data is divided into E ₀ , E ₁ , E ₂ , . . . data blocks, each data block The block represents a basic window, and each sliding window W contains ε=2 basic windows. The combination of the basic window and the sliding window is used to realize the insertion and deletion of data. The process of combining the basic window and the sliding window is shown in Figure 4: At the transition from time T _i to time T _i ₊₁ , the sliding window is slid from Wi to Wi ₊₁ , along with the merging of the new base window E _i+1 and the removal of the historical base window E _i-1 , at the same time, The candidate outliers detected by Wi at time T _i are incorporated into Wi ₊₁ _for multiple verification.

其中N_r(p)是数据点p的r邻域，q是数据点p的r邻域集合中的任意一个数据点，局部密度与邻域数据点数以及所处的位置有关，邻域数据点数越多、越处于数据集的中心，局部密度就越大。

where N _r (p) is the r neighborhood of the data point p, q is any data point in the r neighborhood set of the data point p, the local density is related to the number of neighborhood data points and their location, the number of neighborhood data points The more and the more in the center of the dataset, the greater the local density.

步骤7)中所述的相异度δ(x_i)是对所有数据点的局部密度进行降序排序后得到，相异度δ(x_i)的计算公式为公式(4)：相异度是数据点间属于不同簇的可能性大小的度量，由给定的数据集S，根据步骤6)得到的局部密度，经降序排序后得到

其中，{p_i}表示局部密度

的一个降序后原始下标序号，d(p_i,p_j)表示数据点p_i与p_j间的欧氏距离，则某一数据点p_i的相异度可定义如下：The dissimilarity δ(x _i ) described in step 7) is obtained by sorting the local densities of all data points in descending order, and the calculation formula of the dissimilarity δ(x _i ) is formula (4): the dissimilarity is A measure of the likelihood that the data points belong to different clusters, from the given data set S, according to the local density obtained in step 6), obtained after sorting in descending order

where {pi _} represents the local density

A descending original index number of , d(pi , p _j ) represents the Euclidean distance between data points p _i and p _j _{, then the dissimilarity of a data point p i} _can be defined as follows:

其中p_i和p_j是对应数据点的序号，当i＝1时，j≥2；当i≥2时，j＜i。

Where p _i and p _j are the serial numbers of the corresponding data points, when i=1, j≥2; when i≥2, j<i.

步骤10)中所述的确定簇心并聚类是指先定义簇心序号为C_{center_id}，数据点的类标号为C_{cluster_label}，并初始化簇心序号为1，即C_{center_id}＝1；同时将步骤8)中获取的簇心因子最大的数据点的类标号为1，即

再根据步骤8)中得到的降序下标序号{p_i}，对整个数据集S进行条件遍历，若

与

所有点的距离均满足

(其中，r是初始参数值邻域半径)，则将该点重新定义为一个新的簇心，该点类标号增加1，据此得到所有的簇心；然后根据得到的簇心，再利用步骤9)中的归属矩阵F＝[f₁,f₂,...,f_n]，对隶属于同一簇心的点贴上同一标签(即类标号)，方法如下：通过步骤9)中得到的降序下标序号{p_i}，对整个数据集S进行条件遍历，如果p_i非簇心，根据归属矩阵将

对应的标签赋给p_i，否则p_i的标签就是其本身，最终将所有类标号相同的数据点组成集合，即簇，得到m(m＝C_{center_id})个簇C₁,C₂,...,C_m，完成对数据集S的聚类。Determining the cluster center and clustering as described in step 10) refers to first defining the cluster center sequence number as C _{center_id} , the class label of the data point as C _{cluster_label} , and initializing the cluster center sequence number as 1, that is, C _{center_id} = 1; ), the class label of the data point with the largest cluster centroid factor is 1, that is,

Then, according to the descending subscript number {pi _} obtained in step 8), conditionally traverse the entire data set S, if

and

The distances of all points satisfy

(where r is the neighborhood radius of the initial parameter value), then redefine the point as a new cluster center, increase the point class label by 1, and obtain all the cluster centers accordingly; and then use the obtained cluster center to use The attribution matrix F=[f ₁ , _f ₂ , . The obtained descending subscript number {pi _} is used to conditionally traverse the entire data set S. If _pi is not a cluster center, according to the attribution matrix,

The corresponding label is assigned to p _i , otherwise the label of p _i is itself, and finally all data points with the same class label are formed into a set, that is, a cluster, and m (m=C _{center_id} ) clusters C ₁ , C ₂ , .. .,C _m , completes the clustering of the dataset S.

步骤11)中所述是对聚类后的各个簇分别进行异常检测，异常检测的具体步骤如下：As described in step 11), anomaly detection is performed on each cluster after clustering, and the specific steps of anomaly detection are as follows:

①对于任意簇C_i(i＝1,2,…,m)，计算该簇内每个数据点相对于该簇的角度因子①For any cluster C _i (i=1,2,...,m), calculate the angle factor of each data point in the cluster relative to the cluster

为公式(7)：

is formula (7):

其中，C_i(i＝1,2,…,m)表示聚类后任意簇；Among them, C _i (i=1,2,...,m) represents any cluster after clustering;

②计算簇中每个数据点相对于其空间r邻域内的局部增量值H(X_j)为公式(8)：② Calculate the local increment value H(X _j ) of each data point in the cluster relative to its spatial r neighborhood as formula (8):

局部增量是为了反映数据点在其所属簇空间邻域内数据点的密集程度，其中，

表示了数据点X_j在其所属簇的r邻域

内的数据点数The local increment is to reflect the density of data points in the spatial neighborhood of the cluster to which they belong, where,

represents the r neighborhood of the data point X _j in the cluster to which it belongs

number of data points within

③根据步骤10)中确认的簇心，计算各数据点到所属簇的簇心之间的距离dist(o,X_j)；③ According to the cluster center confirmed in step 10), calculate the distance dist(o, X _j ) between each data point and the cluster center of the cluster to which it belongs;

④计算每个数据点与其k个最近邻点的距离和L(X_j)为公式(9)：④ Calculate the distance and L(X _j ) of each data point and its k nearest neighbors as formula (9):

其中，

表示数据点X_j在其所属簇中的k个最近邻点组成的k邻域，k个最近邻点的距离和L(X_j)反映了数据点与周围数据点远近程度，这样是为了避免基于角度的异常因子出现类似于图6中B₁存在的缺陷；in,

Indicates the k neighborhood of the k nearest neighbors of the data point X _j in the cluster to which it belongs. The distance and L(X _j ) of the k nearest neighbors reflect the distance between the data point and the surrounding data points. This is to avoid The angle-based anomaly factor appears similar to the defect in B ₁ in Figure 6;

⑤计算每个数据点的增强型角度异常因子EAOF(X_j)为公式(10)：⑤ Calculate the enhanced angular anomaly factor EAOF(X _j ) for each data point as formula (10):

表示C_i(i＝1,2,…,m)簇内每个数据点相对于该簇的角度因子，H(X_j)是局部增量值；增强型角度异常因子EAOF不仅具备角度度量方式在多维空间中优越的度量性能，而且引入了距离和密度的思想，弥补了传统的基于角度异常因子的缺陷；Among them, o is the cluster center of the cluster to which the data point X _j belongs, dist(o, X _j ) is the distance between the data point X _j and its cluster center,

Represents the angle factor of each data point in the C _i (i=1,2,...,m) cluster relative to the cluster, and H(X _j ) is the local incremental value; the enhanced angle anomaly factor EAOF not only has an angle measurement method Excellent measurement performance in multi-dimensional space, and the idea of distance and density is introduced, which makes up for the defects of traditional angle-based anomaly factors;

⑥由⑤中得到的所有数据点增强型角度异常因子，计算其均值μ及标准差δ，并用均值和标准差计算异常判决阈值θ，θ＝μ+ξ·δ，其中ξ为初始设定的异常判决阈值调整系数；⑥ Calculate the mean value μ and standard deviation δ of the enhanced angle anomaly factor of all data points obtained in ⑤, and use the mean value and standard deviation to calculate the abnormal judgment threshold θ, θ=μ+ξ·δ, where ξ is the initial setting Abnormal judgment threshold adjustment coefficient;

⑦将⑤中得到的各点增强型角度异常因子EAOF(X_j)与⑥中得到判决阈值θ进行比较，若满足EAOF(X_j)＞θ，则将该点标记为该簇中的候选异常对象，并存入本簇的候选异常点集合O_i中。⑦Compare the enhanced angular anomaly factor EAOF(X _j ) obtained in ⑤ with the decision threshold θ obtained in ⑥. If EAOF(X _j )>θ is satisfied, mark the point as a candidate anomaly in the cluster object, and stored in the candidate outlier set O _i of this cluster.

本实施例提供了一种基于增强型角度异常因子的数据流异常检测及多重验证的方法，该方法采用了滑动窗口和基本窗口相结合的技术，构建了高效的实时数据流处理技术，同时引入了增强型角度异常因子，在解决传统方法内存占用率高，处理数据效率低的同时，又保证了方法的实时性好，异常检测准确率高及时间复杂度较低的优点。This embodiment provides a method for data stream anomaly detection and multiple verification based on an enhanced angle anomaly factor. The method adopts the technology of combining sliding window and basic window to construct an efficient real-time data stream processing technology. At the same time, it introduces The enhanced angle anomaly factor is adopted, which not only solves the traditional method's high memory occupancy rate and low data processing efficiency, but also ensures the method's good real-time performance, high anomaly detection accuracy and low time complexity.

为了验证本实施例方法的有效性，将通过仿真结果对比进一步说明：In order to verify the effectiveness of the method of this embodiment, it will be further explained by comparing the simulation results:

本实施例在人工生成的数据集和真实数据集中均进行了验证，并与传统方法I-IncLOF和Thakran等人提出的基于加权聚类的数据流非监督异常检测方法(简记为方法I)进行了对比，实验数据集信息如表1所示，表1为实验数据集信息，三个数据集分别是具有不同维度、不同数据量、不同数据特性的数据集。This embodiment has been verified in both artificially generated datasets and real datasets, and is in agreement with the traditional method I-IncLOF and the weighted clustering-based data stream unsupervised anomaly detection method (referred to as Method I) proposed by Thakran et al. For comparison, the experimental data set information is shown in Table 1. Table 1 is the experimental data set information. The three data sets are data sets with different dimensions, different data amounts, and different data characteristics.

人工数据集1的数据分布如图11a所示，共有1615个数据点，由5个簇和15个离散点组成，其中簇1是由高斯分布N₁(u₁,Σ₁)生成的500个数据点组成，簇2是由高斯分布N₂(u₂,Σ₂)生成的500个数据点组成，簇3是由高斯分布N₃(u₃,Σ₃)生成的500个数据点组成，簇4和簇5分别是由高斯分布N₄(u₄,Σ₄)和N₅(u₅,Σ₅)生成的50个数据点组成，且由于N₄和N₅数据点数极少，因此被视作异常簇。同时，根据数据集的分布特性，随机生成了15个离散的异常点，因此该数据集共包含115个异常点，分布情况见图11b，异常点用圆圈标出，在实验过程中，异常簇和离散异常点随机混入正常簇中，以下是高斯分布生成数据集1所使用的参数：The data distribution of artificial dataset 1 is shown in Figure 11a, with a total of 1615 data points, consisting of 5 clusters and 15 discrete points, of which cluster 1 is 500 generated by Gaussian distribution N ₁ (u ₁ , Σ ₁ ) It consists of data points, cluster 2 is composed of 500 data points generated by Gaussian distribution N ₂ (u ₂ , Σ ₂ ), and cluster 3 is composed of 500 data points generated by Gaussian distribution N ₃ (u ₃ , Σ ₃ ), Cluster 4 and cluster 5 are composed of 50 data points generated by Gaussian distributions N ₄ (u ₄ , Σ ₄ ) and N ₅ (u ₅ , Σ ₅ ), respectively, and since the number of N ₄ and N ₅ data points is very small, so Considered an anomalous cluster. At the same time, according to the distribution characteristics of the data set, 15 discrete abnormal points are randomly generated, so the data set contains a total of 115 abnormal points. The distribution is shown in Figure 11b. The abnormal points are marked with circles. During the experiment, abnormal clusters and discrete outliers randomly mixed into normal clusters, the following are the parameters used to generate dataset 1 from a Gaussian distribution:

μ₁＝[+1 +1]，μ₂＝[-1 -1],μ₃＝[+1 -1],μ₄＝[-1 +1],μ₅＝[0 0]μ ₁ =[+1 +1], μ ₂ =[-1 -1], μ ₃ =[+1 -1], μ ₄ =[-1 +1], μ ₅ =[0 0]

人工数据集2的数据分布如图12a所示，共有860个数据点，由3个正常簇和1个异常簇以及48个离散异常点组成，其中异常簇由21个异常点组成。因此该数据集共有69个异常点，异常点的分布情况如图12b所示。The data distribution of artificial dataset 2 is shown in Figure 12a, with a total of 860 data points, consisting of 3 normal clusters, 1 abnormal cluster and 48 discrete abnormal points, of which the abnormal cluster consists of 21 abnormal points. Therefore, there are 69 outliers in this dataset, and the distribution of outliers is shown in Figure 12b.

真实数据集Breast Cancer如表1所示，该数据集来源于UCI机器学习库，包含699个数据点，有两个正常簇组成，其中，为了验证方法的有效性，本实施例根据均值、方差等统计特性向真实数据集中添加了34个异常点，用于异常检测的对比验证。The real data set Breast Cancer is shown in Table 1. The data set comes from the UCI machine learning library, contains 699 data points, and consists of two normal clusters. Statistical features such as 34 anomaly points are added to the real dataset for comparative verification of anomaly detection.

在本实施例方法的验证实验中，设置基础窗口的长度为20，两个基础窗口组成一个滑动窗口，且最近邻点数k＝3，空间邻域半径定为当前时刻滑动窗口中数据点间距离值降序排列的前20％距离值的均值，异常判决阈值调整系数定为2.5，多重验证的次数定为3次，同时，选取最能体现异常检测方法有效性的检测率和误判率进行比较，如图11a-11d、图12a-12d所示，为数据集1和数据集2的可视化实验结果。In the verification experiment of the method of this embodiment, the length of the base window is set to 20, two base windows form a sliding window, and the number of nearest neighbors k=3, and the radius of the spatial neighborhood is set as the distance between data points in the sliding window at the current moment The average value of the top 20% of the distance values in descending order, the abnormal judgment threshold adjustment coefficient is set to 2.5, and the number of multiple verifications is set to 3 times. , as shown in Figures 11a-11d and Figures 12a-12d, which are the visual experimental results of dataset 1 and dataset 2.

对于人工数据集1，从图11a-11d可以看出，运用该方法，2个异常簇和15个离散异常点都能被有效地检测出来，达到了零漏检的效果，从图11d可知，有3个正常点被错检为异常点，原因是这些正常点虽然是由正常的高斯分布生成的，但偏离正常簇稍远，在连续3次的多重验证中均表现为异常，因此被判定为异常点；For artificial data set 1, it can be seen from Figure 11a-11d that using this method, 2 abnormal clusters and 15 discrete abnormal points can be effectively detected, achieving the effect of zero missed detection. From Figure 11d, it can be seen that, There are 3 normal points that were wrongly detected as abnormal points. The reason is that although these normal points are generated by normal Gaussian distribution, they deviate from the normal cluster a little bit, and they are abnormal in 3 consecutive multiple verifications, so they are judged. is an abnormal point;

对于人工数据集2，从图12a-12d可以看出，在三维数据空间中，该方法仍然能保持很好的有效性，从图12b、12c、12d可以看出，异常簇中的点均能被检测出来，而在48个离散异常点中有47个被检测出来，有一个离散异常点漏检，漏检的原因是该漏检点离正常簇距离较近，使得在多重验证中某次表现为正常，因此被判定为正常点。For artificial dataset 2, it can be seen from Figures 12a-12d that in the three-dimensional data space, the method can still maintain good effectiveness. From Figures 12b, 12c, 12d, it can be seen that the points in the abnormal cluster It was detected, and 47 of the 48 discrete anomaly points were detected, and one discrete anomaly point was missed. The reason for the missed detection was that the missing point was close to the normal cluster, which made a certain performance in multiple verifications. is normal, so it is judged as a normal point.

在验证本实施例方法有效性的同时，还将本实施例方法与传统方法进行了对比，进一步验证了本实施例方法的优势，如表2所示，表2为实验结果统计信息，三个数据集上对三种方法进行对比实验的详细统计结果。从表2可以看出，本实施例提出的方法检测率高，误判率低，有效性明显优于另外两种方法，而且当数据集的维数越高时该方法的优越性越明显，方法I结合了W-K-Means与DBSCAN方法，并动态更新DBSCAN所需的参量和每一维的权重，因此方法I对动态数据流有很好的适应性，但由于其运用的是传统的基于距离和密度的异常度量方式，因此当维数增加时，有效性会降低；而I-IncLOF方法运用的是基于局部密度的思想，同样会受到维度灾难的影响，数据维数较低时其表现较好，但维数增多时，有效性较差。While verifying the effectiveness of the method in this embodiment, the method in this embodiment is compared with the traditional method to further verify the advantages of the method in this embodiment. As shown in Table 2, Table 2 is the statistical information of the experimental results. Detailed statistical results of comparative experiments on the three methods on the dataset. It can be seen from Table 2 that the method proposed in this example has a high detection rate, a low false positive rate, and is significantly more effective than the other two methods. Moreover, when the dimension of the data set is higher, the superiority of this method is more obvious. Method I combines the W-K-Means and DBSCAN methods, and dynamically updates the parameters required by DBSCAN and the weight of each dimension. Therefore, method I has good adaptability to dynamic data flow, but because it uses the traditional distance-based method Therefore, when the dimension increases, the effectiveness will be reduced; while the I-IncLOF method uses the idea of local density, which is also affected by the disaster of dimensionality, and its performance is better when the data dimension is low. Good, but less effective as the number of dimensions increases.

通过以上对不同数据集的验证及与传统方法的对比分析，可见，本实施例提出的基于增强型角度异常因子的数据流异常检测及多重验证的方法具有更好的有效性和可行性。Through the above verification of different data sets and comparative analysis with traditional methods, it can be seen that the method for data flow anomaly detection and multiple verification based on the enhanced angle anomaly factor proposed in this embodiment has better effectiveness and feasibility.

表1Table 1

表2Table 2

Claims

1. A method for data flow abnormity detection and multiple verification based on an enhanced angle abnormity factor is characterized by comprising the following steps:

1) processing the real-time data stream: forming data obtained in a time period into data blocks according to the minimum unit forming the data and the time sequence during acquisition of various real-time data streams acquired by a data acquisition terminal, and forming a data set S processed by a sliding window by using a plurality of data blocks;

2) setting the result obtained after the processing in the step 1) to obtain a data set S in the current sliding window, wherein S is { X ═ X₁,X₂,...,X_nN data points, each data point being represented by its attribute

Wherein

Represents the data point x_iD attributes are used for subsequent clustering and anomaly detection;

3) setting initialization parameters k, r and ξ, wherein k represents the number of k nearest neighbors of a data point, r is the spatial neighborhood radius of the data point, ξ is an abnormal decision threshold adjustment coefficient, and an abnormal decision threshold theta is mu + ξ. delta, wherein mu and delta correspond to the mean value and standard deviation of all data point enhanced angle abnormal factors;

4) obtaining a distance matrix dist, namely combining the data set S in the step 2), calculating the distances among all data points, and obtaining a distance matrix dist of n × n, wherein the dist is [ d_ij]_n×nThe calculation formula is formula (1):

wherein X_iAnd Y_jAre all data points in the set S, and x_ikRepresents the data point x_iThere are k attributes, y_jkRepresents the data point y_jThere are k attributes;

5) obtaining a r neighborhood point set: according to the spatial neighborhood radius r, obtaining an r neighborhood point set of each data point, namely a set of all circled data points at the point by taking the neighborhood radius r as the radius;

6) obtaining an angle factor of a r neighborhood point set

And local density

Obtaining an angle factor of the r neighborhood point set by combining the distance matrix dist

And local density of r neighborhood point set

Wherein N is_rData point x_iR neighborhood of (a);

7) obtaining a dissimilarity degree delta (x)_i): according to the local density of the r neighborhood point set obtained in the step 6)

After sorting, the corresponding dissimilarity degree delta (x) is calculated_i)；

8) Obtaining a cluster heart factor τ (x) for each data point_i): combining the step 6) and the step 7) to obtain the cluster heart factor tau (x)_i) Is formula (5):

cluster heart factor τ (x)_i) To measure how well the data points are at the cluster center;

9) acquiring an attribution matrix: sorting all the data point cluster heart factors obtained in the step 8) in a descending order to obtain tau (p)₁)≥τ(p₂)≥…≥τ(p_n) Wherein p is_nOriginal serial numbers representing corresponding data points, resulting in a home matrix F ═ F for clustering₁,f₂,...,f_n]；

10) Determining cluster centers and clustering: performing cluster center determination and clustering on the data set S by using the cluster center factor and the attribution matrix, forming a set, namely a cluster, from all data points with the same class label to obtain m-C_{center_id}An individual cluster C₁,C₂,...,C_mIn which C is_{center_id}The cluster center is the serial number of the cluster center, and the clustering of the data set S is completed;

11) and respectively carrying out anomaly detection on each clustered cluster: obtaining each cluster C in step 10)_i(i 1,2, …, m), each cluster C in the clustered data set S is first sorted₁,C₂,...,C_mRespectively carrying out anomaly detection to obtain a cluster of anomaly point set O_iFinally, all abnormal point sets O ═ O { O in the data set S are obtained₁,...,O_mThe formula involved in anomaly detection is: intra cluster angle factor

Is formula (7):

where A, B, C are the data points in the data set,

a represents the ith cluster and each data point in the cluster has a d-dimensional attribute,

is within the neighborhood of the data point qThe number of data points of (a),

local increment value H (X)_j) Is formula (8):

distance sum of k nearest neighbors L (X)_j) Is of formula (9):

wherein,

represents the data point X_jK neighborhoods consisting of k nearest neighbors in the cluster to which the neighbors belong;

enhanced angular anomaly factor EAOF (X)_j) Is formula (10):

wherein o is the data point X_jCluster center of the cluster, dist (o, X)_j) Is a data point X_jThe distance from the cluster center of the cluster,

is represented by C_i(i-1, 2, …, m) angle factor of each data point within a cluster relative to the cluster, H (X)_j) Is a local delta value;

12) and (3) multiple verification: and verifying all candidate abnormal points for multiple times, judging the candidate abnormal points which are still shown to be abnormal after limited verification as determined abnormal points, outputting and storing the determined abnormal points, and directly discarding the abnormal points if the candidate abnormal points are shown to be normal points in the verification process.

2. The method for data stream anomaly detection and multi-validation based on enhanced angle anomaly factor as claimed in claim 1, wherein said processing in step 1) refers to data collected by a data collection terminalBuffering in stream form, and dividing buffered data into E₀,E₁,E₂,... each data block represents a basic window, each sliding window W contains 2 basic windows, and the process of inserting and deleting data is realized by combining the basic window and the sliding window, and the process of combining the basic window and the sliding window is as follows: at T_iTime of day transition to T_i+1At the moment, the sliding window is formed by W_iSlide to W_i+1Accompanied by a new basic window E_i+1Merging and history base window E_i-1While removing T_iTime W_iIncorporation of detected candidate outliers into W_i+1In (3) performing multiple validations.

3. The method for data stream anomaly detection and multi-verification based on enhanced angle anomaly factor as claimed in claim 1, wherein the calculation formula of the angle factor of the r neighborhood point set in step 6) is formula (2):

4. the method for data stream anomaly detection and multi-verification based on enhanced angle anomaly factor as claimed in claim 1, wherein said local density calculation formula of r neighborhood point set in step 6) is formula (3):

wherein N is_r(p) is the r neighborhood of the data point p, q is any one data point in the r neighborhood set of the data point p, the local density is related to the number of neighborhood data points and the position of the neighborhood data points, and the more the number of neighborhood data points is, the more the neighborhood data points are located in the center of the data set, the larger the local density is.

5. The enhanced angle anomaly factor based data flow anomaly according to claim 1The method for detection and multiple verification, characterized in that the dissimilarity degree delta (x) in the step 7)_i) The local densities of all data points are sorted in descending order, and the dissimilarity degree delta (x)_i) The calculation formula of (2) is formula (4):

wherein p is_iAnd p_jIs the serial number of the corresponding data point, when i is 1, j is more than or equal to 2; when i is larger than or equal to 2, j is smaller than i.

6. The method for enhanced angle anomaly factor based data stream anomaly detection and multi-verification according to claim 1, wherein said home matrix F ═ F in step 9)₁,f₂,...,f_n]The formula is used for recording the attribution relationship between data points, and the expression formula of each element is formula (6):

wherein, { p_iDenotes the cluster heart factor τ (x)_i) The original subscript numbers sorted in descending order.