CN107682319B - A method for data flow anomaly detection and multiple verification based on enhanced angle anomaly factor - Google Patents
A method for data flow anomaly detection and multiple verification based on enhanced angle anomaly factor Download PDFInfo
- Publication number
- CN107682319B CN107682319B CN201710823063.0A CN201710823063A CN107682319B CN 107682319 B CN107682319 B CN 107682319B CN 201710823063 A CN201710823063 A CN 201710823063A CN 107682319 B CN107682319 B CN 107682319B
- Authority
- CN
- China
- Prior art keywords
- data
- cluster
- point
- factor
- neighborhood
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Links
Images
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1408—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
- H04L63/1425—Traffic logging, e.g. anomaly detection
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
- G06F16/24568—Data stream processing; Continuous queries
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2465—Query processing support for facilitating data mining operations in structured databases
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D30/00—Reducing energy consumption in communication networks
- Y02D30/50—Reducing energy consumption in communication networks in wire-line communication networks, e.g. low power modes or reduced link rate
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- Computer Security & Cryptography (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Probability & Statistics with Applications (AREA)
- Software Systems (AREA)
- Mathematical Physics (AREA)
- Fuzzy Systems (AREA)
- Computer Hardware Design (AREA)
- Computing Systems (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- External Artificial Organs (AREA)
- Medicines Containing Antibodies Or Antigens For Use As Internal Diagnostic Agents (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
技术领域technical field
本发明涉及数据流异常检测和数据聚类,尤其涉及一种基于增强型角度异常因子的数据流异常检测及多重验证的方法。The invention relates to data flow anomaly detection and data clustering, in particular to a method for data flow anomaly detection and multiple verification based on an enhanced angle anomaly factor.
背景技术Background technique
网络技术的飞速发展和社会信息化的不断提高,引发了信息量的爆炸式增长,使得各行各业产生了海量、高速、动态的流数据,如网络入侵监测、商业交易管理和分析、视频监控、传感网络监控等。由于动态数据流的实时无限等特点,传统的静态数据异常检测方法已不能准确有效的分析和处理如此大规模动态增长的流数据,因此构建一种适用于数据流的实时有效异常检测方法变得尤其重要。The rapid development of network technology and the continuous improvement of social informatization have led to the explosive growth of the amount of information, resulting in the generation of massive, high-speed and dynamic streaming data in all walks of life, such as network intrusion monitoring, business transaction management and analysis, and video surveillance. , sensor network monitoring, etc. Due to the infinite real-time characteristics of dynamic data streams, traditional static data anomaly detection methods can no longer accurately and effectively analyze and process such large-scale and dynamically growing streaming data. Therefore, building a real-time effective anomaly detection method suitable for data streams becomes Especially important.
针对不同阶段面临的实际问题,科技工作者提出了不同的数据流异常检测方法。现有的数据流异常检测方法大致可分为基于密度的数据流异常检测方法、基于角度的数据流异常检测方法、基于聚类的数据流异常检测方法。基于密度的数据流异常检测方法运用密度作为最基本的异常度量方式,并构造出能动态更新的用于度量数据异常程度的异常因子,Pokrajac等人将静态数据异常检测方法LOF引用到数据流中,研究出一种能运用到动态数据流中的增量式局部异常检测方法INCLOF,随着新数据的插入,INCLOF删除历史数据并动态更新各数据点的异常因子;Ke Gao等人对INCLOF方法进行了改进,引入滑动窗口的思想,提出了n-INCLOF方法,n-INCLOF方法仅更新当前时刻滑动窗口内各数据对象的异常因子;在有些情况下,有的数据点在某一时刻表现为异常,但在下一时刻并不表现异常,基于这个问题,Karimian S H等人提出了I-IncLOF方法,I-IncLOF方法引入了多重验证的思想,I-IncLOF方法将那些在窗口的整个滑动过程中始终表现为异常的数据对象判定为确定异常点,I-IncLOF方法大大降低了误判率,但I-IncLOF方法在多维情况下有效性较差;XinjieLu等人提出了INCLOCI方法,INCLOCI引入多粒度异常因子MDEF,INCLOCI方法不仅能检测出分散的异常点,而且能检测出异常簇。为了解决距离、密度等相似度度量方式在高维数据空间中有效性降低的问题,一些科研工作者提出了角度的度量方式,角度的相似度度量的基本思想就是异常点与其它点所成的角度普遍偏小且波动范围小、常规点与其它点所成的角度有大有小且波动范围大,HP Kriegel等人提出了基于角度的异常检测方法ABOD,ABOD方法将角度的方差作为度量数据点异常程度的异常因子ABOF,ABOD方法在高维空间中仍然有很高的检测准确率;Ye H提出了基于角度的数据流异常检测方法DSABOD,随着数据流数据点不断流入内存,DSABOD方法动态更新每个数据点相对于其邻域点的异常因子,DSABOD方法为高维数据流中的异常检测提出了一种新的思路,但传统的基于角度的数据流异常检测方法存在异常检测率较低的问题。基于聚类的数据流异常检测方法分为对数据点进行聚类和对每个簇中数据点进行异常检测这两个阶段,Elahi M等人提出了一种基于聚类的数据流异常检测方法,将K-Means和LOF相结合的方法,该方法分区域定义异常因子,提高了方法对异常检测的准确率;Thakran Y等人又提出了将DBSCAN方法与W-K-Means方法相结合的方法,该方法对当前时刻的数据块运用DBSCAN方法进行聚类并得到候选异常点和初始簇,该方法结合前一时刻得到的待多重验证的候选异常点,运用W-K-Means方法再次进行聚类,得到当前时刻的候选异常点和常规点簇,同时该方法对候选异常点采用多重验证删除误判的异常点释放内存,该方法在整个过程中动态调整DBSCAN方法所需要的参量MinPts、Epsilon、W-K-Means方法的属性权重,该方法对异常检测的准确度较高,但需要的人为设定参量过多,人工干预严重,方法的复杂度较高,且该方法在多维空间中的有效性较差。In view of the practical problems faced at different stages, scientific and technological workers have proposed different data flow anomaly detection methods. Existing data stream anomaly detection methods can be roughly divided into density-based data stream anomaly detection methods, angle-based data stream anomaly detection methods, and cluster-based data stream anomaly detection methods. The density-based data stream anomaly detection method uses density as the most basic anomaly measurement method, and constructs an anomaly factor that can be dynamically updated to measure the degree of data anomaly. Pokrajac et al. introduced the static data anomaly detection method LOF to the data stream. , developed an incremental local anomaly detection method INCLOF that can be applied to dynamic data streams. With the insertion of new data, INCLOF deletes historical data and dynamically updates the anomaly factor of each data point; Ke Gao et al. Improvements are made, the idea of sliding window is introduced, and the n-INCLOF method is proposed. The n-INCLOF method only updates the abnormal factor of each data object in the sliding window at the current moment; in some cases, some data points appear as Abnormal, but does not behave abnormally at the next moment. Based on this problem, Karimian S H et al. proposed the I-IncLOF method. The I-IncLOF method introduced the idea of multiple verifications. The I-IncLOF method will be in the whole sliding process of the window The data objects that always appear abnormal are determined to be abnormal points. The I-IncLOF method greatly reduces the false positive rate, but the I-IncLOF method is less effective in multi-dimensional situations; XinjieLu et al. proposed the INCLOCI method, and INCLOCI introduced multi-granularity The anomaly factor MDEF, INCLOCI method can not only detect scattered outliers, but also detect outlier clusters. In order to solve the problem that the similarity measurement methods such as distance and density are less effective in high-dimensional data space, some researchers have proposed angle measurement methods. The basic idea of angle similarity measurement is that abnormal points and other points are formed. The angle is generally small and the fluctuation range is small. The angle formed by the conventional point and other points is large or small and the fluctuation range is large. HP Kriegel et al. proposed an angle-based anomaly detection method ABOD. The ABOD method uses the variance of the angle as the measurement data. The anomaly factor ABOF of the point anomaly degree, the ABOD method still has a high detection accuracy in high-dimensional space; Ye H proposed an angle-based data flow anomaly detection method DSABOD, as the data flow data points continue to flow into the memory, the DSABOD method Dynamically update the anomaly factor of each data point relative to its neighbor points. The DSABOD method proposes a new idea for anomaly detection in high-dimensional data streams, but the traditional angle-based data stream anomaly detection methods have anomaly detection rates. lower problem. Clustering-based data flow anomaly detection method is divided into two stages: clustering data points and anomaly detection of data points in each cluster. Elahi M et al. proposed a clustering-based data flow anomaly detection method. , the method of combining K-Means and LOF, this method defines anomaly factors in different regions, which improves the accuracy of the method for anomaly detection; Thakran Y et al. proposed a method that combines the DBSCAN method with the W-K-Means method, This method uses the DBSCAN method to cluster the data blocks at the current moment and obtains candidate outliers and initial clusters. The method combines the candidate outliers obtained at the previous moment to be multi-validated, and uses the W-K-Means method to cluster again, and obtains The candidate outliers and regular point clusters at the current moment, at the same time, the method uses multiple verifications for the candidate outliers to delete the misjudged outliers to release the memory, and the method dynamically adjusts the parameters MinPts, Epsilon, W-K- The attribute weight of the Means method. This method has high accuracy for anomaly detection, but it requires too many artificial parameters, serious manual intervention, high complexity of the method, and poor effectiveness in multi-dimensional space. .
数据流异常检测是现今数据挖掘领域中的研究热点和难点,主要目标是从动态变化的复杂数据环境中实时准确的检测出与常规模式不相符的信息。Data stream anomaly detection is a research hotspot and a difficult point in the field of data mining.
发明内容SUMMARY OF THE INVENTION
针对传统方法的时间复杂度高、内存占用大且使用效率低、人为参数干预过多以及多维数据环境下有效性低等问题,本发明提供了一种基于增强型角度异常因子的数据流异常检测及多重验证的方法。这种方法能降低内存的占用率,实时性好、异常检测准确率高、时间复杂度低。Aiming at the problems of high time complexity, large memory occupation, low use efficiency, excessive human parameter intervention, and low effectiveness in multi-dimensional data environment of traditional methods, the present invention provides a data flow anomaly detection based on an enhanced angle anomaly factor and multiple authentication methods. This method can reduce the occupancy rate of memory, has good real-time performance, high anomaly detection accuracy, and low time complexity.
实现本发明目的的技术方案是:The technical scheme that realizes the object of the present invention is:
一种基于增强型角度异常因子的数据流异常检测及多重验证的方法,包括如下步骤:A method for data flow anomaly detection and multiple verification based on an enhanced angle anomaly factor, comprising the following steps:
1)对实时数据流进行处理:对数据采集终端采集的各式各样的实时数据流,按照构成数据的最小单位和采集时的时间顺序,将一个时间段获得的数据组成数据块,将几个数据块构成滑动窗口处理的数据集S,为步骤2)处理做准备;1) Process the real-time data stream: For various real-time data streams collected by the data collection terminal, according to the smallest unit of data and the time sequence of collection, the data obtained in a period of time is formed into data blocks, and several The data blocks constitute the data set S processed by the sliding window, and prepare for the processing in step 2);
2)对所述步骤1)处理后得到的结果进行设置得到当前滑动窗口中数据集S,设S={X1,X2,...,Xn},其中包含n个数据点,每个数据点根据其属性表示为其中表示数据点xi有d个属性,用于后续聚类和异常检测;2) Set the result obtained after the processing in step 1) to obtain the data set S in the current sliding window, set S={X 1 , X 2 ,..., X n }, which contains n data points, each data points are represented according to their properties as in Indicates that the data point x i has d attributes for subsequent clustering and anomaly detection;
3)初始化参数k、r、ξ:设定初始化参数,其中k表示数据点的k个最近邻点的个数,r为数据点的空间邻域半径,ξ为异常判决阈值调整系数,异常判决阈值θ=μ+ξ·δ,其中μ和δ对应所有数据点增强型角度异常因子的均值及标准差;3) Initialization parameters k, r, ξ: set the initialization parameters, where k represents the number of k nearest neighbors of the data point, r is the radius of the spatial neighborhood of the data point, ξ is the adjustment coefficient of abnormal judgment threshold, abnormal judgment Threshold θ=μ+ξ·δ, where μ and δ correspond to the mean and standard deviation of the enhanced angle anomaly factor of all data points;
4)获取距离矩阵dist:结合步骤2)中的数据集S,计算所有数据点之间的距离,得到n×n的距离矩阵dist,dist=[dij]n×n,计算公式为公式(1):4) Obtain the distance matrix dist: Combine the data set S in step 2), calculate the distance between all data points, and obtain the distance matrix dist of n×n, dist=[d ij ] n×n , the calculation formula is the formula ( 1):
其中Xi和Yj都是集合S中的数据点,并且xik表示数据点xi有k个属性,yjk表示数据点yj有k个属性;where X i and Y j are both data points in the set S, and x ik indicates that the data point x i has k attributes, and y jk indicates that the data point y j has k attributes;
5)得到r邻域点集合:根据空间邻域半径r,得到每个数据点的r邻域点集合即该点处以邻域半径r为半径,圈住的所有数据点的集合;5) Obtain the r neighborhood point set: according to the spatial neighborhood radius r, obtain the r neighborhood point set of each data point, that is, the set of all data points circled at the point with the neighborhood radius r as the radius;
6)得到r邻域点集合的角度因子和局部密度结合距离矩阵dist得到r邻域点集合的角度因子和r邻域点集合的局部密度其中Nr数据点xi的r邻域;6) Get the angle factor of the r neighborhood point set and local density Combine the distance matrix dist to get the angle factor of the r neighborhood point set and r the local density of the set of neighbor points where N r the r neighborhood of data point x i ;
7)获取相异度δ(xi):根据步骤6)中得到的r邻域点集合的局部密度进行排序后,计算对应的相异度δ(xi);7) Obtain the dissimilarity δ(x i ): according to the local density of the r neighborhood point set obtained in step 6) After sorting, calculate the corresponding dissimilarity δ(x i );
8)获取各数据点的簇心因子τ(xi):结合步骤6)和步骤7)获取簇心因子τ(xi),计算公式为公式(5):簇心因子τ(xi)用来度量数据点处于簇心的程度;8) Obtain the cluster center factor τ(x i ) of each data point: Combine step 6) and step 7) to obtain the cluster center factor τ(x i ), and the calculation formula is formula (5): The cluster center factor τ(x i ) is used to measure the degree to which the data points are located in the cluster center;
9)获取归属矩阵:对步骤8)中得到的所有数据点簇心因子,进行降序排序得τ(p1)≥τ(p2)≥…≥τ(pn),其中pn表示对应数据点的原始序号,从而得到用于聚类的归属矩阵F=[f1,f2,...,fn];9) Obtain the attribution matrix: sort all the data point cluster center factors obtained in step 8) in descending order to obtain τ(p 1 )≥τ(p 2 )≥...≥τ(p n ), where pn represents the corresponding data The original sequence number of the point, so as to obtain the attribution matrix F=[f 1 , f 2 ,..., f n ] for clustering;
10)确定簇心并聚类:利用簇心因子和归属矩阵对数据集S进行簇心确定及聚类,将所有类标号相同的数据点组成集合,即簇,得到m=Ccenter_id个簇C1,C2,...,Cm,其中Ccenter_id为簇心序号,完成对数据集S的聚类;10) Determine the cluster center and cluster: use the cluster center factor and the attribution matrix to determine and cluster the data set S, and form a set of all data points with the same class label, that is, a cluster, and obtain m=C center_id clusters C 1 ,C 2 ,...,C m , where C center_id is the cluster center number, and the clustering of the data set S is completed;
11)对聚类后的各簇分别进行异常检测:步骤10)中得到各个簇Ci(i=1,2,…,m),先对聚类后的数据集S中的各个簇C1,C2,...,Cm,分别进行异常检测,得到各簇异常点集合Oi,最终得到数据集S中所有异常点集合O={O1,...,Om},异常检测所涉及的公式有:簇内角度因子为公式(7):11) Perform anomaly detection on each clustered cluster: in step 10), each cluster C i (i=1, 2, ..., m) is obtained, first, each cluster C 1 in the clustered data set S is ,C 2 ,...,C m , carry out anomaly detection respectively, obtain each cluster outlier set O i , and finally obtain all outlier sets in the data set S O={O 1 ,...,O m }, anomaly The formulas involved in the detection are: In-cluster angle factor is formula (7):
其中A、B、C是数据集中的数据点,阿表示第个i簇且簇中每个数据点均有d维属性,是数据点q邻域内的数据点个数,where A, B, C are the data points in the dataset, A represents the i-th cluster and each data point in the cluster has a d-dimensional attribute, is the number of data points in the neighborhood of data point q,
局部增量值H(Xj)为公式(8): The local increment value H(X j ) is formula (8):
k个最近邻点的距离和L(Xj)为公式(9):The distance sum L(X j ) of the k nearest neighbors is formula (9):
其中,表示数据点Xj在其所属簇中的k个最近邻点组成的k邻域;in, Represents the k neighborhood of the k nearest neighbors of the data point X j in the cluster to which it belongs;
增强型角度异常因子EAOF(Xj)为公式(10):The enhanced angular anomaly factor EAOF(X j ) is formula (10):
其中,o是数据点Xj所属簇的簇心,dist(o,Xj)为数据点Xj与其簇心的距离,表示簇Ci(i=1,2,…,m)内每个数据点相对于该簇的角度因子,H(Xj)是局部增量值;Among them, o is the cluster center of the cluster to which the data point X j belongs, dist(o, X j ) is the distance between the data point X j and its cluster center, represents the angle factor of each data point in the cluster C i (i=1,2,...,m) relative to the cluster, H(X j ) is the local increment value;
12)多重验证:对所有候选异常点进行多次验证,并将经过有限次验证仍表现为异常的候选异常点判决为确定异常点并输出保存,若在验证过程中表现为正常点,则将该异常点直接丢弃。12) Multiple verification: Perform multiple verifications on all candidate abnormal points, and judge the candidate abnormal points that are still abnormal after limited verification as the determined abnormal points and output and save them. If they are normal points during the verification process, they will be The outlier is directly discarded.
步骤1)中所述的处理是指数据采集终端采集的数据以流的形式缓存,并将缓存的数据划分成E0,E1,E2,......数据块,每一个数据块代表一个基础窗口,每个滑动窗口W包含ε=2个基本窗口,采用基础窗口和滑动窗口相结合,实现数据的插入和删除,基础窗口和滑动窗口相结合的过程为:在Ti时刻过渡到Ti+1时刻,滑动窗口由Wi滑到Wi+1,伴随着新基础窗口Ei+1的并入和历史基础窗口Ei-1的移除,同时,将Ti时刻Wi检测的候选异常点并入到Wi+1中进行多重验证。The processing described in step 1) means that the data collected by the data collection terminal is buffered in the form of a stream, and the buffered data is divided into E 0 , E 1 , E 2 , . . . data blocks, each data block The block represents a basic window, and each sliding window W contains ε=2 basic windows. The combination of the basic window and the sliding window is used to realize the insertion and deletion of data. The process of combining the basic window and the sliding window is: At time T i Transitioning to time T i+1 , the sliding window is slid from Wi to Wi +1 , along with the merging of the new base window E i +1 and the removal of the historical base window E i-1 , at the same time, the time T i is Candidate outliers detected by Wi are incorporated into Wi +1 for multiple validation.
步骤6)中所述的r邻域点集合的角度因子计算公式为公式(2):The calculation formula of the angle factor of the r neighborhood point set described in step 6) is formula (2):
步骤6)中所述的r邻域点集合的局部密度计算公式为公式(3):The formula for calculating the local density of the r neighborhood point set described in step 6) is formula (3):
其中Nr(p)是数据点p的r邻域,q是数据点p的r邻域集合中的任意一个数据点,局部密度与邻域数据点数以及所处的位置有关,邻域数据点数越多、越处于数据集的中心,局部密度就越大。where N r (p) is the r neighborhood of the data point p, q is any data point in the r neighborhood set of the data point p, the local density is related to the number of neighborhood data points and their location, the number of neighborhood data points The more and the more in the center of the dataset, the greater the local density.
步骤7)中所述的相异度δ(xi)是对所有数据点的局部密度进行降序排序后得到,相异度δ(xi)的计算公式为公式(4):The dissimilarity δ(x i ) described in step 7) is obtained by sorting the local densities of all data points in descending order, and the calculation formula of the dissimilarity δ(x i ) is formula (4):
其中pi和pj是对应数据点的序号,当i=1时,j≥2;当i≥2时,j<i。Where p i and p j are the serial numbers of the corresponding data points, when i=1, j≥2; when i≥2, j<i.
步骤9)中所述的归属矩阵F=[f1,f2,...,fn],是用于记录数据点之间归属关系,各元素的表示公式为公式(6):The attribution matrix F=[f 1 , f 2 , .
其中,{pi}表示簇心因子τ(xi)降序排序后的原始下标序号。Among them, {pi } represents the original index number after the cluster center factor τ(x i ) is sorted in descending order.
这种数据流异常检测方法,分为2个过程,即数据流处理过程和数据流异常检测过程。数据流处理过程是将动态的数据流转化为静态数据块,方便后续异常检测,同时又要保证整个检测的实时性和高效性;数据流异常检测过程是对经数据流处理过程处理好之后的静态数据集进行异常检测,为了提高异常检测的准确率,采用了先聚类后异常检测的方法。在本技术方案中,滑动窗口和基础窗口结合的实时数据流处理方法,是数据流处理过程的核心,其在减少内存占用率的同时,提高了后续异常检测的质量,簇心因子和归属矩阵是本技术方案中新引进用于确定簇心和聚类的两个参数,能快速有效的确定多维数据空间的簇心,并根据确定的簇心准确聚类;增强型角度异常因子是本技术方案中另一个重要的参数,其弥补了传统异常因子的部分缺陷,保留了角度度量方式在多维空间中的有效性,是异常检测部分的核心。This data stream anomaly detection method is divided into two processes, namely, a data stream processing process and a data stream anomaly detection process. The data stream processing process is to convert the dynamic data stream into static data blocks, which is convenient for subsequent abnormal detection, and at the same time ensures the real-time and high efficiency of the whole detection; the data stream abnormal detection process is processed after the data stream processing process. Anomaly detection is performed on static data sets. In order to improve the accuracy of anomaly detection, a method of clustering first and then anomaly detection is adopted. In this technical solution, the real-time data stream processing method combining the sliding window and the basic window is the core of the data stream processing process, which not only reduces the memory occupancy rate, but also improves the quality of subsequent anomaly detection, cluster center factor and attribution matrix. It is two parameters newly introduced in this technical solution for determining the cluster center and clustering, which can quickly and effectively determine the cluster center of the multi-dimensional data space, and accurately cluster according to the determined cluster center; the enhanced angle anomaly factor is the technology of this technology. Another important parameter in the scheme, which makes up for some of the defects of the traditional anomaly factor, retains the validity of the angle measurement method in the multi-dimensional space, and is the core of the anomaly detection part.
这种方法运用了滑动窗口和基本窗口技术,构造了高效的数据流处理模型,降低了内存的占用率、实时性好、异常检测准确率高、时间复杂度低。This method uses sliding window and basic window technology to construct an efficient data stream processing model, which reduces the memory usage rate, has good real-time performance, high anomaly detection accuracy, and low time complexity.
附图说明Description of drawings
图1为实施例中方法流程示意图;1 is a schematic flowchart of a method in an embodiment;
图2为实施例中t1时刻滑动窗口中数据点分布图的示意图;Fig. 2 is the schematic diagram of the distribution map of data points in the sliding window at time t1 in the embodiment;
图3为实施例中t2时刻滑动窗口中数据点分布图的示意图;Fig. 3 is the schematic diagram of the distribution map of data points in the sliding window at time t 2 in the embodiment;
图4为实施例中滑动窗口和基础窗口结合处理实时数据流及多重验证过程示意图;4 is a schematic diagram of a sliding window and a base window combined to process real-time data streams and multiple verification processes in the embodiment;
图5为实施例中数据点角度度量示意图;Fig. 5 is a schematic diagram of data point angle measurement in an embodiment;
图6为实施例中U型簇数据基于传统角度度量方法误判数据点分布示意图;6 is a schematic diagram of the misjudged data point distribution of U-shaped cluster data based on a traditional angle measurement method in an embodiment;
图7为实施例中多簇数据基于传统角度度量方法误判数据点分布示意图;7 is a schematic diagram of the distribution of misjudged data points based on traditional angle measurement methods for multi-cluster data in an embodiment;
图8为实施例中数据集原始坐标分布示意图;8 is a schematic diagram of the original coordinate distribution of the dataset in the embodiment;
图9为实施例中局部密度-相异度分布示意图;9 is a schematic diagram of local density-dissimilarity distribution in an embodiment;
图10为实施例中簇心因子排序分布示意图;10 is a schematic diagram of the cluster center factor ranking distribution in the embodiment;
图11a为实施例中数据集1分布示意图;11a is a schematic diagram of the distribution of
图11b为实施例中数据集1异常点分布示意图;11b is a schematic diagram of the distribution of abnormal points in
图11c为实施例中数据集1异常检测检测出的异常点标识示意图;11c is a schematic diagram of anomalous point identification detected by anomaly detection of
图11d为实施例中数据集1异常检测将正常点误检为异常点标识示意图;11d is a schematic diagram of anomaly detection of data set 1 in an embodiment where a normal point is mistakenly detected as an abnormal point identification;
图12a为实施例中数据集2分布示意图;Figure 12a is a schematic diagram of the distribution of
图12b为实施例中数据集2实际异常分布示意图;Figure 12b is a schematic diagram of the actual abnormal distribution of
图12c为实施例中数据集2异常检测检测出的异常点标识示意图;12c is a schematic diagram of anomalous point identification detected by anomaly detection of
图12d为实施例中数据集2异常检测将异常点检测为正常点标识示意图。FIG. 12d is a schematic diagram showing the identification of abnormal points detected as normal points by abnormality detection of
具体实施方式Detailed ways
下面结合附图和实施例对本发明内容作进一步的阐述,但不是对本发明的限定。The content of the present invention will be further described below with reference to the accompanying drawings and embodiments, but it is not intended to limit the present invention.
参照图1,一种基于增强型角度异常因子的数据流异常检测及多重验证的方法,包括如下步骤:Referring to FIG. 1, a method for abnormality detection and multiple verification of data flow based on an enhanced angle abnormality factor, comprising the following steps:
1)对实时数据流进行处理:对数据采集终端采集的各式各样的实时数据流进行处理,实时数据流具有动态多变特征,有的数据对象在当前滑动窗口中表现为异常,但在下一时刻的滑动窗口中却表现为正常点,如图2和图3所示,图2为t1时刻滑动窗口数据点的分布图,此时的P′点表现为异常,但随着数据点持续地流入,P′点周围积累了越来越多的数据点,图3为t2时刻滑动窗口数据点的分布图,可以看出此时P′点表现为正常;1) Process real-time data streams: Process various real-time data streams collected by the data acquisition terminal. The real-time data streams have dynamic and changeable characteristics. Some data objects appear abnormal in the current sliding window, but in the following The sliding window at a moment appears to be a normal point, as shown in Figure 2 and Figure 3, Figure 2 is the distribution diagram of the sliding window data points at time t1 , the P' point at this time is abnormal, but with the data points Continuous inflow, more and more data points are accumulated around the P' point. Figure 3 is the distribution diagram of the sliding window data points at time t 2. It can be seen that the P' point is normal at this time;
2)设置滑动窗口中数据集S:步骤1)处理得到当前滑动窗口中数据集S:设S={X1,X2,...,Xn},其中包含n个数据点,每个数据点根据其属性表示为用于后续聚类和异常检测;2) Set the data set S in the sliding window: Step 1) Process to obtain the data set S in the current sliding window: set S = {X 1 , X 2 ,..., X n }, which contains n data points, each Data points are represented according to their properties as For subsequent clustering and anomaly detection;
3)初始化参数k、r、ξ:设定初始化参数,其中k表示数据点的k个最近邻个数,r为数据点的空间邻域半径,ξ为异常判决阈值调整系数,异常判决阈值θ=μ+ξ·δ,其中μ和δ对应所有数据点增强型角度异常因子的均值及标准差;3) Initialization parameters k, r, ξ: set the initialization parameters, where k represents the number of k nearest neighbors of the data point, r is the radius of the spatial neighborhood of the data point, ξ is the adjustment coefficient of the abnormal judgment threshold, and the abnormal judgment threshold θ =μ+ξ·δ, where μ and δ correspond to the mean and standard deviation of the enhanced angle anomaly factor of all data points;
4)获取距离矩阵dist:结合步骤2)中的数据集S,计算所有数据点之间的距离,得到n×n的距离矩阵dist,dist=[dij]n×n,计算公式为公式(1):4) Obtain the distance matrix dist: Combine the data set S in step 2), calculate the distance between all data points, and obtain the distance matrix dist of n×n, dist=[d ij ] n×n , the calculation formula is the formula ( 1):
其中Xi和Yj都是集合S中的数据点,并且xik表示数据点xi有k个属性,yjk表示数据点yj有k个属性; where X i and Y j are both data points in the set S, and x ik indicates that the data point x i has k attributes, and y jk indicates that the data point y j has k attributes;
5)得到r邻域点集合:根据空间邻域半径r,得到每个数据点的r邻域点集合即该点处以邻域半径r为半径,圈住的所有数据点的集合;5) Obtain the r neighborhood point set: according to the spatial neighborhood radius r, obtain the r neighborhood point set of each data point, that is, the set of all data points circled at the point with the neighborhood radius r as the radius;
6)得到r邻域点集合的角度因子和局部密度结合距离矩阵dist得到r邻域点集合的角度因子和r邻域点集合的局部密度其中Nr数据点xi的r邻域;如图5所示,是基于角度的度量思想,计算该数据点与其它每一对数据点所成的角度,再取方差,对于核心区域点A1,它与其它点对所成的角度变化范围很大,故方差大;对于异常点A3,它与其它点对所成的角度变化范围很小,故方差小;而对于边界点A2,它与其它点对所成的角度变化范围介于A1和A3变化范围之间,故方差也介于核心区域点与异常点之间,但这样存在一些缺陷,如图6和图7所示,图6中的异常点B1位于U形簇的中心,它与周围点对所成的角度变化范围大,即方差较大,而边缘点B2与其它点对所成的角度变化范围小,即方差较小;同样的,图7中的异常点D1位于两个簇的中间,该点与两簇间的点对所成的角度变化范围大,而边缘点D2与其它点对所成的角度变化范围较小;这将使得到的结果与实际结果刚好相反,出现漏、误判;6) Get the angle factor of the r neighborhood point set and local density Combine the distance matrix dist to get the angle factor of the r neighborhood point set and r the local density of the set of neighbor points Among them, N r is the r neighborhood of the data point xi ; as shown in Figure 5, it is based on the angle measurement idea, calculate the angle formed by the data point and each other pair of data points, and then take the variance, for the core area point A 1 , the angle formed by it and other point pairs has a large variation range, so the variance is large; for the abnormal point A 3 , the angle formed by it and other point pairs has a small variation range, so the variance is small; and for the boundary point A 2 , the variation range of the angle formed by it and other point pairs is between the variation range of A 1 and A 3 , so the variance is also between the core area point and the abnormal point, but there are some defects, as shown in Figure 6 and Figure 7 As shown, the abnormal point B1 in Figure 6 is located in the center of the U-shaped cluster, and the angle formed by it and the surrounding point pairs has a large variation range, that is, the variance is large, while the angle formed by the edge point B2 and other point pairs varies The range is small, that is, the variance is small; similarly, the abnormal point D 1 in Figure 7 is located in the middle of the two clusters, and the angle formed by this point and the point pair between the two clusters has a large variation range, while the edge point D 2 and other The variation range of the angle formed by the point pair is small; this will make the obtained result just opposite to the actual result, and there will be leakage and misjudgment;
7)获取相异度δ(xi):根据步骤6)中得到的r邻域点集合的局部密度进行排序后,计算对应的相异度δ(xi);7) Obtain the dissimilarity δ(x i ): according to the local density of the r neighborhood point set obtained in step 6) After sorting, calculate the corresponding dissimilarity δ(x i );
8)获取各数据点的簇心因子τ(xi):结合步骤6)和步骤7)获取簇心因子τ(xi),计算公式为公式(5):簇心因子τ(xi)用来度量数据点处于簇心的程度,簇心因子是实施例方法中,用于快速有效确定多维数据空间簇心的改进参数因子,是聚类中至关重要的一步,实现过程如图8、图9、图10所示,可以看出该数据集由两个簇构成,其中点13和点25分别为两个簇的簇心;图9为通过公式(3)和公式(4)所得到的该数据集中各点的ρ-δ(局部密度-相异度)分布示意图,可以看出点13和点25的局部密度和相异度都较大;图10为通过公式(5)所得到的各点的簇心因子降序排序后的分布示意图,可以看出点13和点25的簇心因子最大,因此最有可能为簇心;8) Obtain the cluster center factor τ(x i ) of each data point: Combine step 6) and step 7) to obtain the cluster center factor τ(x i ), and the calculation formula is formula (5): The cluster center factor τ(x i ) is used to measure the degree to which the data points are located in the cluster center. The cluster center factor is an improved parameter factor used to quickly and effectively determine the cluster center in the multidimensional data space in the embodiment method, and is an important factor in clustering. The implementation process is shown in Figure 8, Figure 9, and Figure 10. It can be seen that the data set is composed of two clusters, in which
9)获取归属矩阵:对步骤8)中得到的所有数据点簇心因子,进行降序排序得τ(p1)≥τ(p2)≥…≥τ(pn),其中pn表示对应数据点的原始序号,从而得到用于聚类的归属矩阵F=[f1,f2,...,fn];9) Obtain the attribution matrix: sort all the data point cluster center factors obtained in step 8) in descending order to obtain τ(p 1 )≥τ(p 2 )≥...≥τ(p n ), where pn represents the corresponding data The original sequence number of the point, so as to obtain the attribution matrix F=[f 1 , f 2 ,..., f n ] for clustering;
10)确定簇心并聚类:利用簇心因子和归属矩阵对数据集S进行簇心确定及聚类,将所有类标号相同的数据点组成集合,即簇,得到m(m=Ccenter_id)个簇C1,C2,...,Cm,完成对数据集S的聚类;10) Determine the cluster center and cluster: use the cluster center factor and the attribution matrix to determine and cluster the data set S, and form a set of all data points with the same class label, that is, a cluster, to obtain m (m=C center_id ) cluster C 1 , C 2 ,...,C m to complete the clustering of the dataset S;
11)对聚类后的各簇分别进行异常检测:步骤10)中得到各个簇Ci(i=1,2,…,m),先对聚类后的数据集S中的各个簇C1,C2,...,Cm,分别进行异常检测,得到各簇异常点集合Oi,最终得到数据集S中所有异常点集合O={O1,...,Om},异常检测所涉及的公式有:簇内角度因子为公式(7):11) Perform anomaly detection on each clustered cluster: in step 10), each cluster C i (i=1, 2, ..., m) is obtained, first, each cluster C 1 in the clustered data set S is ,C 2 ,...,C m , carry out anomaly detection respectively, obtain each cluster outlier set O i , and finally obtain all outlier sets in the data set S O={O 1 ,...,O m }, anomaly The formulas involved in the detection are: In-cluster angle factor is formula (7):
其中A、B、C是数据集中的数据点,阿表示第个i簇且簇中每个数据点均有d维属性,是数据点q邻域内的数据点个数,局部增量值H(Xj)为公式(8): where A, B, C are the data points in the dataset, A represents the i-th cluster and each data point in the cluster has a d-dimensional attribute, is the number of data points in the neighborhood of data point q, and the local increment value H(X j ) is formula (8):
k个最近邻点的距离和L(Xj)为公式(9):The distance sum L(X j ) of the k nearest neighbors is formula (9):
其中,表示数据点Xj在其所属簇中的k个最近邻点组成的k邻域;in, Represents the k neighborhood of the k nearest neighbors of the data point X j in the cluster to which it belongs;
增强型角度异常因子EAOF(Xj)为公式(10):The enhanced angular anomaly factor EAOF(X j ) is formula (10):
其中,o是数据点Xj所属簇的簇心,dist(o,Xj)为数据点Xj与其簇心的距离,表示Ci(i=1,2,…,m)簇内每个数据点相对于该簇的角度因子,H(Xj)是局部增量值;Among them, o is the cluster center of the cluster to which the data point X j belongs, dist(o, X j ) is the distance between the data point X j and its cluster center, Represents the angle factor of each data point in the cluster C i (i=1,2,...,m) relative to the cluster, and H(X j ) is the local increment value;
12)多重验证:对所有候选异常点进行多次验证,并将经过有限次验证仍表现为异常的候选异常点判决为确定异常点并输出保存,若在验证过程中表现为正常点,则将该异常点直接丢弃,这样做能增加异常检测准确率的效果。12) Multiple verification: Perform multiple verifications on all candidate abnormal points, and judge the candidate abnormal points that are still abnormal after limited verification as the determined abnormal points and output and save them. If they are normal points during the verification process, they will be The anomaly point is directly discarded, which can increase the effect of anomaly detection accuracy.
步骤1)中所述的处理是指数据采集终端采集的数据以流的形式缓存,并将缓存的数据划分成E0,E1,E2,......数据块,每一个数据块代表一个基础窗口,每个滑动窗口W包含ε=2个基本窗口,采用基础窗口和滑动窗口相结合,实现数据的插入和删除,基础窗口和滑动窗口相结合的过程如图4所示:在Ti时刻过渡到Ti+1时刻,滑动窗口由Wi滑到Wi+1,伴随着新基础窗口Ei+1的并入和历史基础窗口Ei-1的移除,同时,将Ti时刻Wi检测的候选异常点并入到Wi+1中进行多重验证。The processing described in step 1) means that the data collected by the data collection terminal is buffered in the form of a stream, and the buffered data is divided into E 0 , E 1 , E 2 , . . . data blocks, each data block The block represents a basic window, and each sliding window W contains ε=2 basic windows. The combination of the basic window and the sliding window is used to realize the insertion and deletion of data. The process of combining the basic window and the sliding window is shown in Figure 4: At the transition from time T i to time T i +1 , the sliding window is slid from Wi to Wi +1 , along with the merging of the new base window E i+1 and the removal of the historical base window E i-1 , at the same time, The candidate outliers detected by Wi at time T i are incorporated into Wi +1 for multiple verification.
步骤6)中所述的r邻域点集合的角度因子计算公式为公式(2):The calculation formula of the angle factor of the r neighborhood point set described in step 6) is formula (2):
步骤6)中所述的r邻域点集合的局部密度计算公式为公式(3):The formula for calculating the local density of the r neighborhood point set described in step 6) is formula (3):
其中Nr(p)是数据点p的r邻域,q是数据点p的r邻域集合中的任意一个数据点,局部密度与邻域数据点数以及所处的位置有关,邻域数据点数越多、越处于数据集的中心,局部密度就越大。 where N r (p) is the r neighborhood of the data point p, q is any data point in the r neighborhood set of the data point p, the local density is related to the number of neighborhood data points and their location, the number of neighborhood data points The more and the more in the center of the dataset, the greater the local density.
步骤7)中所述的相异度δ(xi)是对所有数据点的局部密度进行降序排序后得到,相异度δ(xi)的计算公式为公式(4):相异度是数据点间属于不同簇的可能性大小的度量,由给定的数据集S,根据步骤6)得到的局部密度,经降序排序后得到其中,{pi}表示局部密度的一个降序后原始下标序号,d(pi,pj)表示数据点pi与pj间的欧氏距离,则某一数据点pi的相异度可定义如下:The dissimilarity δ(x i ) described in step 7) is obtained by sorting the local densities of all data points in descending order, and the calculation formula of the dissimilarity δ(x i ) is formula (4): the dissimilarity is A measure of the likelihood that the data points belong to different clusters, from the given data set S, according to the local density obtained in step 6), obtained after sorting in descending order where {pi } represents the local density A descending original index number of , d(pi , p j ) represents the Euclidean distance between data points p i and p j , then the dissimilarity of a data point p i can be defined as follows:
其中pi和pj是对应数据点的序号,当i=1时,j≥2;当i≥2时,j<i。 Where p i and p j are the serial numbers of the corresponding data points, when i=1, j≥2; when i≥2, j<i.
步骤9)中所述的归属矩阵F=[f1,f2,...,fn],是用于记录数据点之间归属关系,各元素的表示公式为公式(6):The attribution matrix F=[f 1 , f 2 , .
其中,{pi}表示簇心因子τ(xi)降序排序后的原始下标序号。Among them, {pi } represents the original index number after the cluster center factor τ(x i ) is sorted in descending order.
步骤10)中所述的确定簇心并聚类是指先定义簇心序号为Ccenter_id,数据点的类标号为Ccluster_label,并初始化簇心序号为1,即Ccenter_id=1;同时将步骤8)中获取的簇心因子最大的数据点的类标号为1,即再根据步骤8)中得到的降序下标序号{pi},对整个数据集S进行条件遍历,若与所有点的距离均满足(其中,r是初始参数值邻域半径),则将该点重新定义为一个新的簇心,该点类标号增加1,据此得到所有的簇心;然后根据得到的簇心,再利用步骤9)中的归属矩阵F=[f1,f2,...,fn],对隶属于同一簇心的点贴上同一标签(即类标号),方法如下:通过步骤9)中得到的降序下标序号{pi},对整个数据集S进行条件遍历,如果pi非簇心,根据归属矩阵将对应的标签赋给pi,否则pi的标签就是其本身,最终将所有类标号相同的数据点组成集合,即簇,得到m(m=Ccenter_id)个簇C1,C2,...,Cm,完成对数据集S的聚类。Determining the cluster center and clustering as described in step 10) refers to first defining the cluster center sequence number as C center_id , the class label of the data point as C cluster_label , and initializing the cluster center sequence number as 1, that is, C center_id = 1; ), the class label of the data point with the largest cluster centroid factor is 1, that is, Then, according to the descending subscript number {pi } obtained in step 8), conditionally traverse the entire data set S, if and The distances of all points satisfy (where r is the neighborhood radius of the initial parameter value), then redefine the point as a new cluster center, increase the point class label by 1, and obtain all the cluster centers accordingly; and then use the obtained cluster center to use The attribution matrix F=[f 1 , f 2 , . The obtained descending subscript number {pi } is used to conditionally traverse the entire data set S. If pi is not a cluster center, according to the attribution matrix, The corresponding label is assigned to p i , otherwise the label of p i is itself, and finally all data points with the same class label are formed into a set, that is, a cluster, and m (m=C center_id ) clusters C 1 , C 2 , .. .,C m , completes the clustering of the dataset S.
步骤11)中所述是对聚类后的各个簇分别进行异常检测,异常检测的具体步骤如下:As described in step 11), anomaly detection is performed on each cluster after clustering, and the specific steps of anomaly detection are as follows:
①对于任意簇Ci(i=1,2,…,m),计算该簇内每个数据点相对于该簇的角度因子①For any cluster C i (i=1,2,...,m), calculate the angle factor of each data point in the cluster relative to the cluster
为公式(7): is formula (7):
其中,Ci(i=1,2,…,m)表示聚类后任意簇;Among them, C i (i=1,2,...,m) represents any cluster after clustering;
②计算簇中每个数据点相对于其空间r邻域内的局部增量值H(Xj)为公式(8):② Calculate the local increment value H(X j ) of each data point in the cluster relative to its spatial r neighborhood as formula (8):
局部增量是为了反映数据点在其所属簇空间邻域内数据点的密集程度,其中,表示了数据点Xj在其所属簇的r邻域内的数据点数The local increment is to reflect the density of data points in the spatial neighborhood of the cluster to which they belong, where, represents the r neighborhood of the data point X j in the cluster to which it belongs number of data points within
③根据步骤10)中确认的簇心,计算各数据点到所属簇的簇心之间的距离dist(o,Xj);③ According to the cluster center confirmed in step 10), calculate the distance dist(o, X j ) between each data point and the cluster center of the cluster to which it belongs;
④计算每个数据点与其k个最近邻点的距离和L(Xj)为公式(9):④ Calculate the distance and L(X j ) of each data point and its k nearest neighbors as formula (9):
其中,表示数据点Xj在其所属簇中的k个最近邻点组成的k邻域,k个最近邻点的距离和L(Xj)反映了数据点与周围数据点远近程度,这样是为了避免基于角度的异常因子出现类似于图6中B1存在的缺陷;in, Indicates the k neighborhood of the k nearest neighbors of the data point X j in the cluster to which it belongs. The distance and L(X j ) of the k nearest neighbors reflect the distance between the data point and the surrounding data points. This is to avoid The angle-based anomaly factor appears similar to the defect in B 1 in Figure 6;
⑤计算每个数据点的增强型角度异常因子EAOF(Xj)为公式(10):⑤ Calculate the enhanced angular anomaly factor EAOF(X j ) for each data point as formula (10):
其中,o是数据点Xj所属簇的簇心,dist(o,Xj)为数据点Xj与其簇心的距离,表示Ci(i=1,2,…,m)簇内每个数据点相对于该簇的角度因子,H(Xj)是局部增量值;增强型角度异常因子EAOF不仅具备角度度量方式在多维空间中优越的度量性能,而且引入了距离和密度的思想,弥补了传统的基于角度异常因子的缺陷;Among them, o is the cluster center of the cluster to which the data point X j belongs, dist(o, X j ) is the distance between the data point X j and its cluster center, Represents the angle factor of each data point in the C i (i=1,2,...,m) cluster relative to the cluster, and H(X j ) is the local incremental value; the enhanced angle anomaly factor EAOF not only has an angle measurement method Excellent measurement performance in multi-dimensional space, and the idea of distance and density is introduced, which makes up for the defects of traditional angle-based anomaly factors;
⑥由⑤中得到的所有数据点增强型角度异常因子,计算其均值μ及标准差δ,并用均值和标准差计算异常判决阈值θ,θ=μ+ξ·δ,其中ξ为初始设定的异常判决阈值调整系数;⑥ Calculate the mean value μ and standard deviation δ of the enhanced angle anomaly factor of all data points obtained in ⑤, and use the mean value and standard deviation to calculate the abnormal judgment threshold θ, θ=μ+ξ·δ, where ξ is the initial setting Abnormal judgment threshold adjustment coefficient;
⑦将⑤中得到的各点增强型角度异常因子EAOF(Xj)与⑥中得到判决阈值θ进行比较,若满足EAOF(Xj)>θ,则将该点标记为该簇中的候选异常对象,并存入本簇的候选异常点集合Oi中。⑦Compare the enhanced angular anomaly factor EAOF(X j ) obtained in ⑤ with the decision threshold θ obtained in ⑥. If EAOF(X j )>θ is satisfied, mark the point as a candidate anomaly in the cluster object, and stored in the candidate outlier set O i of this cluster.
本实施例提供了一种基于增强型角度异常因子的数据流异常检测及多重验证的方法,该方法采用了滑动窗口和基本窗口相结合的技术,构建了高效的实时数据流处理技术,同时引入了增强型角度异常因子,在解决传统方法内存占用率高,处理数据效率低的同时,又保证了方法的实时性好,异常检测准确率高及时间复杂度较低的优点。This embodiment provides a method for data stream anomaly detection and multiple verification based on an enhanced angle anomaly factor. The method adopts the technology of combining sliding window and basic window to construct an efficient real-time data stream processing technology. At the same time, it introduces The enhanced angle anomaly factor is adopted, which not only solves the traditional method's high memory occupancy rate and low data processing efficiency, but also ensures the method's good real-time performance, high anomaly detection accuracy and low time complexity.
为了验证本实施例方法的有效性,将通过仿真结果对比进一步说明:In order to verify the effectiveness of the method of this embodiment, it will be further explained by comparing the simulation results:
本实施例在人工生成的数据集和真实数据集中均进行了验证,并与传统方法I-IncLOF和Thakran等人提出的基于加权聚类的数据流非监督异常检测方法(简记为方法I)进行了对比,实验数据集信息如表1所示,表1为实验数据集信息,三个数据集分别是具有不同维度、不同数据量、不同数据特性的数据集。This embodiment has been verified in both artificially generated datasets and real datasets, and is in agreement with the traditional method I-IncLOF and the weighted clustering-based data stream unsupervised anomaly detection method (referred to as Method I) proposed by Thakran et al. For comparison, the experimental data set information is shown in Table 1. Table 1 is the experimental data set information. The three data sets are data sets with different dimensions, different data amounts, and different data characteristics.
人工数据集1的数据分布如图11a所示,共有1615个数据点,由5个簇和15个离散点组成,其中簇1是由高斯分布N1(u1,Σ1)生成的500个数据点组成,簇2是由高斯分布N2(u2,Σ2)生成的500个数据点组成,簇3是由高斯分布N3(u3,Σ3)生成的500个数据点组成,簇4和簇5分别是由高斯分布N4(u4,Σ4)和N5(u5,Σ5)生成的50个数据点组成,且由于N4和N5数据点数极少,因此被视作异常簇。同时,根据数据集的分布特性,随机生成了15个离散的异常点,因此该数据集共包含115个异常点,分布情况见图11b,异常点用圆圈标出,在实验过程中,异常簇和离散异常点随机混入正常簇中,以下是高斯分布生成数据集1所使用的参数:The data distribution of
μ1=[+1 +1],μ2=[-1 -1],μ3=[+1 -1],μ4=[-1 +1],μ5=[0 0]μ 1 =[+1 +1], μ 2 =[-1 -1], μ 3 =[+1 -1], μ 4 =[-1 +1], μ 5 =[0 0]
人工数据集2的数据分布如图12a所示,共有860个数据点,由3个正常簇和1个异常簇以及48个离散异常点组成,其中异常簇由21个异常点组成。因此该数据集共有69个异常点,异常点的分布情况如图12b所示。The data distribution of
真实数据集Breast Cancer如表1所示,该数据集来源于UCI机器学习库,包含699个数据点,有两个正常簇组成,其中,为了验证方法的有效性,本实施例根据均值、方差等统计特性向真实数据集中添加了34个异常点,用于异常检测的对比验证。The real data set Breast Cancer is shown in Table 1. The data set comes from the UCI machine learning library, contains 699 data points, and consists of two normal clusters. Statistical features such as 34 anomaly points are added to the real dataset for comparative verification of anomaly detection.
在本实施例方法的验证实验中,设置基础窗口的长度为20,两个基础窗口组成一个滑动窗口,且最近邻点数k=3,空间邻域半径定为当前时刻滑动窗口中数据点间距离值降序排列的前20%距离值的均值,异常判决阈值调整系数定为2.5,多重验证的次数定为3次,同时,选取最能体现异常检测方法有效性的检测率和误判率进行比较,如图11a-11d、图12a-12d所示,为数据集1和数据集2的可视化实验结果。In the verification experiment of the method of this embodiment, the length of the base window is set to 20, two base windows form a sliding window, and the number of nearest neighbors k=3, and the radius of the spatial neighborhood is set as the distance between data points in the sliding window at the current moment The average value of the top 20% of the distance values in descending order, the abnormal judgment threshold adjustment coefficient is set to 2.5, and the number of multiple verifications is set to 3 times. , as shown in Figures 11a-11d and Figures 12a-12d, which are the visual experimental results of
对于人工数据集1,从图11a-11d可以看出,运用该方法,2个异常簇和15个离散异常点都能被有效地检测出来,达到了零漏检的效果,从图11d可知,有3个正常点被错检为异常点,原因是这些正常点虽然是由正常的高斯分布生成的,但偏离正常簇稍远,在连续3次的多重验证中均表现为异常,因此被判定为异常点;For artificial data set 1, it can be seen from Figure 11a-11d that using this method, 2 abnormal clusters and 15 discrete abnormal points can be effectively detected, achieving the effect of zero missed detection. From Figure 11d, it can be seen that, There are 3 normal points that were wrongly detected as abnormal points. The reason is that although these normal points are generated by normal Gaussian distribution, they deviate from the normal cluster a little bit, and they are abnormal in 3 consecutive multiple verifications, so they are judged. is an abnormal point;
对于人工数据集2,从图12a-12d可以看出,在三维数据空间中,该方法仍然能保持很好的有效性,从图12b、12c、12d可以看出,异常簇中的点均能被检测出来,而在48个离散异常点中有47个被检测出来,有一个离散异常点漏检,漏检的原因是该漏检点离正常簇距离较近,使得在多重验证中某次表现为正常,因此被判定为正常点。For
在验证本实施例方法有效性的同时,还将本实施例方法与传统方法进行了对比,进一步验证了本实施例方法的优势,如表2所示,表2为实验结果统计信息,三个数据集上对三种方法进行对比实验的详细统计结果。从表2可以看出,本实施例提出的方法检测率高,误判率低,有效性明显优于另外两种方法,而且当数据集的维数越高时该方法的优越性越明显,方法I结合了W-K-Means与DBSCAN方法,并动态更新DBSCAN所需的参量和每一维的权重,因此方法I对动态数据流有很好的适应性,但由于其运用的是传统的基于距离和密度的异常度量方式,因此当维数增加时,有效性会降低;而I-IncLOF方法运用的是基于局部密度的思想,同样会受到维度灾难的影响,数据维数较低时其表现较好,但维数增多时,有效性较差。While verifying the effectiveness of the method in this embodiment, the method in this embodiment is compared with the traditional method to further verify the advantages of the method in this embodiment. As shown in Table 2, Table 2 is the statistical information of the experimental results. Detailed statistical results of comparative experiments on the three methods on the dataset. It can be seen from Table 2 that the method proposed in this example has a high detection rate, a low false positive rate, and is significantly more effective than the other two methods. Moreover, when the dimension of the data set is higher, the superiority of this method is more obvious. Method I combines the W-K-Means and DBSCAN methods, and dynamically updates the parameters required by DBSCAN and the weight of each dimension. Therefore, method I has good adaptability to dynamic data flow, but because it uses the traditional distance-based method Therefore, when the dimension increases, the effectiveness will be reduced; while the I-IncLOF method uses the idea of local density, which is also affected by the disaster of dimensionality, and its performance is better when the data dimension is low. Good, but less effective as the number of dimensions increases.
通过以上对不同数据集的验证及与传统方法的对比分析,可见,本实施例提出的基于增强型角度异常因子的数据流异常检测及多重验证的方法具有更好的有效性和可行性。Through the above verification of different data sets and comparative analysis with traditional methods, it can be seen that the method for data flow anomaly detection and multiple verification based on the enhanced angle anomaly factor proposed in this embodiment has better effectiveness and feasibility.
表1Table 1
表2Table 2
Claims (6)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201710823063.0A CN107682319B (en) | 2017-09-13 | 2017-09-13 | A method for data flow anomaly detection and multiple verification based on enhanced angle anomaly factor |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201710823063.0A CN107682319B (en) | 2017-09-13 | 2017-09-13 | A method for data flow anomaly detection and multiple verification based on enhanced angle anomaly factor |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN107682319A CN107682319A (en) | 2018-02-09 |
| CN107682319B true CN107682319B (en) | 2020-07-03 |
Family
ID=61136410
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201710823063.0A Expired - Fee Related CN107682319B (en) | 2017-09-13 | 2017-09-13 | A method for data flow anomaly detection and multiple verification based on enhanced angle anomaly factor |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN107682319B (en) |
Families Citing this family (20)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110311879B (en) * | 2018-03-20 | 2022-02-22 | 重庆邮电大学 | A Data Stream Anomaly Recognition Method Based on Random Projection Angle Distribution |
| CN108667684B (en) * | 2018-03-30 | 2021-04-30 | 桂林电子科技大学 | Data flow anomaly detection method based on local vector dot product density |
| CN109344171A (en) * | 2018-12-21 | 2019-02-15 | 中国计量大学 | A Saliency Mining Method for Characteristic Variables of Nonlinear Systems Based on Data Stream Processing |
| CN109978070A (en) * | 2019-04-03 | 2019-07-05 | 北京市天元网络技术股份有限公司 | A kind of improved K-means rejecting outliers method and device |
| CN112800101B (en) * | 2019-11-13 | 2024-09-17 | 中国信托登记有限责任公司 | Abnormal behavior detection method based on FP-growth algorithm and model applying method |
| CN111125470A (en) * | 2019-12-25 | 2020-05-08 | 成都康赛信息技术有限公司 | Method for improving abnormal data mining and screening |
| CN111680751B (en) * | 2020-06-09 | 2023-05-30 | 南京农业大学 | An Algorithm for Detecting Abnormal Data in Grain Yield Map |
| CN112286951B (en) * | 2020-11-26 | 2025-01-10 | 杭州数梦工场科技有限公司 | Data detection method and device |
| CN112381181B (en) * | 2020-12-11 | 2022-10-04 | 桂林电子科技大学 | Dynamic detection method for building energy consumption abnormity |
| CN113225391B (en) * | 2021-04-27 | 2022-11-08 | 东莞中山大学研究院 | Atmospheric environment monitoring quality monitoring method based on sliding window anomaly detection and computing equipment |
| CN113537061B (en) * | 2021-07-16 | 2024-03-26 | 中天通信技术有限公司 | Format identification method, device and storage medium for two-dimensional quadrature amplitude modulated signals |
| CN114925737A (en) * | 2021-12-27 | 2022-08-19 | 天翼数字生活科技有限公司 | Fitness proportion sharing-based stream data clustering method and system |
| CN114490603B (en) * | 2022-01-11 | 2024-09-24 | 燕山大学 | Mechanical flow type data cleaning method based on increment local abnormal factors |
| CN115271003B (en) * | 2022-09-30 | 2023-01-03 | 江苏云天新材料制造有限公司 | Abnormal data analysis method and system for automatic environment monitoring equipment |
| CN116089846B (en) * | 2023-04-03 | 2023-07-25 | 北京智蚁杨帆科技有限公司 | New energy settlement data anomaly detection and early warning method based on data clustering |
| CN116502169B (en) * | 2023-06-28 | 2023-08-22 | 深圳特力自动化工程有限公司 | Centrifugal dehydrator working state detection method based on data detection |
| CN116628729B (en) * | 2023-07-25 | 2023-09-29 | 天津市城市规划设计研究总院有限公司 | Method and system for improving data security according to data characteristic differentiation |
| CN117313957B (en) * | 2023-11-28 | 2024-02-27 | 威海华创软件有限公司 | Intelligent prediction method for production flow task amount based on big data analysis |
| CN118764440A (en) * | 2024-08-01 | 2024-10-11 | 广东技术师范大学 | A method, system and storage medium for detecting abnormality of streaming data |
| CN120910728B (en) * | 2025-10-11 | 2025-12-05 | 江苏光来科技有限公司 | Intelligent analysis method for operation data of numerical control machine tool |
Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN103336906A (en) * | 2013-07-15 | 2013-10-02 | 哈尔滨工业大学 | Sampling GPR method of continuous anomaly detection in collecting data flow of environment sensor |
| CN103974311A (en) * | 2014-05-21 | 2014-08-06 | 哈尔滨工业大学 | Condition monitoring data stream anomaly detection method based on improved gaussian process regression model |
| CN104283737A (en) * | 2014-09-30 | 2015-01-14 | 杭州华为数字技术有限公司 | Data flow processing method and device |
| CN104809594A (en) * | 2015-05-13 | 2015-07-29 | 中国电力科学研究院 | Distribution network data online cleaning method based on dynamic outlier detection |
| CN104902509A (en) * | 2015-05-19 | 2015-09-09 | 浙江农林大学 | Abnormal data detection method based on top-k(sigma) algorithm |
| CN106506556A (en) * | 2016-12-29 | 2017-03-15 | 北京神州绿盟信息安全科技股份有限公司 | A kind of network flow abnormal detecting method and device |
Family Cites Families (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US7809513B2 (en) * | 2007-04-16 | 2010-10-05 | Acellent Technologies, Inc. | Environmental change compensation in a structural health monitoring system |
-
2017
- 2017-09-13 CN CN201710823063.0A patent/CN107682319B/en not_active Expired - Fee Related
Patent Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN103336906A (en) * | 2013-07-15 | 2013-10-02 | 哈尔滨工业大学 | Sampling GPR method of continuous anomaly detection in collecting data flow of environment sensor |
| CN103974311A (en) * | 2014-05-21 | 2014-08-06 | 哈尔滨工业大学 | Condition monitoring data stream anomaly detection method based on improved gaussian process regression model |
| CN104283737A (en) * | 2014-09-30 | 2015-01-14 | 杭州华为数字技术有限公司 | Data flow processing method and device |
| CN104809594A (en) * | 2015-05-13 | 2015-07-29 | 中国电力科学研究院 | Distribution network data online cleaning method based on dynamic outlier detection |
| CN104902509A (en) * | 2015-05-19 | 2015-09-09 | 浙江农林大学 | Abnormal data detection method based on top-k(sigma) algorithm |
| CN106506556A (en) * | 2016-12-29 | 2017-03-15 | 北京神州绿盟信息安全科技股份有限公司 | A kind of network flow abnormal detecting method and device |
Non-Patent Citations (1)
| Title |
|---|
| 基于聚类的异常挖掘算法研究;苏晓珂;《中国博士学位论文全文数据库信息科技辑》;20110815(第08期);全文 * |
Also Published As
| Publication number | Publication date |
|---|---|
| CN107682319A (en) | 2018-02-09 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN107682319B (en) | A method for data flow anomaly detection and multiple verification based on enhanced angle anomaly factor | |
| CN108667684B (en) | Data flow anomaly detection method based on local vector dot product density | |
| CN112381181B (en) | Dynamic detection method for building energy consumption abnormity | |
| CN106383877B (en) | Social media online short text clustering and topic detection method | |
| CN110046665A (en) | Based on isolated two abnormal classification point detecting method of forest, information data processing terminal | |
| CN106021578B (en) | A kind of modified text classification algorithm based on cluster and degree of membership fusion | |
| CN106408939A (en) | Traffic flow sequence classification method based on density peak value clustering | |
| CN102176698A (en) | Method for detecting abnormal behaviors of user based on transfer learning | |
| CN105046214A (en) | On-line multi-face image processing method based on clustering | |
| CN108805213B (en) | Power load curve double-layer spectral clustering method considering wavelet entropy dimensionality reduction | |
| CN105718960A (en) | Image ordering model based on convolutional neural network and spatial pyramid matching | |
| CN110070121A (en) | A kind of quick approximate k nearest neighbor method based on tree strategy with balance K mean cluster | |
| CN108427713A (en) | A kind of video summarization method and system for homemade video | |
| CN114398350A (en) | Cleaning method and device for training data set and server | |
| CN108280236A (en) | A kind of random forest visualization data analysing method based on LargeVis | |
| CN112347842B (en) | Offline face clustering method based on association graph | |
| CN109886334A (en) | A kind of shared nearest neighbor density peak clustering method of secret protection | |
| CN115374851A (en) | Gas data anomaly detection method and device | |
| CN118552812B (en) | Active processing method, device, equipment and readable storage medium for mass point cloud | |
| CN116524723B (en) | Truck track anomaly identification method and system | |
| CN110458094B (en) | Equipment classification method based on fingerprint similarity | |
| CN109858507B (en) | A Rare Subsequence Mining Method for Multidimensional Time Series Data Applied to Air Pollution Control | |
| CN115604027A (en) | Network fingerprint recognition model training method, recognition method, device and storage medium | |
| CN108717444A (en) | A kind of big data clustering method and device based on distributed frame | |
| Purnawansyah et al. | K-Means clustering implementation in network traffic activities |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||
| GR01 | Patent grant | ||
| CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20200703 |
|
| CF01 | Termination of patent right due to non-payment of annual fee |