
CN110311879B - A Data Stream Anomaly Recognition Method Based on Random Projection Angle Distribution - Google Patents


Info

Publication number
CN110311879B
Authority
CN
China
Prior art keywords
data
data set
random
sample
method based
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810227777.XA
Other languages
Chinese (zh)
Other versions
CN110311879A (en)
Inventor
朴昌浩
戴冲
马艺玮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications
Priority to CN201810227777.XA
Publication of CN110311879A
Application granted
Publication of CN110311879B
Legal status: Active
Anticipated expiration

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00: Arrangements for monitoring or testing data switching networks
    • H04L43/06: Generation of reports
    • H04L43/062: Generation of reports related to network traffic
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00: Network architectures or network communication protocols for network security
    • H04L63/14: Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408: Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1425: Traffic logging, e.g. anomaly detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Complex Calculations (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A data stream anomaly identification method based on random projection angle distribution is provided for rapidly detecting potential anomalous points in a data stream. Based on an analysis of the characteristics of data streams, the method quickly computes the anomaly factor value corresponding to each data item in the stream. A data stream is characterized by large volume, continuity, high speed, and transience. To accommodate these characteristics, a random projection method is combined with an angle-based anomaly detection method to quickly detect anomalous points in the data stream, and a dynamic sliding window method is then used to improve the detection accuracy and adaptability of the algorithm. The method can effectively improve the detection efficiency and accuracy of the algorithm, and it provides a theoretical basis for rapid, real-time detection of anomalous points in data streams.

Description

Data stream anomaly identification method based on random projection angle distribution
Technical Field
The invention relates to data mining and anomalous point detection, and in particular to a data stream anomaly identification method based on random projection angle distribution.
Background
In the era of big data, massive amounts of data are generated continuously in everyday life. Mining anomalous data, that is, finding data objects that deviate from normal data, is one of the important tasks in the field of data mining. Anomaly detection is widely applied in fields such as information security, intrusion detection, and financial security, and has attracted great interest from researchers. How to find valuable anomalous data effectively and quickly is therefore an urgent and highly meaningful research direction.
Researchers have proposed many anomalous point detection methods, mainly statistics-based, distance-based, and density-based methods. Statistics-based methods depend largely on whether the data set fits some probability distribution model. Distance-based methods work only on low-dimensional data sets and are not suitable for high-dimensional data sets. Density-based methods require many parameters, and improper parameter selection greatly degrades detection accuracy. Consequently, the traditional algorithms cannot be applied well to high-dimensional data.
Given the current state of research, traditional anomalous point detection methods are difficult to adapt to high-dimensional data streams. A data stream is characterized by large volume, continuity, high speed, and transience. To accommodate these characteristics, the invention provides a data stream anomaly identification method based on random projection angle distribution, which can identify anomalous points in a data stream quickly and accurately.
Disclosure of Invention
In view of the problems described in the background, the invention provides a method for identifying anomalous points in high-dimensional data streams, to solve the problem that traditional anomalous point detection methods are not suitable for a high-dimensional data stream model.
The technical scheme adopted by the invention comprises the following steps:
Step (1): collect data from the data stream in real time, obtain an initial data set sample X whose size equals the initial sliding window, and preprocess the data set sample X;
Step (2): take random vectors $i_1, \dots, i_t \in I^d$ and construct a random projection matrix, where the coordinates of each random vector follow the standard normal distribution N(0,1), and project the data set sample X onto the hyperplane orthogonal to the random vectors; the projected data set is X*;
Step (3): analyze each data item in X* by combining the random projection method with the angle method, and obtain the anomaly factor value of each data item;
Step (4): analyze the distribution of the current data set sample X according to the anomaly factor values of the elements in the projected data set X* of the current window, and calculate the density G of the window data set sample X; if the density G at the current moment is greater than the set parameter L1, the data set sample X is densely distributed and m data items are removed from the current window; if G is smaller than the set parameter L2, the data set sample X is sparsely distributed and the latest m data items in the history window are added to the data set sample X in the current window;
Step (5): update the data set sample X and the sliding window size, output the anomalous points, return to step (2), and continue detecting anomalous points.
Further, in step (1), data in the data stream are collected in real time and stored sequentially in the data set sample X. Once the data set sample X is full, the data elements are preprocessed so that non-standardized data do not distort the algorithm. The preprocessing comprises median standardization and normalization.
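The preprocessing in step (1) can be sketched as follows. The text names median standardization and normalization without defining either, so this sketch assumes one common reading: subtract the per-feature median and divide by the median absolute deviation, then rescale each feature to [0, 1]. The function name `preprocess` and both formulas are illustrative assumptions, not the patent's definitions.

```python
import numpy as np

def preprocess(X):
    """Median-standardize each column, then min-max rescale it to [0, 1].

    Assumption: "median standardization" means (x - median) / MAD and
    "normalization" means min-max scaling; the source does not define them.
    """
    X = np.asarray(X, dtype=float)
    med = np.median(X, axis=0)
    mad = np.median(np.abs(X - med), axis=0)
    mad = np.where(mad == 0, 1.0, mad)           # avoid division by zero
    Z = (X - med) / mad
    lo, hi = Z.min(axis=0), Z.max(axis=0)
    span = np.where(hi - lo == 0, 1.0, hi - lo)  # constant columns map to 0
    return (Z - lo) / span
```

Using the median rather than the mean keeps a few extreme values in the window from dominating the scale, which seems to fit a setting where the window is expected to contain anomalous points.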
Further, step (2) comprises the following sub-steps:
Step (21): take random vectors $i_1, \dots, i_t \in I^d$ and construct a random projection matrix, where the coordinates of each vector are drawn independently from the standard normal distribution N(0,1);
Step (22): project the data set sample X onto the hyperplane orthogonal to the random vectors to obtain the data set X*.
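A minimal sketch of steps (21) and (22), under a stated assumption: for a random vector r, the hyperplane orthogonal to r that passes through a point p separates the remaining points according to whether their dot product with r is smaller or larger than that of p, so the sketch stores the dot products and derives the per-point side counts from them. The names `random_projections` and `side_counts` are hypothetical.

```python
import numpy as np

def random_projections(X, t, seed=0):
    """Draw t random vectors with N(0,1) coordinates and project X onto them.

    Returns (R, P): R has shape (t, d), P = X @ R.T has shape (n, t).
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    R = rng.standard_normal((t, X.shape[1]))
    return R, X @ R.T

def side_counts(proj_col):
    """For one projection, count the points on each side of every point.

    Assumes no two projected values are equal (true with probability 1
    for continuous data).
    """
    order = np.argsort(proj_col, kind="stable")
    rank = np.empty_like(order)
    rank[order] = np.arange(len(proj_col))
    left = rank                          # points with a smaller projection
    right = len(proj_col) - 1 - rank     # points with a larger projection
    return left, right
```

Sorting once per projection gives the side counts for all n points in O(n log n), which is the source of the reduced time overhead compared with enumerating all point pairs.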
Further, in step (3), based on the data partition after projection, the angle anomaly factor value of each element in the data set X* is approximated using the anomalous point detection method based on angle distribution. If the anomaly factor value is greater than the set threshold T, the element is judged to be an anomalous point; otherwise it is a normal point. The formulas of the angle-distribution-based detection method are as follows:
$$F(p) = \mathrm{Var}[\Theta_{apb}] = F_2(p) - (F_1(p))^2 \tag{1}$$
[Formula (2): equation image not reproduced in this text]
[Formula (3): equation image not reproduced in this text]
where F(p) in formula (1) denotes the anomaly factor value of the point p. In formula (2), the two sets (symbols given as images, not reproduced) consist of the points lying on either side of the point p in a random projection. In formula (3), the two quantities (symbols given as images, not reproduced) denote the product domain of p and the points on its two sides, n denotes the number of data elements in the data set sample X, and t denotes the number of random projection vectors.
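Formula (1) defines the anomaly factor as the variance of the angles Θ_apb, that is, the second moment minus the squared first moment. The sketch below computes that variance exactly by enumerating all pairs, as a reference point only. It also estimates the first moment from random projections using a standard identity: two points a and b fall on opposite sides of a random hyperplane through p with probability Θ_apb/π. That estimator is an assumption about what formula (2) may look like, not a transcription of it, and the function names are hypothetical.

```python
import numpy as np
from itertools import combinations

def angle_variance(X, p_idx):
    """Exact Var[Theta_apb] and mean angle for point p (brute force, O(n^2)).

    Assumes no other point coincides with p.
    """
    X = np.asarray(X, dtype=float)
    diffs = np.delete(X, p_idx, axis=0) - X[p_idx]
    units = diffs / np.linalg.norm(diffs, axis=1, keepdims=True)
    angles = np.array([
        np.arccos(np.clip(units[i] @ units[j], -1.0, 1.0))
        for i, j in combinations(range(len(units)), 2)
    ])
    f1 = angles.mean()           # first moment
    f2 = (angles ** 2).mean()    # second moment
    return f2 - f1 ** 2, f1

def mean_angle_estimate(X, p_idx, t=2000, seed=0):
    """Estimate the mean angle at p from t random projections.

    Uses E[|L| * |R|] = (n-1)(n-2)/2 * mean_angle / pi, where L and R are
    the points on either side of p in one projection.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    proj = X @ rng.standard_normal((t, d)).T     # shape (n, t)
    left = (proj < proj[p_idx]).sum(axis=0)      # |L| per projection
    right = (proj > proj[p_idx]).sum(axis=0)     # |R| per projection
    return 2 * np.pi * np.mean(left * right) / ((n - 1) * (n - 2))
```

For the four points (0,0), (1,0), (0,1), (-1,0) with p at the origin, the three angles are π/2, π, and π/2, so the mean is 2π/3 and the variance is π²/18. The sketch deliberately leaves the comparison against the threshold T to the caller.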
Further, step (4) changes the size of the sliding window by analyzing the distribution of the data set according to the anomaly degree of the current data set sample X. When the data set sample X is densely distributed, the data set in the current window is reduced; when the data set sample X is sparsely distributed, the data set in the current window is enlarged. The formula for judging the distribution of the data set is as follows:
[Formula (4): equation image not reproduced in this text]
where S in formula (4) denotes the size of the data set in the current sliding window, the first symbol (given as an image, not reproduced) denotes the anomaly values of the elements within the sliding window at the current moment, and the second symbol (given as an image, not reproduced) denotes the anomaly values of the elements within the sliding window at the historical moment.
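The resizing rule of step (4) can be sketched as below. Because formula (4) is not available in this text, the density G is taken as an input computed elsewhere. The text does not say which m items are removed from a dense window, so the sketch assumes the m oldest ones; the function name `adjust_window` and that choice are assumptions.

```python
def adjust_window(window, history, G, L1, L2, m):
    """Return the resized current window as a new list.

    window  : items in the current sliding window, oldest first
    history : items that already left the window, oldest first
    G       : density of the current window (formula (4), computed elsewhere)
    L1, L2  : upper and lower density thresholds, with L1 > L2 assumed
    m       : number of items to remove or to bring back
    """
    window = list(window)
    history = list(history)
    if G > L1:
        # Dense window: shrink it. Dropping the m oldest items is an
        # assumption; the source only says m items are removed.
        return window[m:]
    if G < L2:
        # Sparse window: prepend the latest m items of the history window.
        recent = history[-m:] if m > 0 else []
        return recent + window
    return window  # density within [L2, L1]: keep the window unchanged
```

For example, with `window=[4, 5, 6, 7]`, `history=[1, 2, 3]`, `L1=0.8`, `L2=0.2`, and `m=2`, a density of 0.9 yields `[6, 7]` and a density of 0.1 yields `[2, 3, 4, 5, 6, 7]`.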
Further, after step (5) updates the data set sample X and the sliding window, the above steps are repeated to realize anomaly detection on the high-dimensional data stream.
When detecting anomalous points in a data stream, the high-dimensional data stream is characterized by large volume, continuity, high speed, and transience. Compared with other existing patents, the method disclosed by the invention uses random projection to reduce the computational complexity of the angle-distribution-based anomalous point detection method, which reduces the time overhead of the algorithm. It also analyzes the distribution of the data set in the current window and dynamically adjusts the size of the data set in the sliding window, which improves the adaptability and detection accuracy of the algorithm.
Drawings
To make the object, technical scheme, and beneficial effects of the invention clearer, the following drawings are provided for explanation:
Fig. 1 is a flowchart of the data stream anomaly identification method based on random projection angle distribution according to the invention.
Fig. 2 is a flowchart of the sliding window adjustment in step (4) of the invention.
Detailed Description
The invention will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present invention are shown in the drawings, it should be understood that the present invention may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
The fast identification method of high-dimensional data stream anomalies according to one embodiment of the present invention is described in detail below with reference to fig. 1 and 2. The method comprises the following steps:
Step (1): collect data from the data stream in real time and store them sequentially in the data set sample X. Once the data set sample X is full, preprocessing of the data elements begins, so that non-standardized data do not distort the algorithm. The preprocessing comprises median standardization and normalization.
Step (2): select projection vectors and project the data set, specifically comprising the following sub-steps:
Step (21): take random vectors $i_1, \dots, i_t \in I^d$ and construct a random projection matrix, where the coordinates of each vector are drawn independently from the standard normal distribution N(0,1);
Step (22): project the data set sample X onto the hyperplane orthogonal to the random vectors to obtain the data set X*.
Step (3): based on the data partition after projection, approximate the angle anomaly factor value of each element in the data set X* using the anomalous point detection method based on angle distribution. If the anomaly factor value is greater than the set threshold T, the element is judged to be an anomalous point; otherwise it is a normal point. The approximate calculation formulas of the angle-distribution-based detection method are as follows:
$$F(p) = \mathrm{Var}[\Theta_{apb}] = F_2(p) - (F_1(p))^2 \tag{1}$$
[Formula (2): equation image not reproduced in this text]
[Formula (3): equation image not reproduced in this text]
where F(p) in formula (1) denotes the anomaly factor value of the point p. In formula (2), the two sets (symbols given as images, not reproduced) consist of the points lying on either side of the point p in a random projection. In formula (3), the two quantities (symbols given as images, not reproduced) denote the product domain of p and the points on its two sides, n denotes the number of data elements in the data set sample X, and t denotes the number of random projection vectors.
Step (4): change the size of the sliding window by analyzing the distribution of the data set according to the anomaly degree of the current window data set. The data density G is calculated from the anomaly degree of the window data set at the current moment and the anomaly degree of the window data set at the previous historical moment. If the density G at the current moment is greater than the set parameter L1, the data set sample X is densely distributed and m data items are removed from the current window; if G is smaller than the set parameter L2, the data set sample X is sparsely distributed and the latest m data items in the history window are added to the data set sample X in the current window. The specific implementation steps are shown in Fig. 2. The formula for judging the distribution of the data set is as follows:
[Formula (4): equation image not reproduced in this text]
where S in formula (4) denotes the size of the data set in the current sliding window, the first symbol (given as an image, not reproduced) denotes the anomaly values of the elements within the sliding window at the current moment, and the second symbol (given as an image, not reproduced) denotes the anomaly values of the elements within the sliding window at the historical moment.
Step (5): update the data set and the sliding window size, output the anomalous points, and repeat the above steps to realize real-time detection of anomalous points in the data stream.

Claims (5)

1. A data stream anomaly identification method based on random projection angle distribution, characterized in that it is used for detecting anomalous points in a data stream and comprises the steps of:
Step (1): collecting data from the data stream in real time, obtaining an initial data set sample X whose size equals the initial sliding window, and preprocessing the data set sample X;
Step (2): taking random vectors $i_1, \dots, i_t \in I^d$ and constructing a random projection matrix, where the coordinates of the random vectors follow the standard normal distribution N(0,1), and projecting the data set sample X onto the hyperplane orthogonal to the random vectors, the projected data set being X*;
Step (3): analyzing each data item in X* by combining the random projection method with the angle method, and obtaining the anomaly factor value of each data item;
Step (4): analyzing the distribution of the data set sample X according to the anomaly factor values of the elements in the projected data set X* of the current window, and calculating the density G of the window data set sample X; if the density G at the current moment is greater than the set parameter L1, the data set sample X is densely distributed and m data items are removed from the current window; if G is smaller than the set parameter L2, the data set sample X is sparsely distributed and the latest m data items in the history window are added to the data set sample X in the current window;
Step (5): updating the data set sample X and the sliding window size, outputting the anomalous points, returning to step (2), and continuing to detect anomalous points.
2. The data stream anomaly identification method based on random projection angle distribution according to claim 1, characterized in that: in step (1), data in the data stream are collected in real time to obtain data of the initial sliding window size and are stored sequentially in the data set sample X; once the data set sample X is full, preprocessing of the data elements begins; to prevent non-standardized data from affecting the algorithm, the preprocessing comprises median standardization and normalization.
3. The data stream anomaly identification method based on random projection angle distribution according to claim 1, characterized in that: in step (2), random vectors are selected to project the data set sample X; first, random vectors $i_1, \dots, i_t \in I^d$ are selected to construct a random projection matrix, where the coordinates of each vector are drawn independently from the standard normal distribution N(0,1); the data set sample X is then projected onto the hyperplane orthogonal to the random vectors to obtain the data set X*.
4. The data stream anomaly identification method based on random projection angle distribution according to claim 1, characterized in that: in step (3), based on the data partition after projection, the angle anomaly factor value F of each element in the data set X* is approximated using the anomalous point detection method based on angle distribution; if the anomaly factor value is greater than the set threshold T, the element is judged to be an anomalous point; otherwise it is a normal point.
5. The data stream anomaly identification method based on random projection angle distribution according to claim 1, characterized in that: step (4) changes the size of the sliding window by analyzing the distribution of the data set according to the anomaly degree of the current window data set, and the data density G is calculated from the anomaly degree of the window data set at the current moment and the anomaly degree of the window data set at the previous historical moment.
CN201810227777.XA 2018-03-20 2018-03-20 A Data Stream Anomaly Recognition Method Based on Random Projection Angle Distribution Active CN110311879B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810227777.XA CN110311879B (en) 2018-03-20 2018-03-20 A Data Stream Anomaly Recognition Method Based on Random Projection Angle Distribution

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810227777.XA CN110311879B (en) 2018-03-20 2018-03-20 A Data Stream Anomaly Recognition Method Based on Random Projection Angle Distribution

Publications (2)

Publication Number Publication Date
CN110311879A CN110311879A (en) 2019-10-08
CN110311879B true CN110311879B (en) 2022-02-22

Family

ID=68073841

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810227777.XA Active CN110311879B (en) 2018-03-20 2018-03-20 A Data Stream Anomaly Recognition Method Based on Random Projection Angle Distribution

Country Status (1)

Country Link
CN (1) CN110311879B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120145286B (en) * 2025-05-16 2025-08-19 深圳软银思创科技有限公司 Abnormal data identification method, device, equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102369689A (en) * 2011-08-11 2012-03-07 华为技术有限公司 Long-term forecasting method and device of network flow
CN104869105A (en) * 2014-02-26 2015-08-26 重庆邮电大学 Abnormal state online identification method
CN105046275A (en) * 2015-07-13 2015-11-11 河海大学 Large-scale high-dimensional outlier data detection method based on angle variance
CN106302487A (en) * 2016-08-22 2017-01-04 中国农业大学 Agricultural Internet of Things data flow anomaly detects processing method and processing device in real time
CN107682319A (en) * 2017-09-13 2018-02-09 桂林电子科技大学 A Method of Data Flow Anomaly Detection and Multiple Verification Based on Enhanced Angle Anomaly Factor

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012154657A2 (en) * 2011-05-06 2012-11-15 The Penn State Research Foundation Robust anomaly detection and regularized domain adaptation of classifiers with application to internet packet-flows

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
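A Near-linear Time Approximation Algorithm for Angle-based Outlier Detection in High-dimensional Data; Ninh Pham et al.; KDD '12: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2012-08-31; full text *
Research on outlier data mining technology using the random projection algorithm; Li Qiao (李桥) et al.; Computer Engineering and Applications (《计算机工程与应用》); 2013-12-15; full text *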

Also Published As

Publication number Publication date
CN110311879A (en) 2019-10-08

Similar Documents

Publication Publication Date Title
Mao et al. A delay metric for video object detection: What average precision fails to tell
CN111242521A (en) Track anomaly detection method and system
Li et al. Visual abnormal behavior detection based on trajectory sparse reconstruction analysis
CN102881022B (en) Concealed-target tracking method based on on-line learning
US9075713B2 (en) Method for detecting anomalies in multivariate time series data
CN109359690B (en) Vehicle travel track identification method based on checkpoint data
CN104699755B (en) A kind of intelligent multiple target integrated recognition method based on data mining
CN114580572B (en) Abnormal value identification method and device, electronic equipment and storage medium
CN103605362A (en) Learning and anomaly detection method based on multi-feature motion modes of vehicle traces
CN105376260A (en) Network abnormity flow monitoring system based on density peak value cluster
CN105975443A (en) Lasso-based anomaly detection method and system
Tsintotas et al. DOSeqSLAM: Dynamic on-line sequence based loop closure detection algorithm for SLAM
CN101976504A (en) Multi-vehicle video tracking method based on color space information
CN102663775A (en) Target tracking method oriented to video with low frame rate
CN113225391A (en) Atmospheric environment monitoring quality monitoring method based on sliding window anomaly detection and computing equipment
Xu et al. A lof-based method for abnormal segment detection in machinery condition monitoring
CN106935038B (en) Parking detection system and detection method
CN119716843B (en) Multi-rainy-area airport bird condition monitoring method and system based on bird detection radar
CN110311879B (en) A Data Stream Anomaly Recognition Method Based on Random Projection Angle Distribution
Zhou et al. DCOR: Dynamic channel-wise outlier removal to de-noise LiDAR data corrupted by snow
CN109389053B (en) Method and system for detecting the position information of a vehicle to be tested around a target vehicle
Das et al. Adaptive deviation learning for visual anomaly detection with data contamination
CN101877135A (en) A Moving Object Detection Method Based on Background Reconstruction
CN109615007B (en) A deep learning network target detection method based on particle filter
CN105809707B (en) A kind of pedestrian tracting method based on random forests algorithm

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant