CN110311879B - A Data Stream Anomaly Recognition Method Based on Random Projection Angle Distribution - Google Patents
- Publication number
- CN110311879B (application number CN201810227777.XA)
- Authority
- CN
- China
- Prior art keywords
- data
- data set
- random
- sample
- method based
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/06—Generation of reports
- H04L43/062—Generation of reports related to network traffic
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1408—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
- H04L63/1425—Traffic logging, e.g. anomaly detection
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Computer Security & Cryptography (AREA)
- Computer Hardware Design (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Complex Calculations (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
A data stream anomaly identification method based on random projection angle distribution, used to rapidly detect potential anomaly points in a data stream. By analyzing the characteristics of the data stream, the method quickly computes the anomaly factor value of each data item in the stream. A data stream is characterized by large volume, continuity, high speed, and transience. To accommodate these characteristics, a random projection method is combined with an angle-based anomaly detection method to quickly detect abnormal points in the data stream, and a dynamic sliding window method is then used to improve the detection precision and adaptability of the algorithm. The method can effectively improve the detection efficiency and precision of the algorithm and provides a theoretical basis for rapid, real-time detection of abnormal points in a data stream.
Description
Technical Field
The invention relates to data mining and abnormal point detection technologies, and in particular to a data stream anomaly identification method based on random projection angle distribution.
Background
In the era of big data, massive amounts of data are generated continuously in people's daily lives. Mining anomalous data, that is, data objects that deviate from normal data, is one of the important tasks in the field of data mining. Anomaly detection is widely applied in fields such as information security, intrusion detection, and financial security, and has attracted great interest from researchers. How to find valuable abnormal data effectively and quickly is therefore an urgent and meaningful research direction.
At present, researchers have proposed many abnormal point detection methods, mainly statistics-based, distance-based, and density-based methods. Statistics-based methods depend largely on whether the data set fits some probability distribution model. Distance-based methods can only handle low-dimensional data sets and are not suitable for high-dimensional data sets. Density-based methods require a large number of parameters, and improper parameter selection can greatly affect detection precision. Traditional algorithms therefore cannot be applied well to high-dimensional data.
Given the current state of research, traditional abnormal point detection methods are difficult to adapt to high-dimensional data streams. A data stream is characterized by large volume, continuity, high speed, and transience. To accommodate these characteristics, the invention provides a data stream anomaly identification method based on random projection angle distribution, which can identify abnormal points in a data stream quickly and accurately.
Disclosure of Invention
To address the problems described in the background, the invention provides a method for identifying abnormal points in a high-dimensional data stream, solving the problem that traditional abnormal point detection methods are not suitable for abnormal point detection on a high-dimensional data stream model.
The technical solution adopted by the invention comprises the following steps:
Step (1): data in the data stream are collected in real time to obtain an initial data set sample X whose size equals that of the initial sliding window, and the data set sample X is preprocessed;
Step (2): t random vectors $r_1, \dots, r_t \in \mathbb{R}^d$ are taken to construct a random projection matrix, where the coordinates of each random vector follow the standard normal distribution N(0, 1), and the data set sample X is projected onto the hyperplane orthogonal to the random vector to obtain the projected data set X;
Step (3): each data item in X is evaluated by combining the random projection method with the angle method, and the anomaly factor value of each data item is obtained;
Step (4): the distribution of the current data set sample X is analyzed according to the anomaly factor values of the elements in the projected data set X of the current window, and the density G of the window's data set sample X is calculated; if the density G at the current moment is greater than the preset parameter L1, the data set sample X is densely distributed and m data are removed from the current window; if G is smaller than the preset parameter L2, the data set sample X is sparsely distributed and the latest m data in the history window are added to the data set sample X in the current window;
Step (5): the data set sample X and the size of the sliding window are updated, the abnormal points are output, and the method returns to step (2) to continue detecting abnormal points.
Further, in step (1), data in the data stream are collected in real time and stored sequentially in the data set sample X; when the data set sample X is full, the data elements are preprocessed to avoid the influence of non-standardized data on the algorithm, where the preprocessing comprises median standardization and normalization.
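The preprocessing in step (1) can be sketched as follows. The text does not give the exact formulas, so this is only a minimal sketch under assumptions: median standardization is taken to mean centering each feature on its median and scaling by the median absolute deviation, and normalization is taken to mean min-max rescaling to [0, 1]. All function names are hypothetical.

```python
from statistics import median


def median_standardize(column):
    """Center a feature column on its median and scale by its MAD (assumed definition)."""
    med = median(column)
    # Median absolute deviation; fall back to 1.0 so a constant column does not divide by zero.
    mad = median(abs(v - med) for v in column) or 1.0
    return [(v - med) / mad for v in column]


def min_max_normalize(column):
    """Rescale a feature column to the range [0, 1] (assumed definition of normalization)."""
    lo, hi = min(column), max(column)
    span = (hi - lo) or 1.0
    return [(v - lo) / span for v in column]


def preprocess(window):
    """Apply both steps to every feature (column) of a full window of samples (rows)."""
    columns = zip(*window)
    scaled = [min_max_normalize(median_standardize(list(col))) for col in columns]
    return [list(row) for row in zip(*scaled)]
```

Under these assumed definitions, min-max rescaling is unaffected by the preceding shift and positive scale, so the two steps only differ from min-max alone if the intended normalization is a different operation.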
Further, step (2) comprises the following sub-steps:
Step (21): t random vectors $r_1, \dots, r_t \in \mathbb{R}^d$ are taken to construct a random projection matrix, where each vector coordinate is drawn independently from the standard normal distribution N(0, 1);
Step (22): the data set sample X is projected onto the hyperplane orthogonal to the random vector to obtain the projected data set X.
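Steps (21) and (22) can be illustrated with a short sketch. It assumes that what the later angle computation needs from each projection is the signed coordinate of every sample along the random vector (its dot product), since that coordinate decides on which side of an orthogonal hyperplane a sample lies. The function names and the `seed` parameter are hypothetical.

```python
import random


def random_vectors(t, d, seed=None):
    """Draw t vectors in R^d whose coordinates are independent N(0, 1) samples."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(t)]


def project(samples, vectors):
    """Return one list per random vector holding the dot product of each sample with it."""
    return [
        [sum(x * r for x, r in zip(sample, vec)) for sample in samples]
        for vec in vectors
    ]


def split_sides(projected, index):
    """Indices of points whose projected value is below / above that of point `index`."""
    pivot = projected[index]
    left = [j for j, v in enumerate(projected) if v < pivot]
    right = [j for j, v in enumerate(projected) if v > pivot]
    return left, right
```

For one projection, `split_sides` gives the two groups of points lying on either side of a chosen point, which is the partition the angle-based step works from.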
Further, in step (3), based on the data partition after projection, the angle anomaly factor value of each element in the data set X is approximately calculated using an abnormal point detection method based on angle distribution; if the anomaly factor value is greater than a preset threshold T, the element is determined to be an abnormal point, otherwise it is a normal point. The formula of the abnormal point detection method based on angle distribution is as follows:
$$F(p) = \mathrm{Var}[\Theta_{apb}] = F_2(p) - \bigl(F_1(p)\bigr)^2 \tag{1}$$
where F(p) in formula (1) denotes the anomaly factor value of point p. The two sets in formula (2) consist of the points lying on either side of point p under a random projection. The two terms in formula (3) denote the product domain of p and the points on its two sides; n denotes the number of data elements in the data set sample X, and t denotes the number of random projection vectors.
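To make formula (1) concrete, the sketch below computes the variance of angles exactly by brute force: for a point p it enumerates every pair (a, b) of other points, measures the angle at p, and returns the mean of the squared angles minus the squared mean angle. This is not the approximate random-projection estimator of the method; it is an exact reference for the quantity that estimator approximates, and the function names are hypothetical.

```python
import math
from itertools import combinations


def angle(p, a, b):
    """Angle at p, in radians, between the vectors p->a and p->b."""
    va = [x - y for x, y in zip(a, p)]
    vb = [x - y for x, y in zip(b, p)]
    dot = sum(x * y for x, y in zip(va, vb))
    norm = math.sqrt(sum(x * x for x in va)) * math.sqrt(sum(x * x for x in vb))
    # Clamp to [-1, 1] to guard against floating-point overshoot before acos.
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def angle_variance(points, index):
    """Exact Var[theta_apb] for points[index]: second moment minus squared first moment."""
    p = points[index]
    # Skip p itself and exact duplicates of p, which would give zero-length vectors.
    others = [q for j, q in enumerate(points) if j != index and q != p]
    if len(others) < 2:
        raise ValueError("need at least two distinct other points")
    angles = [angle(p, a, b) for a, b in combinations(others, 2)]
    f1 = sum(angles) / len(angles)                   # mean angle
    f2 = sum(th * th for th in angles) / len(angles)  # mean squared angle
    return f2 - f1 * f1
```

The brute-force version costs on the order of n² angle evaluations per point, which is the overhead the random projection step is meant to avoid.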
Further, step (4) adjusts the size of the sliding window by analyzing the distribution of the data set according to the anomaly degrees of the current data set sample X: when the data set sample X is densely distributed, the data set in the current window is reduced; when the data set sample X is sparsely distributed, the data set in the current window is enlarged. The formula for judging the distribution of the data set is as follows:
where S in formula (4) denotes the size of the data set in the current sliding window, and the remaining two terms denote the anomaly values of the elements in the sliding window at the current time and at the historical time, respectively.
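The window adjustment rule can be sketched as below. Because formula (4) is not reproduced here, the density G is taken as an already computed input rather than derived. The sketch also assumes that "reducing m data" means dropping the m oldest items of the current window and that the history items are prepended in their original order; the function and parameter names are hypothetical.

```python
def adjust_window(current, history, density, upper, lower, m):
    """Return the resized window given density G and the thresholds L1 (upper), L2 (lower)."""
    if m <= 0:
        return list(current)
    if density > upper:
        # Dense distribution: shrink the window by dropping the m oldest items (assumption).
        return list(current[m:])
    if density < lower:
        # Sparse distribution: extend the window with the latest m items of the history window.
        return list(history[-m:]) + list(current)
    # Density between the two thresholds: keep the window unchanged.
    return list(current)
```

For example, with `upper=0.8`, `lower=0.2` and `m=2`, a density of 0.9 shortens a five-item window to its last three items, while a density of 0.1 lengthens it by the two most recent history items.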
Further, after step (5) updates the data set sample X and the sliding window, the above steps are repeated to realize anomaly detection on the high-dimensional data stream.
When abnormal points are detected in a data stream, the high-dimensional data stream is characterized by large volume, continuity, high speed, and transience. Compared with existing patents, the method of the invention uses random projection to reduce the computational complexity of the angle-distribution-based abnormal point detection method and thus the time overhead of the algorithm; in addition, it analyzes the distribution of the data set in the current window and dynamically adjusts the size of the data set in the sliding window, improving the adaptability and detection precision of the algorithm.
Drawings
To make the objects, technical solutions, and beneficial effects of the invention clearer, the following drawings are provided:
Fig. 1 is a flowchart of the data stream anomaly identification method based on random projection angle distribution according to the present invention.
Fig. 2 is a flowchart of the sliding window adjustment in step (4) of the present invention.
Detailed Description
The invention will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present invention are shown in the drawings, it should be understood that the present invention may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
The fast identification method of high-dimensional data stream anomalies according to one embodiment of the present invention is described in detail below with reference to fig. 1 and 2. The method comprises the following steps:
Step (1): data in the data stream are collected in real time and stored sequentially in the data set sample X. When the data set sample X is full, the data elements are preprocessed to avoid the influence of non-standardized data on the algorithm. The preprocessing comprises median standardization and normalization.
Step (2): projection vectors are selected and the data set is projected, specifically as follows:
Step (21): t random vectors $r_1, \dots, r_t \in \mathbb{R}^d$ are taken to construct a random projection matrix, where each vector coordinate is drawn independently from the standard normal distribution N(0, 1);
Step (22): the data set sample X is projected onto the hyperplane orthogonal to the random vector to obtain the projected data set X.
Step (3): based on the data partition after projection, the angle anomaly factor value of each element in the data set X is approximately calculated using an abnormal point detection method based on angle distribution; if the anomaly factor value is greater than a preset threshold T, the element is determined to be an abnormal point, otherwise it is a normal point. The approximate calculation formula of the abnormal point detection method based on angle distribution is as follows:
$$F(p) = \mathrm{Var}[\Theta_{apb}] = F_2(p) - \bigl(F_1(p)\bigr)^2 \tag{1}$$
where F(p) in formula (1) denotes the anomaly factor value of point p. The two sets in formula (2) consist of the points lying on either side of point p under a random projection. The two terms in formula (3) denote the product domain of p and the points on its two sides; n denotes the number of data elements in the data set sample X, and t denotes the number of random projection vectors.
Step (4): the size of the sliding window is adjusted by analyzing the distribution of the data set according to the anomaly degrees of the current window's data set. The data density G is calculated from the anomaly degrees of the data set in the current time window and those of the data set in the most recent historical time window. If the density G at the current moment is greater than the preset parameter L1, the data set sample X is densely distributed and m data are removed from the current window; if G is smaller than the preset parameter L2, the data set sample X is sparsely distributed and the latest m data in the history window are added to the data set sample X in the current window. The specific implementation steps are shown in Fig. 2, and the formula for judging the distribution of the data set is as follows:
where S in formula (4) denotes the size of the data set in the current sliding window, and the remaining two terms denote the anomaly values of the elements in the sliding window at the current time and at the historical time, respectively.
Step (5): the data set and the size of the sliding window are updated, the abnormal points are output, and the above steps are repeated to realize real-time detection of abnormal points in the data stream.
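The overall collect, score, threshold, and repeat cycle of steps (1) to (5) can be outlined as follows. This is a simplified sketch, not the embodiment itself: it uses fixed-size, non-overlapping windows instead of the dynamically resized sliding window, and it takes the scoring routine as a caller-supplied function. The names `detect_stream`, `score_fn`, and `threshold` are hypothetical.

```python
from collections import deque


def detect_stream(stream, window_size, score_fn, threshold):
    """Yield (item, score) for every item whose anomaly factor exceeds the threshold T."""
    window = deque(maxlen=window_size)
    for item in stream:
        window.append(item)
        if len(window) < window_size:
            continue  # keep collecting until the window is full
        data = list(window)
        scores = score_fn(data)  # one anomaly factor value per item in the window
        for value, score in zip(data, scores):
            if score > threshold:
                yield value, score
        window.clear()  # simplification: start a fresh window instead of sliding


def deviation_from_mean(data):
    """Toy stand-in for the anomaly factor: absolute deviation from the window mean."""
    mean = sum(data) / len(data)
    return [abs(x - mean) for x in data]
```

Calling `list(detect_stream([1, 1, 1, 10, 2, 2, 2, 2], 4, deviation_from_mean, 5))` flags only the value 10 in the first window; in practice `score_fn` would be replaced by the angle-based anomaly factor.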
Claims (5)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810227777.XA CN110311879B (en) | 2018-03-20 | 2018-03-20 | A Data Stream Anomaly Recognition Method Based on Random Projection Angle Distribution |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810227777.XA CN110311879B (en) | 2018-03-20 | 2018-03-20 | A Data Stream Anomaly Recognition Method Based on Random Projection Angle Distribution |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110311879A CN110311879A (en) | 2019-10-08 |
CN110311879B true CN110311879B (en) | 2022-02-22 |
Family
ID=68073841
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810227777.XA Active CN110311879B (en) | 2018-03-20 | 2018-03-20 | A Data Stream Anomaly Recognition Method Based on Random Projection Angle Distribution |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110311879B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN120145286B (en) * | 2025-05-16 | 2025-08-19 | 深圳软银思创科技有限公司 | Abnormal data identification method, device, equipment and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102369689A (en) * | 2011-08-11 | 2012-03-07 | 华为技术有限公司 | Long-term forecasting method and device of network flow |
CN104869105A (en) * | 2014-02-26 | 2015-08-26 | 重庆邮电大学 | Abnormal state online identification method |
CN105046275A (en) * | 2015-07-13 | 2015-11-11 | 河海大学 | Large-scale high-dimensional outlier data detection method based on angle variance |
CN106302487A (en) * | 2016-08-22 | 2017-01-04 | 中国农业大学 | Agricultural Internet of Things data flow anomaly detects processing method and processing device in real time |
CN107682319A (en) * | 2017-09-13 | 2018-02-09 | 桂林电子科技大学 | A Method of Data Flow Anomaly Detection and Multiple Verification Based on Enhanced Angle Anomaly Factor |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2012154657A2 (en) * | 2011-05-06 | 2012-11-15 | The Penn State Research Foundation | Robust anomaly detection and regularized domain adaptation of classifiers with application to internet packet-flows |
Non-Patent Citations (2)
Title |
---|
A Near-linear Time Approximation Algorithm for Angle-based Outlier Detection in High-dimensional Data; Ninh Pham et al.; KDD '12: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2012-08-31; full text * |
Research on Outlier Data Mining Techniques Based on the Random Projection Algorithm; Li Qiao et al.; Computer Engineering and Applications; 2013-12-15; full text * |
Also Published As
Publication number | Publication date |
---|---|
CN110311879A (en) | 2019-10-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Mao et al. | A delay metric for video object detection: What average precision fails to tell | |
CN111242521A (en) | Track anomaly detection method and system | |
Li et al. | Visual abnormal behavior detection based on trajectory sparse reconstruction analysis | |
CN102881022B (en) | Concealed-target tracking method based on on-line learning | |
US9075713B2 (en) | Method for detecting anomalies in multivariate time series data | |
CN109359690B (en) | Vehicle travel track identification method based on checkpoint data | |
CN104699755B (en) | A kind of intelligent multiple target integrated recognition method based on data mining | |
CN114580572B (en) | Abnormal value identification method and device, electronic equipment and storage medium | |
CN103605362A (en) | Learning and anomaly detection method based on multi-feature motion modes of vehicle traces | |
CN105376260A (en) | Network abnormity flow monitoring system based on density peak value cluster | |
CN105975443A (en) | Lasso-based anomaly detection method and system | |
Tsintotas et al. | DOSeqSLAM: Dynamic on-line sequence based loop closure detection algorithm for SLAM | |
CN101976504A (en) | Multi-vehicle video tracking method based on color space information | |
CN102663775A (en) | Target tracking method oriented to video with low frame rate | |
CN113225391A (en) | Atmospheric environment monitoring quality monitoring method based on sliding window anomaly detection and computing equipment | |
Xu et al. | A lof-based method for abnormal segment detection in machinery condition monitoring | |
CN106935038B (en) | Parking detection system and detection method | |
CN119716843B (en) | Multi-rainy-area airport bird condition monitoring method and system based on bird detection radar | |
CN110311879B (en) | A Data Stream Anomaly Recognition Method Based on Random Projection Angle Distribution | |
Zhou et al. | DCOR: Dynamic channel-wise outlier removal to de-noise LiDAR data corrupted by snow | |
CN109389053B (en) | Method and system for detecting the position information of a vehicle to be tested around a target vehicle | |
Das et al. | Adaptive deviation learning for visual anomaly detection with data contamination | |
CN101877135A (en) | A Moving Object Detection Method Based on Background Reconstruction | |
CN109615007B (en) | A deep learning network target detection method based on particle filter | |
CN105809707B (en) | A kind of pedestrian tracting method based on random forests algorithm |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||