Background technique
With the rapid development of the Internet of Things, the mobile Internet, and social networks, semi-structured and unstructured data are growing at an exponential rate, and the channels from which data originate are also multiplying. Amid this explosion of data, the development of information technology has brought mankind into a new epoch: the era of big data. So-called big data refers to data collections whose content cannot be captured, managed, and processed with conventional software tools within an acceptable period of time, and it has four "V" features: large Volume of data, fast Velocity of change, polymorphic Variety of types, and low Value density. Strictly speaking, the concept of big data refers to big data technology, that is, to novel, low-cost processing techniques for massive data that differ from those of the past, and to the big data industry growing up on this basis. The most representative property of big data is that user information collected from terminals and applications of every type is analyzed, and through the intelligent analysis of an organization or research team, more valuable information is discovered.
Network traffic is now a common kind of large data collection, and its analysis is becoming more and more difficult. Especially when anomalies occur in the network, the detection of abnormal traffic presents a great challenge. Anomalies in network traffic can provide valuable information about network failures and security attacks, enabling monitoring and alarming; nowadays, network security has become very important. On the whole, the abnormal traffic that causes significant network trouble falls into several categories: first, the denial-of-service attack, a very harmful and extremely common attack pattern, referred to as DoS; second, the distributed denial-of-service attack, referred to as DDoS; and third, network worm and virus traffic, along with other abnormal flows. Such abnormal traffic can slow down or even paralyze backbone networks and has enormous harmful and destructive power. Its main manifestations are the occupancy of bandwidth and the blocking of the network, so that normal data cannot be sent, causing packet loss and similar phenomena. For every server, computer, and even terminal system in the network, abnormal traffic occupies a large number of CPU time slices and a large amount of memory, so that the system cannot respond normally to requests and services. For these problems, an analysis system for network traffic anomalies needs to be constructed, providing good early-warning, alarm, and traffic-handling functions.
With the continuous increase in data volume, manual inspection is becoming harder and harder to carry out. Twitter, LinkedIn, and Facebook currently record more than 12M events per second. These quantities keep growing, and machine-generated dynamic data sources such as sensors, processes, and automated systems are becoming increasingly prevalent, with their data volume estimated to grow by 40% every year. But manual analysis and detection capability remain limited, and it is becoming impossible to inspect and analyze these dynamic data sources by hand. Although humans cannot inspect such huge data flows manually, machines can. To provide responsive analysis of dynamic data sources, machines can use streaming-based data analysis methods to filter, highlight, and summarize the data, screening and summarizing it before it reaches the user. Since the end user lacks the ability to manually analyze every result over large data, computing resources can be used to maximize the usefulness of each result and thus facilitate the end user's analysis. That is, large data needs streaming-based data analysis methods that help identify the data and its trends. Progress in machine learning and statistics now shows that constructing such streaming-based data analysis methods is possible.
Summary of the invention
Aiming at the current explosive growth of data volume, which makes manual detection increasingly difficult, the present invention proposes a streaming data anomaly detection method that combines anomaly detection with data analysis, providing a flow classifier that classifies data automatically and an interpreter that explains the data characteristics to the user. The streaming detection method also helps detect abnormal traffic in the network. The present invention thus provides a streaming-based network abnormal traffic detection method; the steps of the method are as in Fig. 1, comprising:
1. Data extraction
Data extraction is divided into the following steps:
(1) Intercept network traffic data from the network; tools such as tcpdump, WireShark, and Tcptrace can be used for extraction.
(2) Store the network traffic data in a database; the postgreSQL database is recommended.
(3) Establish a connection with the database, and call its interface to extract the data stream from the database for analysis.
(4) Process the retrieved data set: design a per-column data frame D that retains the column names and types of the database table, extract the columns as arrays, and operate on the arrays directly, so that data is transferred automatically.
(5) Construct the set of data points to be processed; each data point comprises two parts, measures and attributes. Measures correspond to key performance indicators, such as the source IP address and destination IP address; attributes correspond to metadata, such as the protocol type, source port number, and destination port number.
(6) Use the measures to detect abnormal traffic, and use the attributes to explain the abnormal behavior.
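As a minimal sketch of step (4), a per-column data frame such as D can keep the database column names and hold each column as an array that is operated on directly. The class name ColumnFrame and its methods are illustrative assumptions, not part of the original design:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical per-column data frame: retains the database table's column
// names and stores each column as an array for direct operation.
public class ColumnFrame {
    private final Map<String, Object[]> columns = new LinkedHashMap<>();

    public void addColumn(String name, Object[] values) {
        columns.put(name, values);
    }

    public Object[] getColumn(String name) {
        return columns.get(name);
    }

    public int columnCount() {
        return columns.size();
    }

    public static void main(String[] args) {
        ColumnFrame d = new ColumnFrame();
        // Measures (e.g. source IP) and attributes (e.g. protocol) each
        // become one named column of the frame.
        d.addColumn("sa", new Object[]{"42.219.159.85", "10.0.0.1"});
        d.addColumn("pr", new Object[]{"TCP", "UDP"});
        System.out.println(d.columnCount());
    }
}
```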
2. Data feature transformation
After data extraction is complete, a feature transformation mechanism is designed to apply a series of feature transformations to the data in the data frame D, so that the user can analyze data of various types. For example, IP addresses are converted from strings to numeric values, and the probability of each source/destination IP address combination occurring in the data set is computed and stored in a times column. Providing a data feature transformation function allows the user to perform coded analysis on a domain-specific data set without modifying the subsequent classifier and interpreter, which enhances the practicality of the abnormal traffic detection method.
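The two transformations named above can be sketched as follows; this is a minimal illustration, and the class name FeatureTransform and the "src->dst" key format are assumptions for the example:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical feature transformations: convert dotted-quad IP strings to
// numeric values, and count how often each (source, destination) IP pair
// occurs, analogous to the times column described in the text.
public class FeatureTransform {

    // Convert an IPv4 address string to a long, e.g. "0.0.1.0" -> 256.
    public static long ipToLong(String ip) {
        long value = 0;
        for (String part : ip.split("\\.")) {
            value = value * 256 + Long.parseLong(part);
        }
        return value;
    }

    // Probability of each source/destination IP pair in the data set.
    public static Map<String, Double> pairProbabilities(String[] src, String[] dst) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < src.length; i++) {
            counts.merge(src[i] + "->" + dst[i], 1, Integer::sum);
        }
        Map<String, Double> probs = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            probs.put(e.getKey(), e.getValue() / (double) src.length);
        }
        return probs;
    }
}
```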
3. Classification operation
After the data transformation, a flow classifier is designed to detect and classify the data points in the data frame D. The classification operation marks data points according to the user's specification and identifies the abnormal traffic in the network, using an adaptable damped reservoir algorithm to realize streaming-based data analysis. The present invention provides two kinds of classifiers: a percentile classifier and a predicate classifier. The percentile classifier identifies traffic that occurs excessively often in the network flow as outliers; the predicate classifier identifies traffic whose value in a given attribute equals a particular value as outliers. The classification operation is realized as follows:
(1) Percentile classifier
In statistics, a percentile is defined by sorting a group of data from small to large and computing the cumulative percentages accordingly; the value of the data item at a given cumulative percentage is called the percentile of that percentage. For example, the pth percentile is a value such that at least p% of the data items are less than or equal to it, and at least (100-p)% of the data items are greater than or equal to it.
The percentile classifier is realized as follows:
1) First, a single column is selected as the group of metric values; here the times column, the number of occurrences of each IP pair, is selected, and percentiles are computed over it.
2) The occurrence count of each IP pair is classified as high or low according to the specified percentile, which yields a high threshold and a low threshold.
3) If a high value is specified as the standard, values above the high threshold are set to 1.0 and the remaining values to 0.0.
4) If a low value is specified as the standard, values below the low threshold are set to 1.0 and the remaining values to 0.0.
5) Since one cause of abnormal traffic is that a large number of requests occupy bandwidth and block the network so that normal data cannot be sent, the high value is set as the standard here: if the number of occurrences of an IP pair is excessive, it is marked as an outlier.
6) Finally, a new data frame is returned, which adds one column to the data frame D indicating the classification state of each row: 1.0 for an outlier, 0.0 for a normal value.
Example applied to a network traffic data set: first, a decision percentile of 0.7 is set; next, the times column is selected as the metric and the percentile of each entry is computed; then the percentile of each data entry is examined one by one, with entries above 0.7 identified as outliers and set to 1.0, and entries below 0.7 identified as normal values and set to 0.0; finally, the class labels are collected into a column and appended after the last column of the data frame, forming a new data frame that is saved and displayed in the report.
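The steps above can be sketched in a few lines; this is a minimal illustration using a nearest-rank percentile estimate, and the class and method names are assumptions for the example:

```java
import java.util.Arrays;

// Sketch of the percentile classifier: metric values above the threshold
// at the given percentile are labeled 1.0 (outlier), the rest 0.0.
public class PercentileClassifier {

    // Nearest-rank percentile over a sorted copy of the metric column.
    public static double percentileThreshold(double[] metric, double percentile) {
        double[] sorted = metric.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(percentile * sorted.length) - 1;
        return sorted[Math.max(rank, 0)];
    }

    // Returns the new classification column: 1.0 outlier, 0.0 normal value.
    public static double[] classifyHigh(double[] metric, double percentile) {
        double threshold = percentileThreshold(metric, percentile);
        double[] labels = new double[metric.length];
        for (int i = 0; i < metric.length; i++) {
            labels[i] = metric[i] > threshold ? 1.0 : 0.0;
        }
        return labels;
    }
}
```

The returned array would then be appended to the data frame as the classification-state column.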
(2) Predicate classifier
The predicate classifier classifies the data according to the label value of a predicate; the predicates include equal to, less than, and greater than. Unlike the percentile classifier, the outliers labeled by the predicate classifier are not determined by the values in the standard column but are user-defined in a configuration file, for example:
This instantiates a predicate classifier that marks every row whose "sa" column IP equals "42.219.159.85" as an outlier. At present, the predicate classifier supports only six different predicates: "==", "!=", "<", ">", "<=" and ">=".
Example applied to a network traffic data set: first, the source IP address is selected as the metric; next, the predicate "==" is selected and the threshold is set to the specified IP address "42.219.159.85"; then the source IP address of each data entry is examined one by one, with entries equal to "42.219.159.85" identified as outliers and set to 1.0, and entries not equal to "42.219.159.85" identified as normal values and set to 0.0; finally, the class labels are collected into a column and appended after the last column of the data frame, forming a new data frame that is saved and displayed in the report.
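The six supported predicates can be sketched as follows; this is a minimal illustration over string-valued columns (using compareTo to give the ordering predicates a meaning), and the class name is an assumption:

```java
import java.util.function.BiPredicate;

// Sketch of the predicate classifier: rows whose attribute value satisfies
// the configured predicate against the standard value are labeled 1.0.
public class PredicateClassifier {

    // Map the six supported predicate symbols to string comparisons.
    public static BiPredicate<String, String> predicate(String symbol) {
        switch (symbol) {
            case "==": return (a, b) -> a.equals(b);
            case "!=": return (a, b) -> !a.equals(b);
            case "<":  return (a, b) -> a.compareTo(b) < 0;
            case ">":  return (a, b) -> a.compareTo(b) > 0;
            case "<=": return (a, b) -> a.compareTo(b) <= 0;
            case ">=": return (a, b) -> a.compareTo(b) >= 0;
            default: throw new IllegalArgumentException("unsupported: " + symbol);
        }
    }

    // Returns the new classification column: 1.0 outlier, 0.0 normal value.
    public static double[] classify(String[] column, String symbol, String value) {
        BiPredicate<String, String> p = predicate(symbol);
        double[] labels = new double[column.length];
        for (int i = 0; i < column.length; i++) {
            labels[i] = p.test(column[i], value) ? 1.0 : 0.0;
        }
        return labels;
    }
}
```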
(3) In addition, the data distribution in the data stream changes over time, and the parameters in the classifier should be updated accordingly. To provide a dynamic response to streaming data, an algorithm is proposed: the adaptable damped reservoir algorithm, shown in Fig. 2. The algorithm separates the insertion of data entries from the data frame's decay decision, allowing both time-based and tuple-based decay policies and realizing streaming-based data detection. The concrete implementation steps are as follows:
1) First, a reservoir of a given size k is created, and an operation count c_w of the data entries inserted into the reservoir so far is maintained.
2) Insert a data entry. When inserting, if there is enough space in the reservoir, c_w is increased by 1; otherwise go to 3).
3) If the reservoir is full when inserting, the entry is placed into the reservoir with probability k/c_w, and a random entry is evicted from the reservoir.
4) Decay. When decaying, the algorithm multiplies the operation count by the decay factor: c_w := (1-α)c_w. Using the adaptable damped reservoir algorithm maintains the stability of the input data and realizes analysis and detection over streaming data, enhancing practicality.
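The steps above can be sketched as follows. This is a minimal sketch under the assumption that c_w counts every insertion and that a fixed random seed is acceptable; the class name and generic element type are assumptions for the example:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Sketch of the adaptable damped reservoir: insertion and decay are
// separate operations, and the operation count c_w is damped by a factor
// (1 - alpha) so that older observations gradually lose weight.
public class DampedReservoir<T> {
    private final int capacity;                      // k
    private final List<T> reservoir = new ArrayList<>();
    private double count;                            // c_w
    private final Random random;

    public DampedReservoir(int capacity, long seed) {
        this.capacity = capacity;
        this.random = new Random(seed);
    }

    public void insert(T item) {
        count += 1;
        if (reservoir.size() < capacity) {
            reservoir.add(item);                     // space available
        } else if (random.nextDouble() < capacity / count) {
            // keep with probability k / c_w, evicting a random entry
            reservoir.set(random.nextInt(capacity), item);
        }
    }

    public void decay(double alpha) {
        count *= (1 - alpha);                        // c_w := (1 - alpha) c_w
    }

    public int size() {
        return reservoir.size();
    }
}
```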
4. Explain operation
After the classification operation, an explain operation is performed. The explain operation groups and summarizes multiple data points, explaining the normal and abnormal behavior of each attribute combination. The relative risk ratio is used to identify attribute combinations that are comparatively common among outliers. Specifically, given an attribute combination that occurs a_0 times in outlier entries and a_i times in normal entries, while other attribute combinations occur b_0 times in outlier entries and b_i times in normal entries, the risk ratio is defined as:
Ratio = [a_0/(a_0+a_i)] / [b_0/(b_0+b_i)]
If a data point belongs to a particular attribute combination, the higher this ratio, the more likely the data point is to be an outlier. In addition, the abnormal support describes the probability that a given attribute combination occurs among the outlier entries. The explain operation is implemented in the following steps:
(1) Search all outlier entries and normal entries.
(2) Search for the attribute combinations with the minimum abnormal support, and record the minimum abnormal support.
(3) Compute the risk ratio of each single attribute value, and record the minimum risk ratio.
(4) Find the attribute combinations satisfying the following conditions: the abnormal support of each member attribute is greater than or equal to the minimum abnormal support, and the risk ratio is greater than or equal to the minimum risk ratio.
(5) Use the attribute combinations from (4) to build a prefix tree over the outlier entries. The prefix tree is a structure comprising two arrays, base and check. Each element of the base array represents an attribute node, called a state; the check array records the predecessor state of each state. A deterministic finite automaton (DFA) algorithm is used to realize the construction of the attribute combinations; here the prefix tree is presented with the attributes in decreasing order.
(6) Filter out the attribute combinations from (5) whose risk ratio is less than the minimum risk ratio, finally obtaining the risk ratio of every attribute combination.
Through the explain operation, the attribute combinations most likely to become outliers, and those least likely to become outliers, are analyzed.
Example applied to a network traffic data set: suppose there are 1,000,000 records; the explain operation may find that 7,860 entries are labeled as outliers and 992,140 entries are labeled as normal values. If 4,000 of the outlier entries have IP address 42.219.159.85, the abnormal support of this IP address is 4000/7860*100% = 50.9%; if 816,680 of the normal entries have IP address 42.219.159.85, its normal support is 816680/992140*100% = 82.3%. The risk ratio is:
[4000/(4000+816680)] / [3860/(3860+175460)] = 0.2264
This is a low risk ratio, so entries with IP address 42.219.159.85 are interpreted as entries unlikely to become outliers. Extending from single attributes to combinations of multiple attributes, the risk ratio of every attribute combination is finally obtained.
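The risk ratio and abnormal support of the worked example can be reproduced with a short sketch; the class name is an assumption, and the figures are those given above:

```java
// Sketch of the risk ratio and abnormal support used by the explain
// operation, reproducing the single-attribute example in the text.
public class RiskRatio {

    // Ratio = [a0/(a0+ai)] / [b0/(b0+bi)], where a0/ai count the attribute
    // combination in outlier/normal entries and b0/bi count all other
    // combinations in outlier/normal entries.
    public static double ratio(long a0, long ai, long b0, long bi) {
        return (a0 / (double) (a0 + ai)) / (b0 / (double) (b0 + bi));
    }

    // Abnormal support: fraction of outlier entries with this combination.
    public static double abnormalSupport(long a0, long totalOutliers) {
        return a0 / (double) totalOutliers;
    }

    public static void main(String[] args) {
        // Figures from the example: 7860 outliers, 992140 normal entries;
        // 4000 outliers and 816680 normal entries have IP 42.219.159.85.
        long a0 = 4000, ai = 816680;
        long b0 = 7860 - 4000, bi = 992140 - 816680;
        System.out.printf("support=%.3f ratio=%.4f%n",
                abnormalSupport(a0, 7860), ratio(a0, ai, b0, bi));
    }
}
```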
5. Output report
The number of explanations generated by the explain operation is still very large, so the explanations are summarized statistically to produce an ordered explanation list, sorted according to the degree to which the anomalous events occur, and a static report is generated and displayed to the user.
Specific embodiment
The hardware environment of the invention is mainly one PC host, whose CPU is an Intel(R) Core(TM) i5-4570 at 3.20 GHz, with 4 GB of RAM and a 64-bit operating system.
The software of the invention is developed in the Java language under the Eclipse environment, using ubuntu 16.04.1 as the platform. The database is postgreSQL. The Java version is 1.8.0_161, the Eclipse version is 4.4.2, and the postgreSQL database version is 10.4.
The experimental data are network flow header records, stored in csv format; the data table format is as follows:

| Field | Meaning |
| te | Timestamp at which the flow ended |
| td | Duration of the flow |
| sa | Source IP address |
| da | Destination IP address |
| sp | Source port |
| dp | Destination port |
| pr | Protocol |
| flg | Flags |
| fwd | Forwarding status |
| stos | Type of service |
| pkt | Packets exchanged in the flow |
| byt | Corresponding number of bytes |
| times | Number of occurrences of each IP pair |
A specific example is shown in Fig. 3.
The detailed process is broadly divided into two parts: the first part is the data classification operation, and the second part is the data explain operation.
1. Classification operation part
(1) Percentile classifier
Algorithm description
Algorithm input: M, P_e
Algorithm output: S
Explanation: M is the standard column specified by the user, here set to the times column, the number of occurrences of each IP pair; P_e is the percentile; S is the new data frame, containing the column of per-row classification states.
Algorithm steps:
1) Pass in the data and the specified metric column M;
2) Use the percentile P_e estimation algorithm to compute the high threshold and low threshold of the metric column;
3) Compare each value in the metric column with the threshold: if a high value is specified as the standard, values above the high threshold are set to the outlier label 1.0; if a low value is specified as the standard, values below the low threshold are set to outliers;
4) Store the classification results in a new column, add it to the data structure, and return the new data frame S.
Its pseudocode is as follows:
(2) Predicate classifier
Algorithm description
Algorithm input: M, P_r, L
Algorithm output: S
Explanation: M is the standard column specified by the user, here set to the IP address; P_r is the specified predicate; L is the standard value; S is the new data frame, containing the column of per-row classification states.
Algorithm steps:
1) Pass in the data;
2) Determine the data type of the standard value L;
3) According to the specified predicate, compare each value in the standard column with the standard value; values satisfying the condition are set to the outlier label 1.0;
4) Store the classification results in a new column, add it to the data structure, and return the new data frame S.
Its pseudocode is as follows:
2. Explain operation part
(1) Explain operator
Algorithm description
Algorithm input: r, s, O, I
Algorithm output: A, F, E_x
Explanation: r is the minimum risk ratio; s is the minimum support of an abnormal attribute combination; O is the set of outlier entries; I is the set of normal entries; A is the set of attribute combinations; F is the attribute prefix tree; E_x is the set of explanation results.
Algorithm steps:
1) Pass in the outlier entry set O and the normal entry set I obtained from the classification;
2) Set the minimum risk ratio r and the minimum abnormal support s;
3) Find the attribute combinations A satisfying the conditions;
4) Use the attribute combinations from 3) to build the prefix tree F;
5) Output the explanation result set E_x.
Its pseudocode is as follows: