Background technique
With the rapid development of the Internet of Things, the mobile Internet, and social networks, semi-structured and unstructured data are growing at an exponential rate, and the channels from which data originate are also multiplying. Amid this explosion of data, the development of information technology has brought mankind into a new epoch: the era of big data. So-called big data refers to data collections whose content cannot be captured, managed, and processed with conventional software tools within an acceptable period of time, and it has four "V" features: large Volume of data, fast Velocity of change, polymorphic Variety of types, and low Value density. Strictly speaking, the concept of big data refers to big data technology, that is, to novel, low-cost processing techniques for massive data that differ from those of the past, and to the big data industry growing up on this basis. The most representative property of big data is that user information collected from terminals and applications of every type is analyzed, and through the intelligent analysis of an organization or research team, more valuable information is discovered.
Network traffic is now a common kind of large data collection, and its analysis is becoming more and more difficult. Especially when anomalies occur in the network, the detection of abnormal traffic presents a great challenge. Anomalies in network traffic can provide valuable information about network failures and security attacks, enabling monitoring and alarming; nowadays, network security has become very important. On the whole, the abnormal traffic that causes significant network trouble falls into several categories: first, the denial-of-service attack, a very harmful and extremely common attack pattern, referred to as DoS; second, the distributed denial-of-service attack, referred to as DDoS; and third, network worm and virus traffic, along with other abnormal flows. Such abnormal traffic can slow down or even paralyze backbone networks and has enormous harmful and destructive power. Its main manifestations are the occupancy of bandwidth and the blocking of the network, so that normal data cannot be sent, causing packet loss and similar phenomena. For every server, computer, and even terminal system in the network, abnormal traffic occupies a large number of CPU time slices and a large amount of memory, so that the system cannot respond normally to requests and services. For these problems, an analysis system for network traffic anomalies needs to be constructed, providing good early-warning, alarm, and traffic-handling functions.
With the continuous increase in data volume, manual inspection is becoming harder and harder to carry out. Twitter, LinkedIn, and Facebook currently record more than 12M events per second. These quantities keep growing, and machine-generated dynamic data sources such as sensors, processes, and automated systems are becoming increasingly prevalent, with their data volume estimated to grow by 40% every year. But manual analysis and detection capability remain limited, and it is becoming impossible to inspect and analyze these dynamic data sources by hand. Although humans cannot inspect such huge data flows manually, machines can. To provide responsive analysis of dynamic data sources, machines can use streaming-based data analysis methods to filter, highlight, and summarize the data, screening and summarizing it before it reaches the user. Since the end user lacks the ability to manually analyze every result over large data, computing resources can be used to maximize the usefulness of each result and thus facilitate the end user's analysis. That is, large data needs streaming-based data analysis methods that help identify the data and its trends. Progress in machine learning and statistics now shows that constructing such streaming-based data analysis methods is possible.
Summary of the invention
Aiming at the current explosive growth of data volume, which makes manual detection increasingly difficult, the present invention proposes a streaming data anomaly detection method that combines anomaly detection with data analysis, providing a flow classifier that classifies data automatically and an interpreter that explains the data characteristics to the user. The streaming detection method also helps detect abnormal traffic in the network. The present invention thus provides a streaming-based network abnormal traffic detection method; the steps of the method are as in Fig. 1, comprising:
1. Data extraction
Data extraction is divided into the following steps:
(1) Intercept network traffic data from the network; tools such as tcpdump, WireShark, and Tcptrace can be used for extraction.
(2) Store the network traffic data in a database; the postgreSQL database is recommended.
(3) Establish a connection with the database, and call its interface to extract the data stream from the database for analysis.
(4) Process the retrieved data set: design a per-column data frame D that retains the column names and types of the database table, extract the columns as arrays, and operate on the arrays directly, so that data is transferred automatically.
(5) Construct the set of data points to be processed; each data point comprises two parts, measures and attributes. Measures correspond to key performance indicators, such as the source IP address and destination IP address; attributes correspond to metadata, such as the protocol type, source port number, and destination port number.
(6) Use the measures to detect abnormal traffic, and use the attributes to explain the abnormal behavior.
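As a minimal sketch of step (4), a per-column data frame such as D can keep the database column names and hold each column as an array that is operated on directly. The class name ColumnFrame and its methods are illustrative assumptions, not part of the original design:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical per-column data frame: retains the database table's column
// names and stores each column as an array for direct operation.
public class ColumnFrame {
    private final Map<String, Object[]> columns = new LinkedHashMap<>();

    public void addColumn(String name, Object[] values) {
        columns.put(name, values);
    }

    public Object[] getColumn(String name) {
        return columns.get(name);
    }

    public int columnCount() {
        return columns.size();
    }

    public static void main(String[] args) {
        ColumnFrame d = new ColumnFrame();
        // Measures (e.g. source IP) and attributes (e.g. protocol) each
        // become one named column of the frame.
        d.addColumn("sa", new Object[]{"42.219.159.85", "10.0.0.1"});
        d.addColumn("pr", new Object[]{"TCP", "UDP"});
        System.out.println(d.columnCount());
    }
}
```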
2. Data feature transformation
After data extraction is complete, a feature transformation mechanism is designed to apply a series of feature transformations to the data in the data frame D, so that the user can analyze data of various types. For example, IP addresses are converted from strings to numeric values, and the probability of each source/destination IP address combination occurring in the data set is computed and stored in a times column. Providing a data feature transformation function allows the user to perform coded analysis on a domain-specific data set without modifying the subsequent classifier and interpreter, which enhances the practicality of the abnormal traffic detection method.
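The two transformations named above can be sketched as follows; this is a minimal illustration, and the class name FeatureTransform and the "src->dst" key format are assumptions for the example:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical feature transformations: convert dotted-quad IP strings to
// numeric values, and count how often each (source, destination) IP pair
// occurs, analogous to the times column described in the text.
public class FeatureTransform {

    // Convert an IPv4 address string to a long, e.g. "0.0.1.0" -> 256.
    public static long ipToLong(String ip) {
        long value = 0;
        for (String part : ip.split("\\.")) {
            value = value * 256 + Long.parseLong(part);
        }
        return value;
    }

    // Probability of each source/destination IP pair in the data set.
    public static Map<String, Double> pairProbabilities(String[] src, String[] dst) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < src.length; i++) {
            counts.merge(src[i] + "->" + dst[i], 1, Integer::sum);
        }
        Map<String, Double> probs = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            probs.put(e.getKey(), e.getValue() / (double) src.length);
        }
        return probs;
    }
}
```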
3. Classification operation
After the data transformation, a flow classifier is designed to detect and classify the data points in the data frame D. The classification operation marks data points according to the user's specification and identifies the abnormal traffic in the network, using an adaptable damped reservoir algorithm to realize streaming-based data analysis. The present invention provides two kinds of classifiers: a percentile classifier and a predicate classifier. The percentile classifier identifies traffic that occurs excessively often in the network flow as outliers; the predicate classifier identifies traffic whose value in a given attribute equals a particular value as outliers. The classification operation is realized as follows:
(1) Percentile classifier
In statistics, a percentile is defined by sorting a group of data from small to large and computing the cumulative percentages accordingly; the value of the data item at a given cumulative percentage is called the percentile of that percentage. For example, the pth percentile is a value such that at least p% of the data items are less than or equal to it, and at least (100-p)% of the data items are greater than or equal to it.
The percentile classifier is realized as follows:
1) First, a single column is selected as the group of metric values; here the times column, the number of occurrences of each IP pair, is selected, and percentiles are computed over it.
2) The occurrence count of each IP pair is classified as high or low according to the specified percentile, which yields a high threshold and a low threshold.
3) If a high value is specified as the standard, values above the high threshold are set to 1.0 and the remaining values to 0.0.
4) If a low value is specified as the standard, values below the low threshold are set to 1.0 and the remaining values to 0.0.
5) Since one cause of abnormal traffic is that a large number of requests occupy bandwidth and block the network so that normal data cannot be sent, the high value is set as the standard here: if the number of occurrences of an IP pair is excessive, it is marked as an outlier.
6) Finally, a new data frame is returned, which adds one column to the data frame D indicating the classification state of each row: 1.0 for an outlier, 0.0 for a normal value.
Example applied to a network traffic data set: first, a decision percentile of 0.7 is set; next, the times column is selected as the metric and the percentile of each entry is computed; then the percentile of each data entry is examined one by one, with entries above 0.7 identified as outliers and set to 1.0, and entries below 0.7 identified as normal values and set to 0.0; finally, the class labels are collected into a column and appended after the last column of the data frame, forming a new data frame that is saved and displayed in the report.
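The steps above can be sketched in a few lines; this is a minimal illustration using a nearest-rank percentile estimate, and the class and method names are assumptions for the example:

```java
import java.util.Arrays;

// Sketch of the percentile classifier: metric values above the threshold
// at the given percentile are labeled 1.0 (outlier), the rest 0.0.
public class PercentileClassifier {

    // Nearest-rank percentile over a sorted copy of the metric column.
    public static double percentileThreshold(double[] metric, double percentile) {
        double[] sorted = metric.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(percentile * sorted.length) - 1;
        return sorted[Math.max(rank, 0)];
    }

    // Returns the new classification column: 1.0 outlier, 0.0 normal value.
    public static double[] classifyHigh(double[] metric, double percentile) {
        double threshold = percentileThreshold(metric, percentile);
        double[] labels = new double[metric.length];
        for (int i = 0; i < metric.length; i++) {
            labels[i] = metric[i] > threshold ? 1.0 : 0.0;
        }
        return labels;
    }
}
```

The returned array would then be appended to the data frame as the classification-state column.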
(2) Predicate classifier
The predicate classifier classifies the data according to the label value of a predicate; the predicates include equal to, less than, and greater than. Unlike the percentile classifier, the outliers labeled by the predicate classifier are not determined by the values in the standard column but are user-defined in a configuration file, for example:
This instantiates a predicate classifier that marks every row whose "sa" column IP equals "42.219.159.85" as an outlier. At present, the predicate classifier supports only six different predicates: "==", "!=", "<", ">", "<=" and ">=".
Example applied to a network traffic data set: first, the source IP address is selected as the metric; next, the predicate "==" is selected and the threshold is set to the specified IP address "42.219.159.85"; then the source IP address of each data entry is examined one by one, with entries equal to "42.219.159.85" identified as outliers and set to 1.0, and entries not equal to "42.219.159.85" identified as normal values and set to 0.0; finally, the class labels are collected into a column and appended after the last column of the data frame, forming a new data frame that is saved and displayed in the report.
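The six supported predicates can be sketched as follows; this is a minimal illustration over string-valued columns (using compareTo to give the ordering predicates a meaning), and the class name is an assumption:

```java
import java.util.function.BiPredicate;

// Sketch of the predicate classifier: rows whose attribute value satisfies
// the configured predicate against the standard value are labeled 1.0.
public class PredicateClassifier {

    // Map the six supported predicate symbols to string comparisons.
    public static BiPredicate<String, String> predicate(String symbol) {
        switch (symbol) {
            case "==": return (a, b) -> a.equals(b);
            case "!=": return (a, b) -> !a.equals(b);
            case "<":  return (a, b) -> a.compareTo(b) < 0;
            case ">":  return (a, b) -> a.compareTo(b) > 0;
            case "<=": return (a, b) -> a.compareTo(b) <= 0;
            case ">=": return (a, b) -> a.compareTo(b) >= 0;
            default: throw new IllegalArgumentException("unsupported: " + symbol);
        }
    }

    // Returns the new classification column: 1.0 outlier, 0.0 normal value.
    public static double[] classify(String[] column, String symbol, String value) {
        BiPredicate<String, String> p = predicate(symbol);
        double[] labels = new double[column.length];
        for (int i = 0; i < column.length; i++) {
            labels[i] = p.test(column[i], value) ? 1.0 : 0.0;
        }
        return labels;
    }
}
```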
(3) In addition, the data distribution in the data stream changes over time, and the parameters in the classifier should be updated accordingly. To provide a dynamic response to streaming data, an algorithm is proposed: the adaptable damped reservoir algorithm, shown in Fig. 2. The algorithm separates the insertion of data entries from the data frame's decay decision, allowing both time-based and tuple-based decay policies and realizing streaming-based data detection. The concrete implementation steps are as follows:
1) First, a reservoir of a given size k is created, and an operation count c_w of the data entries inserted into the reservoir so far is maintained.
2) Insert a data entry. When inserting, if there is enough space in the reservoir, c_w is increased by 1; otherwise go to 3).
3) If the reservoir is full when inserting, the entry is placed into the reservoir with probability k/c_w, and a random entry is evicted from the reservoir.
4) Decay. When decaying, the algorithm multiplies the operation count by the decay factor: c_w := (1-α)c_w. Using the adaptable damped reservoir algorithm maintains the stability of the input data and realizes analysis and detection over streaming data, enhancing practicality.
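The steps above can be sketched as follows. This is a minimal sketch under the assumption that c_w counts every insertion and that a fixed random seed is acceptable; the class name and generic element type are assumptions for the example:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Sketch of the adaptable damped reservoir: insertion and decay are
// separate operations, and the operation count c_w is damped by a factor
// (1 - alpha) so that older observations gradually lose weight.
public class DampedReservoir<T> {
    private final int capacity;                      // k
    private final List<T> reservoir = new ArrayList<>();
    private double count;                            // c_w
    private final Random random;

    public DampedReservoir(int capacity, long seed) {
        this.capacity = capacity;
        this.random = new Random(seed);
    }

    public void insert(T item) {
        count += 1;
        if (reservoir.size() < capacity) {
            reservoir.add(item);                     // space available
        } else if (random.nextDouble() < capacity / count) {
            // keep with probability k / c_w, evicting a random entry
            reservoir.set(random.nextInt(capacity), item);
        }
    }

    public void decay(double alpha) {
        count *= (1 - alpha);                        // c_w := (1 - alpha) c_w
    }

    public int size() {
        return reservoir.size();
    }
}
```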
4. Explain operation
After the classification operation, an explain operation is performed. The explain operation groups and summarizes multiple data points, explaining the normal and abnormal behavior of each attribute combination. The relative risk ratio is used to identify attribute combinations that are comparatively common among outliers. Specifically, given an attribute combination that occurs a_0 times in outlier entries and a_i times in normal entries, while other attribute combinations occur b_0 times in outlier entries and b_i times in normal entries, the risk ratio is defined as:
Ratio = [a_0/(a_0+a_i)] / [b_0/(b_0+b_i)]
If a data point belongs to a particular attribute combination, the higher this ratio, the more likely the data point is to be an outlier. In addition, the abnormal support describes the probability that a given attribute combination occurs among the outlier entries. The explain operation is implemented in the following steps:
(1) Search all outlier entries and normal entries.
(2) Search for the attribute combinations with the minimum abnormal support, and record the minimum abnormal support.
(3) Compute the risk ratio of each single attribute value, and record the minimum risk ratio.
(4) Find the attribute combinations satisfying the following conditions: the abnormal support of each member attribute is greater than or equal to the minimum abnormal support, and the risk ratio is greater than or equal to the minimum risk ratio.
(5) Use the attribute combinations from (4) to build a prefix tree over the outlier entries. The prefix tree is a structure comprising two arrays, base and check. Each element of the base array represents an attribute node, called a state; the check array records the predecessor state of each state. A deterministic finite automaton (DFA) algorithm is used to realize the construction of the attribute combinations; here the prefix tree is presented with the attributes in decreasing order.
(6) Filter out the attribute combinations from (5) whose risk ratio is less than the minimum risk ratio, finally obtaining the risk ratio of every attribute combination.
Through the explain operation, the attribute combinations most likely to become outliers, and those least likely to become outliers, are analyzed.
Example applied to a network traffic data set: suppose there are 1,000,000 records; the explain operation may find that 7,860 entries are labeled as outliers and 992,140 entries are labeled as normal values. If 4,000 of the outlier entries have IP address 42.219.159.85, the abnormal support of this IP address is 4000/7860*100% = 50.9%; if 816,680 of the normal entries have IP address 42.219.159.85, its normal support is 816680/992140*100% = 82.3%. The risk ratio is:
[4000/(4000+816680)] / [3860/(3860+175460)] = 0.2264
This is a low risk ratio, so entries with IP address 42.219.159.85 are interpreted as entries unlikely to become outliers. Extending from single attributes to combinations of multiple attributes, the risk ratio of every attribute combination is finally obtained.
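The risk ratio and abnormal support of the worked example can be reproduced with a short sketch; the class name is an assumption, and the figures are those given above:

```java
// Sketch of the risk ratio and abnormal support used by the explain
// operation, reproducing the single-attribute example in the text.
public class RiskRatio {

    // Ratio = [a0/(a0+ai)] / [b0/(b0+bi)], where a0/ai count the attribute
    // combination in outlier/normal entries and b0/bi count all other
    // combinations in outlier/normal entries.
    public static double ratio(long a0, long ai, long b0, long bi) {
        return (a0 / (double) (a0 + ai)) / (b0 / (double) (b0 + bi));
    }

    // Abnormal support: fraction of outlier entries with this combination.
    public static double abnormalSupport(long a0, long totalOutliers) {
        return a0 / (double) totalOutliers;
    }

    public static void main(String[] args) {
        // Figures from the example: 7860 outliers, 992140 normal entries;
        // 4000 outliers and 816680 normal entries have IP 42.219.159.85.
        long a0 = 4000, ai = 816680;
        long b0 = 7860 - 4000, bi = 992140 - 816680;
        System.out.printf("support=%.3f ratio=%.4f%n",
                abnormalSupport(a0, 7860), ratio(a0, ai, b0, bi));
    }
}
```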
5. Output report
The number of explanations generated by the explain operation is still very large, so the explanations are summarized statistically to produce an ordered explanation list, sorted according to the degree to which the anomalous events occur, and a static report is generated and displayed to the user.
Specific embodiment
The hardware environment of the invention is mainly one PC host, whose CPU is an Intel(R) Core(TM) i5-4570 at 3.20 GHz, with 4 GB of RAM and a 64-bit operating system.
The software of the invention is developed in the Java language under the Eclipse environment, using ubuntu 16.04.1 as the platform. The database is postgreSQL. The Java version is 1.8.0_161, the Eclipse version is 4.4.2, and the postgreSQL database version is 10.4.
The experimental data are network flow header records, stored in csv format; the data table format is as follows:

| Field | Meaning |
| te | Timestamp at which the flow ended |
| td | Duration of the flow |
| sa | Source IP address |
| da | Destination IP address |
| sp | Source port |
| dp | Destination port |
| pr | Protocol |
| flg | Flags |
| fwd | Forwarding status |
| stos | Type of service |
| pkt | Packets exchanged in the flow |
| byt | Corresponding number of bytes |
| times | Number of occurrences of each IP pair |
A specific example is shown in Fig. 3.
The detailed process is broadly divided into two parts: the first part is the data classification operation, and the second part is the data explain operation.
1. Classification operation part
(1) Percentile classifier
Algorithm description
Algorithm input: M, P_e
Algorithm output: S
Explanation: M is the standard column specified by the user, here set to the times column, the number of occurrences of each IP pair; P_e is the percentile; S is the new data frame, containing the column of per-row classification states.
Algorithm steps:
1) Pass in the data and the specified metric column M;
2) Use the percentile P_e estimation algorithm to compute the high threshold and low threshold of the metric column;
3) Compare each value in the metric column with the threshold: if a high value is specified as the standard, values above the high threshold are set to the outlier label 1.0; if a low value is specified as the standard, values below the low threshold are set to outliers;
4) Store the classification results in a new column, add it to the data structure, and return the new data frame S.
Its pseudocode is as follows:
(2) Predicate classifier
Algorithm description
Algorithm input: M, P_r, L
Algorithm output: S
Explanation: M is the standard column specified by the user, here set to the IP address; P_r is the specified predicate; L is the standard value; S is the new data frame, containing the column of per-row classification states.
Algorithm steps:
1) Pass in the data;
2) Determine the data type of the standard value L;
3) According to the specified predicate, compare each value in the standard column with the standard value; values satisfying the condition are set to the outlier label 1.0;
4) Store the classification results in a new column, add it to the data structure, and return the new data frame S.
Its pseudocode is as follows:
2. Explain operation part
(1) Explain operator
Algorithm description
Algorithm input: r, s, O, I
Algorithm output: A, F, E_x
Explanation: r is the minimum risk ratio; s is the minimum support of an abnormal attribute combination; O is the set of outlier entries; I is the set of normal entries; A is the set of attribute combinations; F is the attribute prefix tree; E_x is the set of explanation results.
Algorithm steps:
1) Pass in the outlier entry set O and the normal entry set I obtained from the classification;
2) Set the minimum risk ratio r and the minimum abnormal support s;
3) Find the attribute combinations A satisfying the conditions;
4) Use the attribute combinations from 3) to build the prefix tree F;
5) Output the explanation result set E_x.
Its pseudocode is as follows: