WO2017061893A1 - Method and system for automatic discovery of network usage patterns

Info

Publication number: WO2017061893A1
Application number: PCT/RU2015/000657
Authority: WO
Inventors: Alexander Alexeevich SEROV; Valery Nikolaevitch GLUKHOV; Hongbo Zhang
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2015-10-09
Filing date: 2015-10-09
Publication date: 2017-04-13
Anticipated expiration: 2018-04-09

Abstract

The disclosure relates to a method (100) for automatic discovery of network usage patterns, including the following steps: parsing (101) an input stream of network packets from a communication network according to a predetermined source network address and a predetermined destination network address; determining (102) a set of source network address packet instances based on the parsed input stream of network packets with respect to the predetermined source network address and a set of destination network address packet instances based on the parsed input stream of network packets with respect to the predetermined destination network address; determining (103) a matrix of statistics (T(i,j)) based on statistical evaluation of the set of source network address packet instances versus the set of destination network address packet instances; determining (104) a network usage decomposition matrix (Z(i,j)) based on statistical evaluation of the matrix of statistics (T(i,j)) with respect to service port information indicated in the parsed input stream of network packets; determining (105) a set of support groups based on statistical evaluation of the network usage decomposition matrix (Z(i,j)); and assigning (106) the set of source network address packet instances to a respective support group of the set of support groups based on a distance of the network usage decomposition matrix (Z(i,j)) with respect to the respective support group.

Description

Method and system for automatic discovery of network usage patterns

TECHNICAL FIELD

The present disclosure relates to a method and a system for automatic discovery of network usage patterns, in particular with respect to the field of automatic analysis of telecommunication networks.

BACKGROUND

The task of managing and securing modern wired and wireless computer networks is a challenging problem. This challenge arises due to the scale and complexity of events in these networks and to the highly dynamic network state. The high scale and complexity come both from a large number of heterogeneous hosts and devices in the network and from a wide range of diverse applications running on computers.

Today network management involves the permanent work of highly skilled analysts who are familiar with the features of software and hardware use in the controlled segment of a computer network. Modern techniques used for automatic management of networks are based on the results of manual processing of data by analysts. These analysts must be able to catch network behavior patterns. Based on their specialization they must be able, for example, to extract the set of features characterizing fraudulent behavior that uses network-accessible resources. The field of network technology is currently developing very actively. The emergence of new network communication protocols and new architectural solutions, such as computing or networking distributed application architecture, complicates the analysis and management of network resources. Over recent years, traffic in global networks has been growing very fast. The introduction of voice, video and other real-time applications has changed the way both local and global networks are used. This has triggered the need to change the handling of network traffic.

The management of telecommunications networks today is highly dependent on the effectiveness of tools used to automate the analysis of network events. Today it is impossible to perform effective monitoring and control of telecommunication networks without the use of automatic analysis tools. The degree of automation of this analysis is one of the key technical problems in the field of telecommunications. The growing rate of network events decreases the effectiveness of off-line analysis tools. The use of real-time network applications must be accompanied by the development of real-time network analysis tools.

There are various hard problems in the field of automatic analysis of telecommunication networks, such as revealing the set of activity patterns to automate the extraction of network activities and the detection of changes in the network; and mapping dynamic interaction to static information and providing a utility for the summarization, characterization, profiling and tracking of activities within the network.

Modern tools used for automatic analysis of telecommunication networks should be able to perform actions in real time mode on the basis of traffic processing; they should be able to perform multi-parametric analysis of telecommunication networks. The use of multi-parametric analysis is necessary for the adaptation of granularity of analysis. They should be able to process data streams on different levels of hierarchy of the analyzed networks. Simultaneous processing of data streams on different levels of network hierarchy may be necessary to have the full understanding of events in the controlled network. Modern tools should further be able to perform the extraction of models of network resources usage in automatic mode. Extraction of patterns in automatic mode in real time may be necessary for timely prediction of technical problems or fraudulent behavior.

Drawbacks of existing methods used for the analysis of network traffic include an absence of scalability, an absence of universality, the presence of a stage of manual data processing, an absence of adaptability and the impossibility of simultaneous detection of both individual and group types of network resource usage. Regarding the absence of scalability, a significant number of methods are intended for analyses that must be made on a definite level of the hierarchy of a telecommunication network, for example, host-level or network-level analysis. Regarding the absence of universality, a method is ordinarily developed to solve some definite problem of network traffic analysis, for example to analyze the dependence of traffic upon time, and it is impossible to reconfigure tools implementing this method for another type of analysis. This leads to a quite narrow field of application of such methods. With regard to the presence of the stage of manual data processing, manual data processing increases the accuracy of the results, but it reduces the overall efficiency of the use of analysis tools. With regard to the absence of adaptability of methods used to process data streams, this leads to an inability to fully automate the process of data stream analysis. Hence, there is a need for a network traffic analysis system without the above-described drawbacks.

SUMMARY

It is the object of the invention to increase the efficiency of network analysis tools. A further object of the invention is to provide low-complexity tools for the analysis of events in telecommunications networks. This object is achieved by the features of the independent claims. Further implementation forms are apparent from the dependent claims, the description and the figures.

The disclosure has its scope in the development of scalable and universal tools which may be applied both for host-level analysis and network-level analysis. The ability to reconfigure the tools makes it possible to use similar network analysis tools for the solution of quite different practical problems. The same tools, for example, may be used both for the detection of fraudulent use of network resources and for the identification of behavior patterns that characterize similar use of network resources by large groups of users. Implementation of the present invention will increase the degree of automation of tools used by the personnel responsible for network management.

The concept as described in this disclosure is to provide an automatic identification of patterns characterizing the current state of the use of resources which are available in the analyzed segment of the network, and the detailing of information about these patterns. In particular, the concept consists in the calculation of a decomposition matrix which describes the main patterns of the usage of network resources. This matrix is calculated as a result of the analysis of network traffic filtered on the basis of a preliminary choice of two fields of the IP packet header. A matrix of behavior decomposition is used for the clustering of IP packets and the final identification of groups. Each group has a unique pattern of usage of network resources.

The disclosure is based on the following basic assumptions: Network usage analysis is performed on the basis of processing of data fields included in the header of network packets. Input data to be processed is represented in the form of a matrix: Src versus Dst. Here Src is a set of instances of type 1 and Dst is a set of instances of type 2. Each type of instance is based on the use of a mask: SrcName, DstName. Each mask characterizes one field of the header of a network packet. Input data is processed with the aim of constructing the set of basic patterns of the use of network resources. This set of patterns completely characterizes the analyzed data set. Each pattern consists of a set of weight coefficients. Each coefficient characterizes the participation of a definite Dst instance in the above-mentioned usage pattern. Numerical analysis is carried out by the calculation of a usage decomposition matrix. This matrix describes the degree of similarity in the behavior of different instances of Src, which is represented in the input data.

In order to describe the invention in detail, the following terms, abbreviations and notations will be used:

IP: Internet Protocol

Src: Source network address, e.g. source IP address

Dst: Destination network address, e.g. destination IP address

T(i,j): Matrix of statistics

PAM: Port Activity Matrix

D(i,j): Matrix of distances, Distance

Z(i,j): Matrix of usage decomposition

simCoeff(i): Vector of similarity coefficients

In the following, systems, devices and methods for discovery of network usage patterns are described. A network usage pattern is a pattern characterizing the current state of the use of resources which are available in the analyzed segment of the network and the detailing of information about this pattern.

The systems, devices and methods as described in this disclosure can be used in a very wide range of network analysis applications, for example: Automatic identification of patterns characterizing network behavior of users. Automatic profiling of these patterns; Automatic detection of situations which characterize high risk of network attacks; Automatic detection of unauthorized intruders in the network; Automatic detection of the cases characterizing the fraudulent use of hardware or software tools; and Automatic detection of situations which characterize high risk of failure in the monitored network segment.

According to a first aspect, the invention relates to a method for automatic discovery of network usage patterns, the method comprising: parsing an input stream of network packets from a communication network according to a predetermined source network address and a predetermined destination network address; determining a set of source network address packet instances based on the parsed input stream of network packets with respect to the predetermined source network address and a set of destination network address packet instances based on the parsed input stream of network packets with respect to the predetermined destination network address; determining a matrix of statistics based on statistical evaluation of the set of source network address packet instances versus the set of destination network address packet instances; determining a network usage decomposition matrix based on statistical evaluation of the matrix of statistics with respect to service port information indicated in the parsed input stream of network packets; determining a set of support groups based on statistical evaluation of the network usage decomposition matrix; and assigning the set of source network address packet instances to a respective support group of the set of support groups based on a distance of the network usage decomposition matrix (Z(i,j)) with respect to the respective support group.

Such a method allows creation of new means for controlling communication networks, in particular telecommunication networks. These new means allow online automatic extraction of models of network resources usage at different levels of the hierarchy of the analyzed telecommunications networks; Multi-parameter analysis of telecommunication networks, carried out in real time on the basis of traffic processing; and application of adaptive methods for automatic control and management of telecommunication networks. Thus, the method provides an efficient tool for network analysis of low complexity which can be used for the analysis of events in telecommunications networks.

In a first possible implementation form of the method according to the first aspect, the method comprises: determining a port activity matrix based on statistical evaluation of the matrix of statistics with respect to the service port information indicated in the parsed input stream of network packets; and determining the network usage decomposition matrix based on a decomposition of the port activity matrix.

This provides the advantage that the port activity matrix is an efficient tool for monitoring activity on the network ports.

In a second possible implementation form of the method according to the first implementation form of the first aspect, the method comprises: determining the network usage decomposition matrix based on a singular value decomposition of the port activity matrix into a first ancillary matrix, a decomposed matrix and a second ancillary matrix.

This provides the advantage that by using the singular value decomposition, the resulting matrices can be efficiently processed.

In a third possible implementation form of the method according to the second implementation form of the first aspect, the method comprises: determining a projection matrix based on a projection of the matrix of statistics by using the second ancillary matrix.

This provides the advantage that by using the projection matrix the projection of the input matrix of statistics can be efficiently performed. Thus, the method has reduced complexity.

In a fourth possible implementation form of the method according to the third implementation form of the first aspect, the method comprises: determining a matrix of distances based on a distance of the matrix of statistics with respect to the projection matrix.

This provides the advantage that by using the matrix of distances, network patterns can be efficiently resolved.

In a fifth possible implementation form of the method according to the fourth implementation form of the first aspect, the distance is according to the following relation:

$$D_i = \sqrt{\sum_j \left(T_{ij} - P_{ij}\right)^2}$$

where T_ij is the matrix of statistics, P_ij is the projection matrix, i is the index of source network address packet instance and j is the index of destination network address packet instance.

This provides the advantage that the matrix of distances is easy to calculate when using such an L2 norm.
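A minimal sketch of such an L2-norm distance, assuming one distance value per source instance (row), computed between a row of the matrix of statistics and the corresponding row of the projection matrix. The function name `row_distances` and the list-of-lists layout are illustrative assumptions, not taken from the disclosure.

```python
import math

def row_distances(T, P):
    """Return one L2 distance per row i between T[i] and P[i].

    T and P are assumed to be equally shaped lists of rows
    (rows: Src instances, columns: Dst instances).
    """
    return [
        math.sqrt(sum((t - p) ** 2 for t, p in zip(t_row, p_row)))
        for t_row, p_row in zip(T, P)
    ]

# Example: the first row differs from its projection, the second does not.
D = row_distances([[3, 4], [1, 1]], [[0, 0], [1, 1]])
```

A row whose distance is zero is reproduced exactly by its projection; larger values indicate source instances that the projection describes less well.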

In a sixth possible implementation form of the method according to the fourth implementation form or the fifth implementation form of the first aspect, the method comprises: determining the network usage decomposition matrix based on the matrix of distances.

This provides the advantage that the matrix of distances provides an efficient way to determine the network usage decomposition matrix.

In a seventh possible implementation form of the method according to the sixth implementation form of the first aspect, the method comprises: determining the network usage decomposition matrix according to the following relation:

where T_ij is the matrix of statistics, D_i is the distance, i is the index of source network address packet instance and j is the index of destination network address packet instance.

This provides the advantage that the network usage decomposition can efficiently be computed by using that L2 norm with respect to the matrix of statistics.

In an eighth possible implementation form of the method according to the first aspect as such or according to any of the preceding implementation forms of the first aspect, determining the set of support groups comprises: determining a covariance matrix based on the network usage decomposition matrix.

This provides the advantage that the covariance matrix represents statistical properties of the system.

In a ninth possible implementation form of the method according to the eighth implementation form of the first aspect, the method comprises: determining a number of support groups in the set of support groups based on a sum of the covariance matrix with respect to a number of destination network address packet instances.

This provides the advantage that computing the covariance matrix is sufficient for supporting the support groups. The method is highly efficient and has a low complexity.

In a tenth possible implementation form of the method according to the eighth implementation form or the ninth implementation form of the first aspect, the method comprises: determining a number of support groups in the set of support groups based on the following relation:

wherein the number of support groups is equal to the number of different values of a support group array simCoeff_i with respect to a number i of source network address packet instances. In other words, the number of support groups is equal to the number of different values of simCoeff. This provides the advantage that the number of support groups can be efficiently determined by computing a sum over the covariance matrix.
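The counting of support groups can be illustrated with a small sketch. The relation itself is not reproduced in the text, so the code below rests on loudly stated assumptions: the covariance matrix is taken between the rows (Src instances) of the usage decomposition matrix Z, simCoeff_i is taken as the i-th row sum of that covariance matrix divided by the number of Dst instances, and two coefficients count as the same value when they differ by less than a tolerance `eps`. All function names are hypothetical.

```python
def covariance_matrix(Z):
    """Population covariance between the rows of Z (assumed layout: rows = Src)."""
    n = len(Z[0])
    means = [sum(row) / n for row in Z]
    return [
        [sum((a - ma) * (b - mb) for a, b in zip(ra, rb)) / n
         for rb, mb in zip(Z, means)]
        for ra, ma in zip(Z, means)
    ]

def similarity_coefficients(Z):
    """Assumed form: row sum of the covariance matrix over the number of Dst instances."""
    n_dst = len(Z[0])
    return [sum(row) / n_dst for row in covariance_matrix(Z)]

def distinct_values(sim_coeff, eps=1e-9):
    """Collect the different values of simCoeff, up to the tolerance eps."""
    values = []
    for s in sim_coeff:
        if not any(abs(s - v) < eps for v in values):
            values.append(s)
    return values

# Two Src instances behave identically, the third differs.
Z = [[1, 0], [1, 0], [0, 1]]
sim = similarity_coefficients(Z)
markers = distinct_values(sim)
```

In this toy input the first two rows receive the same coefficient, so two distinct values (and hence two support groups, under the stated assumptions) result.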

In an eleventh possible implementation form of the method according to the tenth implementation form of the first aspect, the method comprises: assigning the set of source network address packet instances to a respective support group based on a distance of the support group array simCoeff_i with respect to an array of support group markers Mark.

This provides the advantage that the distance to an array of support group markers is an accurate measure for computing the number of support groups.

In a twelfth possible implementation form of the method according to the eleventh implementation form of the first aspect, the assigning is based on the following relation:

wherein Mark is the array of support group markers and ε is the value of a calculation error.

This provides the advantage that this L1 measure can be easily computed. The comparison with a threshold is also easy to implement. That is, the method is highly efficient and has a low complexity.
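A minimal sketch of a threshold-based assignment of this kind, assuming that a source instance i joins the first support group k for which the absolute difference between simCoeff_i and the marker Mark_k is smaller than ε. The index k, the strict inequality, the first-match rule and the function name are assumptions made for illustration.

```python
def assign_to_groups(sim_coeff, marks, eps):
    """Map each Src index to the first marker within eps (L1 distance).

    Returns (groups, unassigned): groups[k] lists the Src indices assigned
    to marker k; unassigned lists indices that matched no marker.
    """
    groups = {k: [] for k in range(len(marks))}
    unassigned = []
    for i, s in enumerate(sim_coeff):
        for k, m in enumerate(marks):
            if abs(s - m) < eps:
                groups[k].append(i)
                break
        else:
            # No marker was close enough for this instance.
            unassigned.append(i)
    return groups, unassigned

groups, unassigned = assign_to_groups(
    [0.125, 0.125, -0.125, 0.9], marks=[0.125, -0.125], eps=1e-6
)
```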

In a thirteenth possible implementation form of the method according to the first aspect as such or according to any of the preceding implementation forms of the first aspect, the method comprises: determining a center of a respective support group as a vector of mean values of coordinates of the source network address packet instances included in the respective support group; and checking if source network address packet instances assigned to a respective support group have a predetermined distance to the center of the respective support group.

This provides the advantage that by using a reference to the center, an assignment to a respective support group can be accurately performed.
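The center computation and the distance check can be sketched as follows. The sketch assumes Euclidean distance and a caller-supplied threshold `max_dist`; the disclosure only speaks of "a predetermined distance", so both choices, as well as the function names, are illustrative.

```python
import math

def group_center(points):
    """Vector of mean values of the coordinates of the group's instances."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]

def check_group(points, max_dist):
    """Split instance indices into those within max_dist of the center and the rest."""
    center = group_center(points)
    keep, moved = [], []
    for idx, point in enumerate(points):
        target = keep if math.dist(point, center) <= max_dist else moved
        target.append(idx)
    return center, keep, moved

# Three nearby instances and one outlier in a two-dimensional coordinate space.
center, keep, moved = check_group([[0, 0], [2, 0], [1, 0], [1, 10]], max_dist=3)
```

Instances returned in `moved` are candidates for removal from the support group, for example into a secondary group.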

According to a second aspect, the invention relates to a system for automatic discovery of network usage patterns, the system comprising: a preprocessing subsystem for parsing an input stream of network packets from a communication network according to a predetermined source network address and a predetermined destination network address; an input buffer processing subsystem for determining a set of source network address packet instances based on the parsed input stream of network packets with respect to the predetermined source network address and a set of destination network address packet instances based on the parsed input stream of network packets with respect to the predetermined destination network address; a statistical data processing subsystem for determining a matrix of statistics based on statistical evaluation of the set of source network address packet instances versus the set of destination network address packet instances; and for determining a network usage decomposition matrix based on statistical evaluation of the matrix of statistics with respect to service port information indicated in the parsed input stream of network packets; and a usage patterns identification subsystem for determining a set of support groups based on statistical evaluation of the network usage decomposition matrix; and for assigning the set of source network address packet instances to a respective support group of the set of support groups based on a distance of the network usage decomposition matrix with respect to the respective support group.

Such a system allows creation of new means for controlling communication networks, in particular telecommunication networks. These new means allow online automatic extraction of models of network resources usage at different levels of the hierarchy of the analyzed telecommunications networks; Multi-parameter analysis of telecommunication networks, carried out in real time on the basis of traffic processing; and application of adaptive methods for automatic control and management of telecommunication networks. Thus, the system provides an efficient tool for network analysis of low complexity which can be used for the analysis of events in telecommunications networks.

According to a third aspect, the invention relates to a computer implemented method for automatic discovery of network usage patterns comprising the steps of processing of an input stream of network packets, statistical data processing, and network usage patterns identification.

In a first possible implementation form of the method according to the third aspect, the processing of the input stream of network packets includes: parsing of network packets according to the predefined masks of data fields; and processing of the input buffer, including the buffering of instances and calculating the matrix of statistics on the basis of the results of buffering.
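The parsing and matrix calculation steps can be sketched as follows. This is a hedged illustration only: packets are assumed to be already decoded into dictionaries of header fields, the field names act as the SrcName and DstName masks, and the matrix of statistics is assumed to hold plain packet counts. The function names and sample addresses are hypothetical.

```python
from collections import Counter

def parse_pairs(packets, src_name, dst_name):
    """Yield one (Src instance, Dst instance) pair per packet.

    src_name and dst_name play the role of the SrcName and DstName masks.
    """
    for packet in packets:
        yield packet[src_name], packet[dst_name]

def matrix_of_statistics(pairs):
    """Count packets per (Src, Dst) pair and return (src, dst, T).

    Assumption: T[i][j] is the number of buffered packets from src[i] to dst[j].
    """
    counts = Counter(pairs)
    src = sorted({s for s, _ in counts})
    dst = sorted({d for _, d in counts})
    T = [[counts[(s, d)] for d in dst] for s in src]
    return src, dst, T

packets = [
    {"IPSrc": "10.0.0.1", "IPDst": "10.0.0.9", "SrcPort": 40000},
    {"IPSrc": "10.0.0.1", "IPDst": "10.0.0.9", "SrcPort": 40001},
    {"IPSrc": "10.0.0.2", "IPDst": "10.0.0.8", "SrcPort": 40000},
]
src, dst, T = matrix_of_statistics(parse_pairs(packets, "IPSrc", "IPDst"))
```

Changing the two mask arguments, for example to `"IPSrc"` and `"SrcPort"`, yields a different matrix from the same packets without changing the code.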

In a second possible implementation form of the method according to the third aspect, statistical data processing includes: calculating the Port Activity Matrix; calculating the Singular Value Decomposition of the Port Activity Matrix; calculating the projection of the matrix of statistics into the space of basic usage models; calculating the matrix of distances; and calculating the matrix of usage decomposition on the basis of the matrix of distances.

In a third possible implementation form of the method according to the third aspect, the identification of network usage patterns includes: calculating the vector of similarity coefficients; calculating the set of markers for support groups; assigning instances to support groups and calculating the coordinates of the center for each support group; purifying support groups by moving inappropriate instances into secondary groups; and constructing the set of clusters of instances on the basis of support groups and secondary groups.

According to a fourth aspect, the invention relates to a system for automatic discovery of network usage patterns comprising: an input subsystem which includes the means for capturing a network packet stream, the stream of network packets to be analyzed being the output of this subsystem; a subsystem of data stream preprocessing which includes the means for parsing network packets, the stream of network packets to be analyzed being the input of this subsystem and the stream of pairs (Src instance, Dst instance) being the output of this subsystem; a subsystem of input buffer processing which includes the means for filling the buffer to be analyzed, the stream of pairs (Src instance, Dst instance) being the input of this subsystem and the stream of matrices of statistics Src vs Dst being the output of this subsystem; a subsystem of statistical data processing which includes the means for the calculation of the usage decomposition matrix, the stream of matrices of statistics Src vs Dst being the input of this subsystem and the stream of usage decomposition matrices being the output of this subsystem; a subsystem of usage patterns identification which includes the means for constructing the set of clusters of Src instances, the stream of usage decomposition matrices being the input of this subsystem and the stream of the sets of clusters being the output of this subsystem; and an output subsystem which includes the means for outputting the results of automatic discovery of network usage patterns.
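The stream-in, stream-out chaining of these subsystems can be illustrated for the input buffer stage. The sketch below assumes a fixed-size buffer counted in pairs and emits a final, partially filled buffer as well. Neither choice is stated in the disclosure, and the sparse `Counter` representation of each matrix is likewise an illustrative assumption.

```python
from collections import Counter
from itertools import islice

def buffered_matrices(pairs, buffer_size):
    """Turn a stream of (Src, Dst) pairs into a stream of sparse count matrices.

    Each yielded Counter maps (Src, Dst) to the number of occurrences
    within one buffer of at most buffer_size pairs.
    """
    iterator = iter(pairs)
    while True:
        chunk = list(islice(iterator, buffer_size))
        if not chunk:
            return
        yield Counter(chunk)

stream = [("a", "x"), ("a", "x"), ("b", "y"), ("a", "y"), ("b", "y")]
matrices = list(buffered_matrices(stream, buffer_size=3))
```

Because the function is a generator, a downstream stage can consume one matrix at a time while packets are still arriving.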

By using the method 100, systems 200, 300 and subsystems 400, 500, 600, 700 as described above, application scenarios and example embodiments as described in the following can be implemented:

A first scenario is the development of a scalable network monitoring system which may be adapted for various levels of granularity of analysis: Host-level analysis tools may be used for automated identification of models characterizing software applications that are running on a specific hardware unit. Network-level analysis tools may be used for automated identification of models characterizing the usage of hardware units present in specific segments of wired or wireless networks.

A second scenario is the development of scalable software and/or hardware tools applicable for automated analysis of traffic streams. These tools may be used for the following set of purposes: Automatic identification of patterns characterizing network behavior of users, automatic profiling of these patterns; Automatic detection of situations which characterize high risk of network attacks; Automatic detection of unauthorized intruders in the network; Automatic detection of the cases characterizing the fraudulent use of hardware or software tools; Automatic detection of situations which characterize high risk of failure in the monitored network segment.

The field of application may be analysis of resource usage patterns both in wired and wireless networks.

In a first embodiment, the purpose is searching for anomalies in network traffic between two specific IP addresses. A typical workflow is as follows: Setting the value of mask SrcName to IPSrc (IP address of the sender); Setting the value of mask DstName to IPDst (IP address of the receiver); Setting the IP addresses of two network nodes to be controlled; Training phase: automatic creation of a statistical model that characterizes the transfer of information between the selected nodes of telecommunication network; Stage of traffic control: the use of the created model for the analysis in order to detect anomalies in traffic between two network nodes.

In a second embodiment, the purpose is monitoring of unauthorized use of network resources by specific software. A typical workflow is as follows: Creation of a testbed for the use in training phase; Setting the value of mask SrcName to IPSrc (IP address of the sender); Setting the value of mask DstName to SrcPort (Port used by sender software); Training phase: automatic creation of a statistical model that characterizes network interaction of controlled software; Stage of traffic control: the use of the created model to detect unauthorized use of network resources by controlled software.
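The mask settings of the two workflows above can be written down as a small configuration sketch. The dictionary layout, the function names and the choice to match traffic between the two nodes in either direction are assumptions for illustration; only the mask values IPSrc, IPDst and SrcPort come from the text.

```python
# Mask settings for the two workflows; the dictionary layout is illustrative.
EMBODIMENT_MASKS = {
    "traffic_between_nodes": {"SrcName": "IPSrc", "DstName": "IPDst"},
    "software_monitoring": {"SrcName": "IPSrc", "DstName": "SrcPort"},
}

def between_nodes(packets, node_a, node_b):
    """Keep only packets exchanged between the two controlled nodes.

    Assumption: both directions (a to b and b to a) are of interest.
    """
    wanted = {(node_a, node_b), (node_b, node_a)}
    return [p for p in packets if (p["IPSrc"], p["IPDst"]) in wanted]

def instances(packets, masks):
    """Extract (Src instance, Dst instance) pairs according to the masks."""
    return [(p[masks["SrcName"]], p[masks["DstName"]]) for p in packets]

packets = [
    {"IPSrc": "10.0.0.1", "IPDst": "10.0.0.9", "SrcPort": 40000},
    {"IPSrc": "10.0.0.9", "IPDst": "10.0.0.1", "SrcPort": 443},
    {"IPSrc": "10.0.0.2", "IPDst": "10.0.0.9", "SrcPort": 40001},
]
controlled = between_nodes(packets, "10.0.0.1", "10.0.0.9")
by_port = instances(packets, EMBODIMENT_MASKS["software_monitoring"])
```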

The methods, systems and devices as described in this disclosure can bring three (or more) kinds of effects as described in the following.

The creation of new means for the control of telecommunications networks, which have a set of features not yet realized. The systems, devices and methods according to the disclosure facilitate the development of a set of network monitoring tools able to automatically extract the models of network resources usage. This extraction procedure may be realized at different levels of the hierarchy of the analyzed telecommunication network. Automatic multi-parameter analysis of the data stream may be realized as a procedure carried out in a real-time parallel processing mode. The reconfigurability of the disclosed technique makes it possible to realize adaptive methods for automatic control and management of telecommunication networks.

Implementation of the methods, systems and devices according to the disclosure facilitates the creation of scalable network monitoring tools. The same set of tools may be used both for host-level analysis and network-level analysis.

Implementation of the methods, systems and devices according to the disclosure facilitates the development of a fundamentally new set of software and hardware tools. The main prospective result of the implementation of such methods, systems and devices is the creation of a fundamentally new class of tools intended for the monitoring of the traffic of wired and wireless networks. The use of adaptive control for network traffic may realize Smart Networks on the basis of the methods, systems and devices described hereinafter.

BRIEF DESCRIPTION OF THE DRAWINGS

Further embodiments of the invention will be described with respect to the following figures, in which:

Fig. 1 shows a schematic diagram illustrating a method 100 for automatic discovery of network usage patterns according to an implementation form;

Fig. 2 shows a schematic diagram illustrating a system 200 for automatic discovery of network usage patterns according to an implementation form;

Fig. 3 shows a flowchart illustrating an exemplary embodiment of a system 300 for automatic discovery of network usage patterns according to an implementation form;

Fig. 4 shows a flow diagram illustrating an example of network traffic stream preprocessing 400 according to an implementation form;

Fig. 5 shows a flow diagram illustrating an example of input buffer processing 500 according to an implementation form;

Fig. 6 shows a flow diagram illustrating an example of statistical data processing 600 according to an implementation form; and

Fig. 7 shows a flow diagram illustrating an example of usage patterns identification 700 according to an implementation form.

DETAILED DESCRIPTION OF EMBODIMENTS

In the following detailed description, reference is made to the accompanying drawings, which form a part thereof, and in which is shown by way of illustration specific aspects in which the disclosure may be practiced. It is understood that other aspects may be utilized and structural or logical changes may be made without departing from the scope of the present disclosure. The following detailed description, therefore, is not to be taken in a limiting sense, and the scope of the present disclosure is defined by the appended claims.

It is understood that comments made in connection with a described method may also hold true for a corresponding device or system configured to perform the method and vice versa. For example, if a specific method step is described, a corresponding device may include a unit to perform the described method step, even if such unit is not explicitly described or illustrated in the figures. Further, it is understood that the features of the various exemplary aspects described herein may be combined with each other, unless specifically noted otherwise.

Fig. 1 shows a schematic diagram illustrating a method 100 for automatic discovery of network usage patterns according to an implementation form.

The method 100 includes parsing 101 an input stream of network packets from a communication network according to a predetermined source network address and a predetermined destination network address. The method 100 includes determining 102 a set of source network address packet instances based on the parsed input stream of network packets with respect to the predetermined source network address and a set of destination network address packet instances based on the parsed input stream of network packets with respect to the predetermined destination network address. The method 100 includes determining 103 a matrix of statistics T(i,j) based on statistical evaluation of the set of source network address packet instances versus the set of destination network address packet instances. The method 100 includes determining 104 a network usage decomposition matrix Z(i,j) based on statistical evaluation of the matrix of statistics T(i,j) with respect to service port information indicated in the parsed input stream of network packets. The method 100 includes determining 105 a set of support groups based on statistical evaluation of the network usage decomposition matrix Z(i,j). The method 100 includes assigning 106 the set of source network address packet instances to a respective support group of the set of support groups based on a distance of the network usage decomposition matrix Z(i,j) with respect to the respective support group.

The method 100 may further include: determining a port activity matrix PAM based on statistical evaluation of the matrix of statistics T(i,j) with respect to the service port information indicated in the parsed input stream of network packets; and determining 104 the network usage decomposition matrix Z(i,j) based on a decomposition of the port activity matrix PAM.

The method 100 may further include: determining 104 the network usage decomposition matrix Z(i,j) based on a singular value decomposition, e.g. a singular value decomposition 603 as described with respect to Fig. 6, of the port activity matrix PAM into a first ancillary matrix U, a decomposed matrix S and a second ancillary matrix V^T.

The method 100 may further include: determining 604 a projection matrix F(i,j) based on a projection of the matrix of statistics T(i,j) by using the second ancillary matrix V^T, e.g. as described below with respect to Fig. 6.

The method 100 may further include: determining 605 a matrix of distances D(i,j), e.g. as described below with respect to Fig. 6, based on a distance of the matrix of statistics T(i,j) with respect to the projection matrix F(i,j).

The distance may be according to the following relation:

where i is the index of source network address packet instance and j is the index of destination network address packet instance. The method 100 may further include: determining 104 the network usage decomposition matrix Z(i,j) based on the matrix of distances D(i,j), e.g. as described with respect to Fig. 6.

The method 100 may further include: determining 104 the network usage decomposition matrix Z(i,j) according to the following relation:

e.g., as described below with respect to Fig. 6.

The determining 105 the set of support groups may include: determining 702 a covariance matrix covZ(i,j) based on the network usage decomposition matrix Z(i,j), e.g. as described below with respect to Fig. 7.

The method 100 may further include: determining a number of support groups in the set of support groups based on a sum of the covariance matrix covZ(i,j) with respect to a number of destination network address packet instances (j).

The method 100 may further include: determining 703 a number of support groups in the set of support groups based on the following relation:

$\mathrm{simCoeff}_i = \sum_j \mathrm{covZ}_{i,j}$

e.g. as described below with respect to Fig. 7, wherein the number of support groups is equal to the number of different values of a support group array simCoeff_i with respect to a number i of source network address packet instances.

The method 100 may further include: assigning 106 the set of source network address packet instances to a respective support group based on a distance of the support group array simCoeff_i with respect to an array of support group markers (Mark_j), e.g. as described below with respect to Fig. 7.

The assigning 106 may be based on the following relation:

$|\mathrm{simCoeff}_i - \mathrm{Mark}_p| < \varepsilon$

wherein Mark is the array of support group markers and ε is the value of a calculation error. The method 100 may further include: determining 707 a center of a respective support group as a vector of mean values of coordinates of the source network address packet instances included in the respective support group, e.g. as described below with respect to Fig. 7; and checking 708 if source network address packet instances assigned to a respective support group have a predetermined distance to the center of the respective support group, e.g. as described below with respect to Fig. 7.

Fig. 2 shows a schematic diagram illustrating a system 200 for automatic discovery of network usage patterns according to an implementation form. The system 200 includes a preprocessing subsystem 201 (e.g., a network packet parser) for parsing an input stream 210 of network packets from a communication network according to a predetermined source network address and a predetermined destination network address. The system 200 includes an input buffer processing subsystem 202 for determining a set of source network address packet instances 212 based on the parsed input stream 211 of network packets with respect to the predetermined source network address and a set of destination network address packet instances based on the parsed input stream of network packets with respect to the predetermined destination network address. The system 200 further includes a statistical data processing subsystem 203 for determining a matrix of statistics T(i,j) based on statistical evaluation of the set of source network address packet instances versus the set of destination network address packet instances; and for determining a network usage decomposition matrix Z(i,j) based on statistical evaluation of the matrix of statistics T(i,j) with respect to service port information indicated in the parsed input stream 211 of network packets. The system 200 further includes a usage patterns identification subsystem 204 for determining a set of support groups based on statistical evaluation of the network usage decomposition matrix Z(i,j) and for assigning 214 the set of source network address packet instances to a respective support group of the set of support groups based on a distance of the network usage decomposition matrix Z(i,j) with respect to the respective support group. The preprocessing subsystem 201 may process step 101 of the method 100 described above with respect to Fig. 1. The input buffer processing subsystem 202 may process the step 102 of the method 100.
The statistical data processing subsystem 203 may process the steps 103 and 104 of the method 100. The usage patterns identification subsystem 204 may process the steps 105, 106 of the method 100.

Fig. 3 shows a flowchart illustrating an exemplary embodiment of a system 300 for automatic discovery of network usage patterns according to an implementation form. The system 300 includes an input subsystem 301, a subsystem of data stream preprocessing 302 that may correspond to the preprocessing subsystem 201 according to Fig. 2, a subsystem of input buffer processing 303 that may correspond to the input buffer processing subsystem 202 according to Fig. 2, a subsystem of statistical data processing 304 that may correspond to the statistical data processing subsystem 203 according to Fig. 2, a subsystem of network usage pattern identification 305 that may correspond to the usage patterns identification subsystem 204 according to Fig. 2 and an output subsystem 306 for outputting data.

The flowchart represented in Figure 3 illustrates an example embodiment of the System of Automatic Discovery of Network Usage Patterns. For this embodiment, the masks of the two network packet fields that are used in the analysis are defined as SrcName and DstName.
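As an illustration only, the role of the two masks may be pictured as selecting header fields. The following minimal sketch assumes that a packet header is available as a mapping of field names to values and that each mask is a list of field names. The field names `src_ip`, `dst_ip` and `dst_port`, the separator `|` and the function `parse_packet` are hypothetical and are not taken from the disclosure.

```python
from typing import Mapping, Sequence, Tuple


def parse_packet(
    header: Mapping[str, str],
    src_mask: Sequence[str],
    dst_mask: Sequence[str],
) -> Tuple[str, str]:
    """Return the (SrcName, DstName) instances selected by the two masks.

    Assumption: a mask is simply a list of header field names whose values
    are concatenated to form the instance name.
    """
    src_instance = "|".join(str(header[field]) for field in src_mask)
    dst_instance = "|".join(str(header[field]) for field in dst_mask)
    return src_instance, dst_instance


# Hypothetical header and masks, chosen only for the example.
header = {"src_ip": "10.0.0.5", "dst_ip": "192.0.2.7", "dst_port": "443"}
src_name, dst_name = parse_packet(header, ["src_ip"], ["dst_ip", "dst_port"])
```

Including the port in the hypothetical DstName mask is one possible way to carry service port information into the later stages; other mask definitions are equally conceivable.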

Fig. 4 shows a flow diagram illustrating an example of network traffic stream preprocessing 400 according to an implementation form, i.e. an example of the subsystem 302 shown in Fig. 3.

After a beginning block 401, a second block 402 checks whether there is an input packet to be processed. If the answer is yes, a third block 403 is processed to perform the parsing of the input packet and a fourth block 404 is processed to output the results of packet parsing. Then a jump to the second block 402 is performed. If the answer of the second block 402 is no, a fifth block 405 is performed for checking whether a variable "FlagExit" equals 1. If the answer is yes, the flow diagram goes to an exit block 406; if the answer is no, a jump to the second block 402 is performed.

The diagram represented in Figure 4 illustrates the example of network traffic stream preprocessing. In the represented embodiment, the input data stream comes to the System from the Input Subsystem. In the example embodiment this subsystem is used as an interface between the analyzed network and the System of Automatic Identification of Network Usage Patterns. The main purpose of data preprocessing is to prepare the data which will be used in the automatic analysis of network usage. Block 405 illustrates the check of the state of the FlagExit flag. This flag is handled by an external procedure to stop the analysis of the network usage. At the stage of data stream preprocessing the input packet is parsed (block 403). At this stage the header of the network packet is subdivided into a set of separate fields. The pre-defined masks SrcName and DstName determine the logic of the parsing. The output of the parsing procedure includes two instances: the instance of the field SrcName and the instance of the field DstName which are represented in the header of the network packet.

Fig. 5 shows a flow diagram illustrating an example of input buffer processing 500 according to an implementation form, i.e. an example of the subsystem 303 shown in Fig. 3.

After a beginning block 501, in a second block 502 the buffer of Src and Dst instances and the value of W are updated, and in a third block 503 it is checked whether W is smaller than WIDTH. If the answer is yes, the flow diagram reaches its end block 508; otherwise a fourth block 504 is processed to calculate arrays of unique Src and Dst instances. Then a fifth block 505 is processed to calculate the matrix of statistics Src versus Dst, T(i,j); a sixth block 506 is processed to clear the buffer of Src and Dst instances and to set W=0; and a seventh block 507 is processed to output the results, after which the flow diagram reaches the end block 508.
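The buffer logic of blocks 502 to 507 may be sketched as follows. This is a minimal sketch under stated assumptions: the statistic stored in T(i,j) is assumed to be a simple count of packets from the i-th unique Src instance to the j-th unique Dst instance, W is assumed to be the number of buffered pairs, and the class name `InputBuffer` is hypothetical. The disclosure does not fix any of these choices.

```python
import numpy as np


class InputBuffer:
    """Accumulates parsed (Src, Dst) pairs for one time window."""

    def __init__(self, width):
        self.width = width  # WIDTH: number of pairs per window (assumed meaning)
        self.pairs = []     # buffer of Src and Dst instances; W == len(self.pairs)

    def update(self, src, dst):
        """Add one pair; return None while W < WIDTH, else the window outputs."""
        self.pairs.append((src, dst))               # block 502
        if len(self.pairs) < self.width:            # block 503
            return None
        srcs, dsts = zip(*self.pairs)
        # Block 504: arrays of unique instances plus the index of each pair.
        src_names, i = np.unique(srcs, return_inverse=True)
        dst_names, j = np.unique(dsts, return_inverse=True)
        # Block 505: T(i,j) as packet counts (an assumption, see lead-in).
        t = np.zeros((len(src_names), len(dst_names)), dtype=int)
        np.add.at(t, (i, j), 1)
        self.pairs.clear()                          # block 506: clear, W = 0
        return src_names, dst_names, t              # block 507


buf = InputBuffer(width=4)
results = [buf.update(s, d) for s, d in
           [("a", "x"), ("b", "y"), ("a", "x"), ("a", "y")]]
src_names, dst_names, t = results[-1]
```

`np.add.at` is used instead of `t[i, j] += 1` because the latter would count a repeated (i, j) pair only once.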

The diagram represented in Figure 5 illustrates the example of input buffer processing. In the represented embodiment the Subsystem of Input Buffer Processing is used for the following set of actions: the accumulation of the data of the input buffer; the calculation of the output arrays of unique Src and Dst instances; and the calculation of the matrix of statistics Src vs Dst. The output of the workflow of this subsystem includes the arrays of unique names and the matrix of statistics T.

Fig. 6 shows a flow diagram illustrating an example of statistical data processing 600 according to an implementation form, i.e. an example of the subsystem 304 shown in Fig. 3. In the statistical data processing 600, the following eight blocks 601, 602, 603, 604, 605, 606, 607, 608 are sequentially processed. After a beginning block 601, a second block 602 is processed to calculate the Port Activity Matrix (PAM) on the basis of the matrix of statistics T(i,j). A third block 603 is processed to calculate the singular value decomposition for PAM according to PAM = U S V^T. A fourth block 604 is processed to make the projection of the input matrix of statistics according to F = T V. A fifth block 605 is processed to calculate the matrix of distances from the current usage models to the set of basic usage models: D(i,j). A sixth block 606 is processed to calculate the matrix of usage decompositions Z(i,j) on the basis of the matrix of distances D(i,j). A seventh block 607 is processed to output the results and an eighth block 608 indicates the end.
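Blocks 603 and 604 may be sketched with NumPy as follows. The sketch covers only the decomposition PAM = U S V^T and the projection F = T V. The PAM used here is a row-normalized copy of T that serves purely as a hypothetical stand-in so that the example runs; it is not the PAM calculation referred to in the disclosure, and the function name `project_statistics` is likewise an assumption.

```python
import numpy as np


def project_statistics(t, pam):
    """Decompose pam and project t onto its right singular vectors."""
    # Block 603: singular value decomposition, pam == u @ diag(s) @ vt.
    u, s, vt = np.linalg.svd(pam, full_matrices=False)
    # Block 604: F = T V, where V is the transpose of vt.
    f = t @ vt.T
    return u, s, vt, f


# Small matrix of statistics (rows: Src instances, columns: Dst instances).
t = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 3.0],
              [4.0, 0.0, 1.0]])
# Hypothetical stand-in for the PAM: each row of t scaled to sum to 1.
pam = t / t.sum(axis=1, keepdims=True)

u, s, vt, f = project_statistics(t, pam)
```

The projection requires the PAM to have as many columns as T, which holds trivially for the stand-in used here.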

The diagram represented in Figure 6 illustrates an example of statistical data processing. This stage of data processing is performed by the Subsystem of statistical data processing. In the represented embodiment this subsystem is intended for the calculation of the network usage decomposition matrix. This calculation is performed on the basis of the matrix of statistics T. The Port Activity Matrix (PAM) may be calculated on the basis of a numerical method, e.g. as proposed in the paper "E. Sharafuddin, Y. Jin, N. Jiang, Z. Zhang. Know Your Enemy, Know Yourself: Block-Level Network Behavior Profiling and Tracking // Global Telecommunications Conference (GLOBECOM 2010), IEEE, 2010". The network usage decomposition matrix (block 606) may be calculated on the basis of the following expression: Z_{i,j} = 1 - …

Fig. 7 shows a flow diagram illustrating an example of usage patterns identification 700 according to an implementation form, i.e. an example of the subsystem 305 shown in Fig. 3. In the usage patterns identification 700, the following ten blocks 701, 702, 703, 704, 705, 706, 707, 708, 709, 710 are sequentially processed. After a beginning block 701, a second block 702 is processed to calculate the matrix of covariance covZ(i,j) on the basis of the matrix Z(i,j). A third block 703 is processed to calculate the values of the components for the vector of similarity coefficients simCoeff(i). A fourth block 704 is processed to define the values of the components for the vector of markers of support groups. A fifth block 705 is processed to make the formation of the support groups on the basis of group markers. A sixth block 706 is processed to make the assignment of the Src instances to support groups. A seventh block 707 is processed to calculate the coordinates of the center of each group. An eighth block 708 is processed to check the relevance of the assignment of each instance into the definite group, and to move each excluded Src instance from the support group into the separate secondary group. A ninth block 709 is processed to make the output of the results of clustering and to clear all data arrays responsible for the processing of time window data. A tenth block 710 indicates the end.

The diagram represented in Figure 7 illustrates an example of usage patterns identification. In the embodiment of Fig. 7, the Subsystem of usage patterns identification 700 makes the final calculation of clusters of Src instances in several steps. This calculation may be done on the basis of the previously calculated matrix of usage decomposition. Components of the vector of similarity coefficients may be calculated on the basis of the following expression:

$\mathrm{simCoeff}_i = \sum_j \mathrm{covZ}_{i,j}$

The number of support groups may be equal to the number of different values of simCoeff. Assignment of Src instances into the corresponding group may be based on the following rule. If the condition

$|\mathrm{simCoeff}_i - \mathrm{Mark}_p| < \varepsilon$

is true then the i-th Src instance is included into the p-th support group. Here: Mark is the array of support group markers; ε is the value of calculation error. Coordinates of the center of each group may be calculated as a vector of mean values of coordinates of Src instances included into that group. The check of the relevance of the assignment of Src instances into the definite group may be based on the following rule. If the condition

$\sum_j (\mathrm{covZ}_{i,j} - C_j)^2 < \delta$

is false then the i-th instance may be excluded from the p-th support group. Here: C is the vector of coordinates of the center of a definite support group; δ is a predefined value of the accuracy.

In the represented embodiment, the output block 709 or the Output subsystem 306 according to Fig. 3 may produce the output of results of the clustering of Src instances. The output block 709 may produce the output of the set of data structures UGD. Each element of UGD includes information about instances of Src associated with some definite group. This information includes the index of the cluster; the value of the marker of the current group of Src instances; the set of indexes of Src instances associated with this cluster; and the usage model characterizing the current group of instances. The output block 709 may further produce the output of the set of the unique names of Src instances presented in the current time window. The output block 709 may further produce the output of the set of the unique names of Dst instances presented in the current time window.
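The grouping performed in blocks 703 to 708 may be sketched as follows. This is a minimal sketch, not a definitive implementation. It assumes that the matrix covZ is already available with one row per Src instance, that the rows of covZ serve as the coordinates of the instances, that one marker is created per distinct value of simCoeff (up to the calculation error), and that an instance stays in its group while its squared distance to the group center is below the accuracy value. The function name `identify_support_groups` and the parameter names `eps` and `delta` are hypothetical.

```python
import numpy as np


def identify_support_groups(cov_z, eps=1e-9, delta=1.0):
    """Return (markers, groups, secondary) for the rows of cov_z."""
    cov_z = np.asarray(cov_z, dtype=float)
    # Block 703: simCoeff_i is the sum of row i of covZ.
    sim_coeff = cov_z.sum(axis=1)

    # Block 704 (assumed rule): one marker per distinct simCoeff value.
    markers = []
    for value in sim_coeff:
        if not any(abs(value - mark) < eps for mark in markers):
            markers.append(float(value))

    # Blocks 705-706: instance i joins group p if |simCoeff_i - Mark_p| < eps.
    groups = [[] for _ in markers]
    for i, value in enumerate(sim_coeff):
        for p, mark in enumerate(markers):
            if abs(value - mark) < eps:
                groups[p].append(i)
                break

    # Blocks 707-708: center is the mean row of the group; instances whose
    # squared distance to the center is not below delta move to the
    # secondary group.
    secondary = []
    for p, members in enumerate(groups):
        center = cov_z[members].mean(axis=0)
        kept = [i for i in members
                if np.sum((cov_z[i] - center) ** 2) < delta]
        secondary.extend(i for i in members if i not in kept)
        groups[p] = kept
    return markers, groups, secondary


# Hypothetical covZ with four Src instances and two columns.
cov_z = [[1.0, 2.0], [2.0, 1.0], [5.0, 5.0], [1.5, 1.5]]
markers, groups, secondary = identify_support_groups(cov_z, delta=1.0)
```

With the example matrix, instances 0, 1 and 3 share the similarity coefficient 3 and instance 2 has the coefficient 10, so two support groups are formed. Lowering `delta` to 0.4 moves instances 0 and 1 into the secondary group, since their squared distance to the center (1.5, 1.5) is 0.5.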

The present disclosure also supports a computer program product including computer executable code or computer executable instructions that, when executed, causes at least one computer to execute the performing and computing steps described herein, in particular the methods 100, 300, 400, 500, 600 as described above with respect to Figs. 1 and 3-6 or the system 200 described above with respect to Fig. 2. Such a computer program product may include a readable storage medium storing program code thereon for use by a computer. The program code may perform the method 100, 300, 400, 500, 600 as described above with respect to Figs. 1 and 3-6 or the system 200 described above with respect to Fig. 2.

While a particular feature or aspect of the disclosure may have been disclosed with respect to only one of several implementations, such feature or aspect may be combined with one or more other features or aspects of the other implementations as may be desired and advantageous for any given or particular application. Furthermore, to the extent that the terms "include", "have", "with", or other variants thereof are used in either the detailed description or the claims, such terms are intended to be inclusive in a manner similar to the term "comprise". Also, the terms "exemplary", "for example" and "e.g." are merely meant as an example, rather than the best or optimal. The terms "coupled" and "connected", along with derivatives may have been used. It should be understood that these terms may have been used to indicate that two elements cooperate or interact with each other regardless whether they are in direct physical or electrical contact, or they are not in direct contact with each other. Although specific aspects have been illustrated and described herein, it will be appreciated by those of ordinary skill in the art that a variety of alternate and/or equivalent implementations may be substituted for the specific aspects shown and described without departing from the scope of the present disclosure. This application is intended to cover any adaptations or variations of the specific aspects discussed herein.

Although the elements in the following claims are recited in a particular sequence with corresponding labeling, unless the claim recitations otherwise imply a particular sequence for implementing some or all of those elements, those elements are not necessarily intended to be limited to being implemented in that particular sequence.

Many alternatives, modifications, and variations will be apparent to those skilled in the art in light of the above teachings. Of course, those skilled in the art readily recognize that there are numerous applications of the invention beyond those described herein. While the present invention has been described with reference to one or more particular embodiments, those skilled in the art recognize that many changes may be made thereto without departing from the scope of the present invention. It is therefore to be understood that within the scope of the appended claims and their equivalents, the invention may be practiced otherwise than as specifically described herein.

Claims

CLAIMS:

1. A method (100) for automatic discovery of network usage patterns, the method comprising: parsing (101) an input stream of network packets from a communication network according to a predetermined source network address and a predetermined destination network address; determining (102) a set of source network address packet instances based on the parsed input stream of network packets with respect to the predetermined source network address and a set of destination network address packet instances based on the parsed input stream of network packets with respect to the predetermined destination network address; determining (103) a matrix of statistics (T(i,j)) based on statistical evaluation of the set of source network address packet instances versus the set of destination network address packet instances; determining (104) a network usage decomposition matrix (Z(i,j)) based on statistical evaluation of the matrix of statistics (T(i,j)) with respect to service port information indicated in the parsed input stream of network packets; determining (105) a set of support groups based on statistical evaluation of the network usage decomposition matrix (Z(i,j)); and assigning (106) the set of source network address packet instances to a respective support group of the set of support groups based on a distance of the network usage decomposition matrix (Z(i,j)) with respect to the respective support group.

2. The method (100) of claim 1, comprising: determining (602) a port activity matrix (PAM) based on statistical evaluation of the matrix of statistics (T(i,j)) with respect to the service port information indicated in the parsed input stream of network packets; and determining (104) the network usage decomposition matrix (Z(i,j)) based on a decomposition of the port activity matrix (PAM).

3. The method (100) of claim 2, comprising: determining (104) the network usage decomposition matrix (Z(i,j)) based on a singular value decomposition (603) of the port activity matrix (PAM) into a first ancillary matrix U, a decomposed matrix S and a second ancillary matrix V^T.

4. The method (100) of claim 3, comprising: determining (604) a projection matrix (F(i,j)) based on a projection of the matrix of statistics (T(i,j)) by using the second ancillary matrix V^T.

5. The method (100) of claim 4, comprising: determining (605) a matrix of distances (D(i,j)) based on a distance of the matrix of statistics (T(i,j)) with respect to the projection matrix (F(i,j)).

6. The method (100) of claim 5, wherein the distance is according to the following relation:

where i is the index of source network address packet instance and j is the index of destination network address packet instance.

7. The method (100) of claim 5 or 6, comprising: determining (104) the network usage decomposition matrix (Z(i,j)) based on the matrix of distances (D(i,j)).

8. The method (100) of claim 7, determining (104) the network usage decomposition matrix (Z(i,j)) according to the following relation:

9. The method (100) of one of the preceding claims, wherein determining (105) the set of support groups comprises: determining (702) a covariance matrix (covZ(i,j)) based on the network usage decomposition matrix (Z(i,j)).

10. The method (100) of claim 9, comprising: determining a number of support groups in the set of support groups based on a sum of the covariance matrix (covZ(i,j)) with respect to a number of destination network address packet instances (j).

11. The method (100) of claim 9 or 10, comprising: determining (703) a number of support groups in the set of support groups based on the following relation:

$\mathrm{simCoeff}_i = \sum_j \mathrm{covZ}_{i,j}$

wherein the number of support groups is equal to the number of different values of a support group array simCoeff_i with respect to a number i of source network address packet instances.

12. The method (100) of claim 11, comprising assigning (106) the set of source network address packet instances to a respective support group based on a distance of the support group array simCoeff_i with respect to an array of support group markers (Mark_j).

13. The method (100) of claim 12, wherein the assigning (106) is based on the following relation:

$|\mathrm{simCoeff}_i - \mathrm{Mark}_p| < \varepsilon$

wherein Mark is the array of support group markers and ε is the value of a calculation error.

14. The method (100) of one of the preceding claims, comprising: determining (707) a center of a respective support group as a vector of mean values of coordinates of the source network address packet instances included in the respective support group; and checking (708) if source network address packet instances assigned to a respective support group have a predetermined distance to the center of the respective support group.

15. A system (200) for automatic discovery of network usage patterns, the system comprising: a preprocessing subsystem (201) for parsing an input stream (210) of network packets from a communication network according to a predetermined source network address and a predetermined destination network address; an input buffer processing subsystem (202) for determining a set of source network address packet instances (212) based on the parsed input stream (211) of network packets with respect to the predetermined source network address and a set of destination network address packet instances based on the parsed input stream of network packets with respect to the predetermined destination network address; a statistical data processing subsystem (203) for determining a matrix of statistics (T(i,j)) based on statistical evaluation of the set of source network address packet instances versus the set of destination network address packet instances; and for determining a network usage decomposition matrix (Z(i,j)) based on statistical evaluation of the matrix of statistics (T(i,j)) with respect to service port information indicated in the parsed input stream (211) of network packets; and a usage patterns identification subsystem (204) for determining a set of support groups based on statistical evaluation of the network usage decomposition matrix (Z(i,j)); and for assigning (214) the set of source network address packet instances to a respective support group of the set of support groups based on a distance of the network usage decomposition matrix (Z(i,j)) with respect to the respective support group.