CN115604090B

CN115604090B - Network anomaly root cause positioning method and system based on overfitting

Info

Publication number: CN115604090B
Application number: CN202211132605.7A
Authority: CN
Inventors: 朱文进
Original assignee: China Telecom Digital Intelligence Technology Co Ltd
Current assignee: China Telecom Digital Intelligence Technology Co Ltd
Priority date: 2022-09-17
Filing date: 2022-09-17
Publication date: 2025-05-23
Anticipated expiration: 2042-09-17
Also published as: CN115604090A

Abstract

The invention relates to a network anomaly root cause positioning method and system based on overfitting, and belongs to the technical field of network operation and maintenance. The method comprises: collecting network node data between routers; generating a route service identifier and a network node service identifier; testing network delay between the routers and taking the time at which the network delay is abnormal as the abnormal time; importing the associated route data and the associated network node data collected at the abnormal time into a fault source positioning analysis model for analysis and determining an overfitting value; and sorting the service data according to the overfitting value to locate the fault source. The method can rapidly locate the source of a network anomaly even when information about the network switches traversed between route hops cannot be obtained, a situation that may cause hidden network delay, and thereby improves the efficiency of network operation and maintenance.

Description

Network anomaly root cause positioning method and system based on overfitting

Technical Field

The invention belongs to the technical field of network operation and maintenance, and particularly relates to a network anomaly root cause positioning method and system based on overfitting.

Background

With the steady advance of digitalization, the number of online devices in local area networks worldwide keeps growing and is now 10 to 100 times what it was ten years ago. Even though operation and maintenance has evolved from manual work to tool-based and then platform-based operation and maintenance, it still cannot meet the monitoring requirements of today's ultra-large local area networks. At this scale, monitoring network equipment by manual experience and automated operation and maintenance has become a technical bottleneck that restricts operation and maintenance work. In the prior art, information about the network switches traversed between route hops cannot be acquired during route tracing, so the network delay those switches may cause is difficult to detect. Therefore, a more intelligent and efficient optimization method based on TR069 protocol monitoring is introduced to improve network operation and maintenance monitoring capability.

In the prior art, machine room operation and maintenance also suffers from large service scale, complex application relationships, many dependency layers and difficult troubleshooting. Because information about the network switches traversed between route hops cannot be acquired during route tracing, the network delay between routes is ignored, which may give rise to hidden network delay caused by the switches.

Disclosure of Invention

The invention aims to overcome the defects and shortcomings of the prior art, and provides a network anomaly root cause positioning method and system based on overfitting, which can rapidly locate the root cause of a network anomaly even when information about the network switches traversed between route hops cannot be acquired, a situation that may cause hidden network delay, thereby improving the efficiency of network operation and maintenance.

According to one aspect of the invention, the invention provides a network anomaly root cause positioning method based on overfitting, which comprises the following steps:

S1, collecting network node data between routers, analyzing associated service identifiers through logs, generating route service identifiers and network node service identifiers, and creating an initial data pool to store data from the routers and the network nodes;

S2, testing network delay between routers, taking the time at which the network delay is abnormal as the abnormal time, importing the associated route data and the associated network node data acquired at the abnormal time into a fault source positioning analysis model for analysis, and determining the abnormal data corresponding to the overfitting value;

S3, matching and classifying the IP address of the abnormal data corresponding to the overfitting value against the route service identifier and the network node service identifier to obtain the overfitting value service data ranking for the abnormal time, and positioning the fault source according to the overfitting value service data ranking.

Preferably, the generating the route service identifier and the network node service identifier includes:

the formats of the route service identifier and the network node service identifier are as follows:

Route service identifier: router A##service id1, router A##service id2

Network node service identifier: network node#service id1,service id2

Multiple services are separated by commas, and multiple routes or network nodes are separated by the # sign.
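The identifier format above can be illustrated with a small parser. This is only a sketch under an assumed reading of the format: the last `#`-separated field is taken to be the comma-separated service list, and every non-empty field before it is taken to be a router or network node name. The names `ServiceIdentifier` and `parse_identifier` are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ServiceIdentifier:
    """Parsed form of a route or network node service identifier."""
    names: List[str]     # routers or network nodes, in the order given
    services: List[str]  # service ids attached to them


def parse_identifier(text: str) -> ServiceIdentifier:
    """Split an identifier on '#' (routes/nodes) and ',' (services).

    Assumption: the final '#' field holds the services; empty fields
    (as produced by a doubled '##') are ignored.
    """
    fields = [field.strip() for field in text.split("#")]
    if len(fields) < 2:
        raise ValueError(f"no '#' separator found in {text!r}")
    names = [field for field in fields[:-1] if field]
    services = [s.strip() for s in fields[-1].split(",") if s.strip()]
    return ServiceIdentifier(names=names, services=services)


node = parse_identifier("node1#id1,id2")
route = parse_identifier("routerA##id1")
print(node)
print(route)
```

Under this reading, a doubled `##` simply yields one router name followed by its services.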

Service classification is performed according to the service identifiers associated with the network node IPs corresponding to the collected data, the data is divided according to service weights, and the route service identifiers and network node service identifiers are generated; a thread pool load index is calculated, the thread pool occupancy rate is analyzed, and threads are scheduled according to the thread pool occupancy rate, wherein the thread pool load index is as follows:

Wherein N is the number of working threads running in the thread pool, N_max is the configured maximum number of threads, T_cur is the number of tasks in the current acquisition time window, T_pre is the number of tasks in the previous acquisition time window, Q is the size of the task buffer queue, and ξ₁, ξ₂ and ξ₃ are weight coefficients.

Preferably, the creating the initial data pool to store data from the router and the network node includes:

The data source type is analyzed by means of the route service identifier and the network node service identifier, and a text data pool, an analog signal data pool and an application data pool are created.

Preferably, the fault source location analysis model is:

||Xθ-y||²+||Γθ||²

θ(a)=(X^TX+aI)^-1X^Ty

Here a regularization term is added to the computation, wherein X represents the input, y represents the output prediction result, ||·|| represents the norm operation, I represents the identity matrix, θ is the fitting parameter, Γ is the regularization weight, a is the weight applied to the identity matrix, and θ(a) represents the value of θ for a given a.

According to another aspect of the present invention, there is also provided a network anomaly root-cause positioning system based on overfitting, the system comprising:

The generation module is used for collecting network node data between routers, analyzing the associated service identifiers through logs, generating a route service identifier and a network node service identifier, and creating an initial data pool to store data from the routers and the network nodes;

The determining module is used for testing network delay between routers, taking the time at which the network delay is abnormal as the abnormal time, importing the associated route data and the associated network node data acquired at the abnormal time into the fault source positioning analysis model for analysis, and determining the abnormal data corresponding to the overfitting value;

The positioning module is used for matching and classifying the IP address of the abnormal data corresponding to the overfitting value against the route service identifier and the network node service identifier, obtaining the overfitting value service data ranking for the abnormal time, and positioning the fault source according to the overfitting value service data ranking.

Preferably, the generating module generating the route service identifier and the network node service identifier includes:

Network node service identifier: network node#service id1,service id2

Preferably, the creating the initial data pool by the generating module to store data from the router and the network node includes:

Preferably, the fault source location analysis model is:

||Xθ-y||²+||Γθ||²

θ(a)=(X^TX+aI)^-1X^Ty

The beneficial effects of the invention are as follows: the method can effectively uncover the hidden network delay that may result from missing network switch information when route tracing is performed with commands such as Traceroute and ping; at the same time, the network topology graph gives a more intuitive view of the network delay and packet loss of the switches and network nodes between routers, so that the network topology graph more closely reflects actual conditions.

Features and advantages of the present invention will become apparent by reference to the following drawings and detailed description of embodiments of the invention.

Drawings

FIG. 1 is a flow chart of a network anomaly root cause positioning method based on overfitting;

FIG. 2 is a schematic diagram of a network anomaly root-cause positioning system based on overfitting.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Example 1

FIG. 1 is a flow chart of a network anomaly root cause positioning method based on overfitting. As shown in fig. 1, the invention provides a network anomaly root-cause positioning method based on overfitting, which comprises the following steps:

S1, collecting network node data between routers, analyzing associated service identifiers through logs, generating route service identifiers and network node service identifiers, and creating an initial data pool to store data from the routers and the network nodes.

Specifically, the network delay between routers is tested with the Traceroute command (tracert on Windows systems), which uses the ICMP protocol to locate all routers between the local computer and the target computer. The TTL value reflects the number of routers or gateways a packet has passed through; by manipulating the TTL value of individual ICMP echo packets and observing the messages returned when those packets are discarded, the Traceroute command can enumerate every router on the packet transmission path.
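The TTL mechanism described above can be sketched as a pure simulation, with no real packets sent. The path, the addresses (taken from documentation and private ranges) and the function names `probe` and `trace` are illustrative assumptions, not part of the patent.

```python
from typing import List, Tuple


def probe(path: List[str], ttl: int) -> Tuple[str, str]:
    """Simulate one probe sent with the given TTL along `path`.

    The last entry of `path` is the destination. Each router decrements
    the TTL; the router at which it reaches zero discards the packet and
    answers "time-exceeded". The destination answers "echo-reply".
    """
    if not path:
        raise ValueError("path must contain at least the destination")
    if ttl < 1:
        raise ValueError("ttl must be at least 1")
    for hop_index, router in enumerate(path, start=1):
        if hop_index == len(path):
            return router, "echo-reply"      # probe reached the target
        if hop_index == ttl:
            return router, "time-exceeded"   # TTL ran out at this router
    raise AssertionError("unreachable")


def trace(path: List[str], max_hops: int = 30) -> List[str]:
    """Discover the routers on `path` by sending probes with TTL 1, 2, 3..."""
    discovered = []
    for ttl in range(1, max_hops + 1):
        router, reply = probe(path, ttl)
        discovered.append(router)
        if reply == "echo-reply":
            break
    return discovered


example_path = ["10.0.0.1", "10.0.1.1", "10.0.2.1", "192.0.2.10"]
print(trace(example_path))
```

Note that layer-2 switches do not decrement the TTL, so they never appear in such a trace; this is the blind spot the method targets.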

Network node service identifier: network node#service id1,service id2

Specifically, service classification is performed according to the service IDs associated with the network node IPs corresponding to the collected data, the data is divided according to service weights, and service identifiers are generated. Different thread pools are created for different data sources, the occupancy rate of each thread pool is analyzed algorithmically, and idle thread pools are scheduled first so that collected data with a large service weight is stored with priority.

The thread pool load degree ω is calculated. The load degree is derived from data such as the number of working threads running in the thread pool, the maximum number of threads and the size of the task buffer queue, and is obtained as a percentage value by combining these quantities with different weight proportions.

In the formula, the three weighted terms describe, respectively, the saturation of the worker threads, the saturation of the current tasks, and the growth rate of the task buffer queue. The load degree ω is compared with the preset thread pool load degree ω'; if ω is larger than ω', the adaptive parameter adjustment calculation is triggered, and otherwise the current acquisition time window is skipped. Then the current thread pool occupancy rate is acquired and analyzed; a thread pool whose occupancy rate is lower than 50% is regarded as idle.
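The load formula itself is not reproduced in this text, so the following is only a sketch of one possible form, assuming a weighted sum of three ratios: N / N_max for worker saturation, T_cur / Q for task saturation, and (T_cur - T_pre) / T_pre for queue growth. These three ratios, the clamping to [0, 1], the default weights and all function names are assumptions made for illustration; only the symbols, the comparison with a preset load degree and the 50% idle threshold come from the description.

```python
from typing import Tuple


def thread_pool_load(n: int, n_max: int, t_cur: int, t_pre: int, q: int,
                     weights: Tuple[float, float, float] = (0.5, 0.3, 0.2)) -> float:
    """Return an assumed load degree in [0, 1] for one acquisition window."""
    xi1, xi2, xi3 = weights                       # hypothetical weight values
    worker_saturation = n / n_max                 # assumed form: N / N_max
    task_saturation = min(t_cur / q, 1.0)         # assumed form: T_cur / Q
    growth = max(t_cur - t_pre, 0) / max(t_pre, 1)
    queue_growth = min(growth, 1.0)               # assumed form, clamped
    return xi1 * worker_saturation + xi2 * task_saturation + xi3 * queue_growth


def should_adjust(load: float, preset_load: float) -> bool:
    """Trigger adaptive parameter adjustment only above the preset load."""
    return load > preset_load


def is_idle(occupancy: float) -> bool:
    """A pool with occupancy below 50% is treated as idle."""
    return occupancy < 0.5


load = thread_pool_load(n=8, n_max=16, t_cur=60, t_pre=40, q=100)
print(f"load degree: {load:.2f}, adjust: {should_adjust(load, 0.5)}")
```

A scheduler built on this would skip the current acquisition window whenever `should_adjust` returns False, and would prefer pools for which `is_idle` is True when storing high-weight data.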

Specifically, an initial data pool is created and data from the routes and network nodes is stored in order; the data source type is analyzed by means of the route service identifier and the network node service identifier, and a text data pool, an analog signal data pool and an application data pool are created.

Text data pool: data collected by the TR069 protocol is transmitted in XML file format, so a text data pool is created to store it.

Analog signal data pool: the data returned by the Traceroute command is of the acquisition type and is stored in the analog signal data pool.

Application data pool: the large volume of numeric data collected by the TR069 protocol and marked with a service ID is stored in the application data pool.

Compared with a database, a data pool can integrate data sources with different data structures in a uniform way. In addition, because three separate pools (text type, application type and acquisition type) are set up according to the data characteristics of the different data sources and the data is stored pool by pool, the efficiency of mass data storage is improved.
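The routing of collected records into the three pools can be sketched as follows. The record layout (a dict with `source`, `service_id` and `value` keys) and the function names are hypothetical; only the three pool types and the mapping from data source to pool follow the description.

```python
from collections import defaultdict
from typing import Dict, List

TEXT_POOL = "text"
ANALOG_POOL = "analog_signal"
APPLICATION_POOL = "application"


def choose_pool(record: dict) -> str:
    """Pick the pool for one record based on its (assumed) fields."""
    source = record.get("source")
    if source == "traceroute":
        # Traceroute results are acquisition-type data.
        return ANALOG_POOL
    if source == "tr069":
        has_service = record.get("service_id") is not None
        is_numeric = isinstance(record.get("value"), (int, float))
        if has_service and is_numeric:
            # Numeric values tagged with a service ID.
            return APPLICATION_POOL
        # Remaining TR069 payloads are XML documents, stored as text.
        return TEXT_POOL
    raise ValueError(f"unknown data source: {source!r}")


def store(records: List[dict]) -> Dict[str, List[dict]]:
    """Distribute records over the pools and return the filled pools."""
    pools: Dict[str, List[dict]] = defaultdict(list)
    for record in records:
        pools[choose_pool(record)].append(record)
    return dict(pools)


sample = [
    {"source": "tr069", "value": "<Device>...</Device>"},
    {"source": "traceroute", "value": 12.4},
    {"source": "tr069", "service_id": "id1", "value": 37},
]
pools = store(sample)
print({name: len(items) for name, items in pools.items()})
```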

S2, testing network delay between routers, taking the time at which the network delay is abnormal as the abnormal time, and importing the associated route data and the associated network node data acquired at the abnormal time into a fault source positioning analysis model for analysis to determine the abnormal data corresponding to the overfitting value.

Specifically, network node delay data between routers is collected through the TR069 protocol, and the associated service IDs are analyzed through logs to generate service identifiers. TR069, in full "Technical Report 069", is a technical specification developed by the DSL Forum (a non-profit worldwide industry alliance devoted to developing broadband network standards, whose members include leading vendors and service providers in the communications, equipment, computer and network industries; it has since been renamed the "Broadband Forum"). It is an application layer management protocol named the "CPE WAN Management Protocol". TR069 defines a new network management architecture, including a management model, interaction interfaces and basic management parameters, which enables effective management of home network equipment. In TR-069, the network management server is called the ACS (Auto Configuration Server) and has a dedicated IP address and URL. The managed device obtains the URL of the ACS from a DHCP server and, once it has the network management IP, starts to establish an HTTP session according to the URL of the ACS. After the session is established, an initialization step performs authentication so that the ACS can verify the legitimacy of the managed device. After the initialization is completed, the network management server can acquire various monitoring information from the CPE.

Preferably, the fault source location analysis model is:

||Xθ-y||²+||Γθ||²

θ(a)=(X^TX+aI)^-1X^Ty

The least squares method commonly used in regression analysis is an unbiased estimator. For a well-posed problem Xθ=y, X typically has full column rank.

Using the least squares method, the loss function is defined as the square of the residual and is minimized:

||Xθ-y||².

The optimization problem can be solved by gradient descent, or solved directly with the following formula:

θ=(X^TX)^-1X^Ty,

When X does not have full column rank, or when some columns are strongly linearly correlated, the determinant of X^TX is close to 0, i.e., X^TX is close to singular. The problem then becomes ill-posed: the error in computing (X^TX)^-1 is large, and the conventional least squares method lacks stability and reliability.

To solve the above problem, the ill-posed problem needs to be transformed into a well-posed problem. A regularization term is added to the above loss function, which becomes:

||Xθ-y||²+||Γθ||²

Where Γ=√a·I is defined, so that Γ^TΓ=aI, then:

θ(a)=(X^TX+aI)^-1X^Ty

In the above formula, I is an identity matrix.
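The closed form θ(a)=(X^TX+aI)^-1X^Ty can be sketched in plain Python as follows. The helper names `solve` and `ridge`, the toy data and the value a=0.1 are illustrative assumptions. The example uses two identical columns, so X^TX is singular and ordinary least squares (a=0) fails, while any a>0 gives a stable solution.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def solve(a: Matrix, b: Vector) -> Vector:
    """Solve a·x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def ridge(X: Matrix, y: Vector, a: float) -> Vector:
    """Return theta(a) = (X^T X + a I)^-1 X^T y."""
    p = len(X[0])
    # Build X^T X and add a on the diagonal.
    xtx = [[sum(row[i] * row[j] for row in X) + (a if i == j else 0.0)
            for j in range(p)] for i in range(p)]
    # Build X^T y.
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve(xtx, xty)


# Two perfectly correlated columns: X^T X is singular when a = 0.
X = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
y = [2.0, 4.0, 6.0]
theta = ridge(X, y, a=0.1)
print(theta)  # the two coefficients are equal by symmetry
```

Solving the linear system directly, rather than forming the inverse explicitly, is a common numerical choice and does not change the result of the formula.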

Specifically, a ridge regression algorithm is used to construct the fault source positioning analysis model. The time at which the Traceroute test of a route returns an abnormal result is taken as the abnormal time. The route service identifier and the network node service identifier are analyzed to obtain the associated routes, and the data of these routes and of the other network nodes are fed into the model to obtain fitted values for difference comparison; the data showing larger differences are aggregated and sorted by router. The more such data a source accounts for, the greater the probability that it is the root cause of the fault. This completes fault source positioning for the hidden network delay problem. The procedure is as follows:

First, the time at which a Traceroute command executed on the router returns an abnormal network delay is taken as the abnormal time.

Second, the route service identifier and the network node service identifier are analyzed to obtain the other routers associated with this router, together with the data of the network nodes or switches located between this router and those routers.

Then, at the abnormal time, the Traceroute command is executed to obtain the data of the other routes associated with the router, such as route C and route D; at the same abnormal time, the TR069 protocol is used to obtain the data of the network nodes and switches associated with the other services.

Finally, the associated route data and the associated network node data acquired at the abnormal time are fed into the fault source positioning analysis model to obtain a route fitted value and a network node fitted value. The two values are compared, and a difference greater than or equal to 10% is regarded as overfitting. This indicates that, at the abnormal time, the associated route data and the associated network node data differ greatly, and the difference indicates that a larger amount of abnormal data was generated.
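The final comparison step can be sketched as below. The text does not state what the 10% is measured against, so this sketch assumes a relative difference taken against the larger of the two fitted values; that baseline and the function names are assumptions.

```python
def relative_difference(route_fit: float, node_fit: float) -> float:
    """Relative gap between the two fitted values (assumed baseline:
    the larger of the two magnitudes)."""
    baseline = max(abs(route_fit), abs(node_fit))
    if baseline == 0:
        return 0.0
    return abs(route_fit - node_fit) / baseline


def is_overfit(route_fit: float, node_fit: float, threshold: float = 0.10) -> bool:
    """Flag the pair as overfitting when the gap reaches the threshold."""
    return relative_difference(route_fit, node_fit) >= threshold


print(is_overfit(100.0, 85.0))  # gap of 15% of the larger value
print(is_overfit(100.0, 95.0))  # gap of 5% of the larger value
```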

Specifically, the IP addresses of the abnormal data corresponding to the overfitting value are extracted and matched and classified against the route service identifier and the network node service identifier, yielding a ranking of the overfitting value service data at the abnormal time for the routers and the network nodes between them. The more data a source accounts for, the higher the probability that it is the fault source, which completes fault source positioning for the hidden network delay problem.
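The matching and ranking step can be sketched as a count of abnormal records per service. The mapping from IP address to service ids is assumed to have been built beforehand from the route and network node service identifiers; the function name `rank_services` and the sample addresses are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def rank_services(abnormal_ips: Iterable[str],
                  ip_to_services: Dict[str, List[str]]) -> List[Tuple[str, int]]:
    """Count abnormal records per service and sort in descending order.

    IPs without a known service identifier are ignored. Ties are broken
    alphabetically so the result is deterministic.
    """
    counts: Counter = Counter()
    for ip in abnormal_ips:
        for service in ip_to_services.get(ip, []):
            counts[service] += 1
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


ip_to_services = {
    "10.0.0.1": ["id1"],
    "10.0.0.2": ["id1", "id2"],
    "10.0.0.3": ["id3"],
}
abnormal_ips = ["10.0.0.1", "10.0.0.2", "10.0.0.2", "10.0.0.9"]
ranking = rank_services(abnormal_ips, ip_to_services)
print(ranking)
```

The service at the head of the ranking accounts for the most abnormal data and is therefore the first candidate for the fault source.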

Ridge regression prevents the model from overfitting, whereas the traditional least squares method lacks stability and reliability. To solve this problem, the ill-posed problem is converted into a well-posed problem by adding a regularization term to the loss function.

The network anomaly root cause positioning method of this embodiment can rapidly locate the root cause of a network anomaly even when information about the network switches traversed between route hops cannot be acquired, a situation that may cause hidden network delay, and thereby improves network operation and maintenance efficiency.

This embodiment can effectively uncover the hidden network delay that may result from missing network switch information when route tracing is performed with commands such as Traceroute and ping. At the same time, the network topology gives a more intuitive view of the network delay and packet loss of the switches and network nodes between routers, so that the network topology more closely reflects actual conditions.

Example 2

FIG. 2 is a schematic diagram of a network anomaly root-cause positioning system based on overfitting. As shown in fig. 2, the present invention further provides a network anomaly root-cause positioning system based on overfitting, the system comprising:

The generating module 201 is configured to collect network node data between routers, analyze associated service identifiers through logs, generate a route service identifier and a network node service identifier, and create an initial data pool to store data from the routers and the network nodes;

the determining module 202 is configured to test network delay between routers, take the time at which the network delay is abnormal as the abnormal time, import the associated route data and the associated network node data acquired at the abnormal time into the fault source positioning analysis model for analysis, and determine the abnormal data corresponding to the overfitting value;

the positioning module 203 is configured to match and classify the IP address of the abnormal data corresponding to the overfitting value against the route service identifier and the network node service identifier, obtain the overfitting value service data ranking for the abnormal time, and perform fault source positioning according to the overfitting value service data ranking.

Preferably, the generating module 201 generating the route service identifier and the network node service identifier includes:

Network node service identifier: network node#service id1,service id2

Preferably, the creating the initial data pool by the generating module 201 to store data from the router and the network node includes:

Preferably, the fault source location analysis model is:

||Xθ-y||²+||Γθ||²

θ(a)=(X^TX+aI)^-1X^Ty

The implementation process of the functions implemented by each module in this embodiment 2 is the same as the implementation process of each step in embodiment 1, and will not be described here again.

The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the invention, and all equivalent structural changes made by the specification and drawings of the present invention or direct/indirect application in other related technical fields are included in the scope of the present invention.

Claims

1. The network anomaly root-cause positioning method based on the overfitting is characterized by comprising the following steps of:

2. The method of claim 1, wherein generating the routing service identity and the network node service identity comprises:

Network node service identifier: network node#service id1,service id2

3. The method of claim 2, wherein generating the routing service identity and the network node service identity comprises:

4. The method of claim 1, wherein creating an initial data pool to store data from routers and network nodes comprises:

5. The method of claim 1, wherein the fault source localization analysis model is:

||Xθ-y||²+||Γθ||²

θ(a)=(X^TX+aI)^-1X^Ty

6. A network anomaly root-cause positioning system based on overfitting, the system comprising:

And the positioning module is used for carrying out matching classification on the IP address of the abnormal data corresponding to the overfitting value, the route service identifier and the network node service identifier, obtaining the overfitting value service data sequence of the abnormal time, and carrying out fault source positioning according to the overfitting value service data sequence.

7. The system of claim 6, wherein the generating module generating the routing traffic identity and the network node traffic identity comprises:

Network node service identifier: network node#service id1,service id2

8. The system of claim 7, wherein the generating module generating the routing traffic identity and the network node traffic identity comprises:

9. The system of claim 6, wherein the generating module creating an initial data pool to store data from the router and the network node comprises:

10. The system of claim 6, wherein the fault source location analysis model is:

||Xθ-y||²+||Γθ||²

θ(a)=(X^TX+aI)^-1X^Ty