CN110309009B

CN110309009B - Situation-based operation and maintenance fault root cause positioning method, device, equipment and medium

Info

Publication number: CN110309009B
Application number: CN201910421407.4A
Authority: CN
Inventors: 姚斯宇; 朱品燕
Original assignee: Beijing Yunji Zhizao Technology Co ltd
Current assignee: Beijing Yunji Zhizao Technology Co ltd
Priority date: 2019-05-21
Filing date: 2019-05-21
Publication date: 2022-05-13
Anticipated expiration: 2039-05-21
Also published as: CN110309009A

Abstract

The invention provides a situation-based operation and maintenance fault root cause positioning method and device, computer equipment and a storage medium. The method comprises the following steps: acquiring alarm information corresponding to the current alarm; and inputting the alarm information into a machine learning model trained in advance to obtain the corresponding fault root cause. The process by which the machine learning model determines the fault root cause according to the alarm information comprises: calculating, according to the alarm information, the similarity between the alarm and each of a plurality of historical situations; determining, according to these similarities, a situation in which to search for the fault root cause of the alarm; and calculating the importance of each alarm source in the situation, and determining the fault root cause of the alarm according to the importance of each alarm source in the situation. The method does not place high demands on operation and maintenance personnel, saves time and labor, takes the chain reaction of faults into account, analyzes root cause positioning from a global perspective, and improves the accuracy of root cause positioning.

Description

Situation-based operation and maintenance fault root cause positioning method, device, equipment and medium

Technical Field

The invention relates to the technical field of fault location, in particular to a situation-based operation and maintenance fault root cause location method and device, computer equipment and a storage medium.

Background

Various faults can occur while a service is running. Experienced operation and maintenance personnel are often required to read a large amount of system alarm information and to analyze and deduce the fault root cause by means of domain knowledge. This approach places high demands on operation and maintenance personnel and consumes considerable manpower and material resources. In addition, the limits of the personnel's professional domain knowledge often introduce some deviation into the fault root cause positioning result.

For this reason, technicians in the field have proposed searching for fault root causes by detecting anomalies through sequence analysis. Historical data is used to train a sequence prediction model, such as a moving average model or an LSTM sequence prediction model, and anomalies are detected from the difference between the value predicted by the model and the real metric data of the machine: when the difference exceeds a threshold, the machine is considered faulty. This approach mainly targets anomaly detection on a specific machine metric. When there are multiple machines, different models must be trained and run simultaneously, which consumes a large amount of computing power; the multiple models also produce many false alarms, which adds a heavy manual burden to operation and maintenance. Furthermore, sequence-analysis anomaly detection does not consider the chain reaction of faults and models the associations among multiple machines insufficiently, so the associations among anomalies cannot be detected and root cause positioning cannot be analyzed from a global perspective.

Disclosure of Invention

In order to solve the technical problems or at least partially solve the technical problems, the invention provides a method, a device, a computer device and a storage medium for positioning an operation and maintenance fault root cause based on a situation.

In a first aspect, the present invention provides a method for locating an operation and maintenance fault root cause based on a situation, including:

acquiring alarm information corresponding to the alarm, wherein the alarm information comprises an alarm source, alarm time and abnormal description information which are generated in the service operation process;

inputting the alarm information into a machine learning model trained in advance to obtain a corresponding fault root cause; wherein the process of the machine learning model determining the fault root cause according to the alarm information comprises: according to the alarm information, calculating the similarity between the alarm and a plurality of historical situations; each history situation comprises alarm information corresponding to the history alarm in the corresponding history time period; determining a situation for searching a fault root cause of the alarm according to the similarity between the alarm and a plurality of historical situations; and calculating the importance of each warning source in the situation, and determining the fault root cause of the warning according to the importance of each warning source in the situation.

In a second aspect, the present invention provides a situation-based operation and maintenance fault root cause positioning apparatus, including:

the information acquisition module is used for acquiring alarm information corresponding to the alarm, wherein the alarm information comprises an alarm source, alarm time and abnormal description information which are generated in the service operation process;

the root cause determining module is used for inputting the alarm information into a machine learning model trained in advance to obtain a corresponding fault root cause; wherein the process of the machine learning model determining the fault root cause according to the alarm information comprises: according to the alarm information, calculating the similarity between the alarm and a plurality of historical situations; each history situation comprises alarm information corresponding to the history alarm in the corresponding history time period; determining a situation for searching a fault root cause of the alarm according to the similarity between the alarm and a plurality of historical situations; and calculating the importance of each warning source in the situation, and determining the fault root cause of the warning according to the importance of each warning source in the situation.

In a third aspect, the present invention provides a computer device comprising a processor and a computer program stored on a memory and executable on the processor, the processor implementing the steps of the method when executing the computer program.

In a fourth aspect, the invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the above-described method.

The invention provides a situation-based operation and maintenance fault root cause positioning method and device, computer equipment and a storage medium. The alarm information of the current alarm is first acquired and then input into a machine learning model, and the machine learning model determines the fault root cause of the current alarm according to the alarm information. The whole process requires little participation from operation and maintenance personnel, so it places no high demands on them and saves time and labor. Further, when determining the fault root cause of the current alarm, the machine learning model first calculates the similarity between the current alarm and each of a plurality of historical situations, then determines, according to these similarities, the situation in which to search for the fault root cause, and finally determines the fault root cause within that situation. The fault root cause behind the current alarm is not necessarily an alarm source of the current alarm and may be an alarm source of another alarm: a fault tends to set off a chain reaction that causes a series of abnormalities and therefore multiple alarms. The fault root cause is thus searched for not only in the current alarm but also in other alarms. Because the situation used for the search is determined according to similarity, the alarms in it are related to one another, which makes the finally determined fault root cause more accurate. The method and the device therefore take the chain reaction of faults into account, that is, the correlation among a plurality of alarm sources, analyze root cause positioning from a global perspective, and improve the accuracy of root cause positioning.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

FIG. 1 is a flowchart of a situation-based operation and maintenance fault root cause positioning method in an embodiment of the present application;

FIG. 2 is a flowchart of the process by which the machine learning model determines the fault root cause according to the alarm information in an embodiment of the present application;

FIG. 3 is a block diagram of a situation-based operation and maintenance fault root cause positioning device in an embodiment of the present application;

FIG. 4 is a block diagram of a computer device in an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In a first aspect, an embodiment of the present application provides a situation-based operation and maintenance fault root cause positioning method. As shown in FIG. 1, the method includes:

S100, acquiring alarm information corresponding to the current alarm, wherein the alarm information comprises an alarm source, an alarm time and abnormal description information generated during service operation;

It can be understood that the alarm information refers to abnormal log information generated while the service program is running. It generally includes an alarm source, an alarm time and abnormal description information, and may also include information such as an alarm level and a monitoring program type.

The alarm source refers to a machine in which an abnormality has occurred, for example, a server in the back-end server cluster of a shopping website. The alarm information of one alarm may contain a plurality of alarm sources, that is, abnormalities may occur in a plurality of machines.

The abnormal description information describes the abnormal condition, for example, abnormal data of some modules (such as processors) in the machine.
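As an illustration only, the alarm information described above could be held in a small record type. The class name `Alarm` and all field names below are assumptions made for this sketch; the source does not prescribe a data layout.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Tuple


@dataclass(frozen=True)
class Alarm:
    """One alarm record; names are illustrative, not taken from the source."""
    sources: Tuple[str, ...]             # machines in which an abnormality occurred
    time: datetime                       # alarm time
    description: str                     # abnormal description information
    level: Optional[str] = None          # optional alarm level
    monitor_type: Optional[str] = None   # optional monitoring program type


# Hypothetical example values.
alarm = Alarm(
    sources=("order-server-03",),
    time=datetime(2019, 5, 21, 9, 0),
    description="CPU usage abnormal on processor module",
    level="critical",
)
```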

S200, inputting the alarm information into a machine learning model trained in advance to obtain the corresponding fault root cause;

It can be understood that the alarm information of the current alarm is input into the machine learning model trained in advance, and the machine learning model outputs the fault root cause of the current alarm.

Wherein the process of the machine learning model determining the fault root cause according to the alarm information comprises:

S210, calculating the similarity between the current alarm and each of a plurality of historical situations according to the alarm information; each historical situation comprises the alarm information corresponding to the historical alarms in the corresponding historical time period;

It can be understood that a fault generally sets off a chain reaction that causes multiple tasks or machines to become abnormal and issue a series of alarms. The chain reaction caused by a fault is called a situation, so a situation is a way of describing the series of alarms generated by a fault. That is, one situation may include the alarm information of a plurality of alarms.

For example, if the current alarm occurs at 9:00 a.m. on a certain day, all situations within the seven days before that time can be selected as historical situations. That is, all alarms, and hence all faults, of the previous seven days are taken into account when analyzing the current alarm.
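A minimal sketch of this selection, assuming each situation is represented simply as a list of alarm times and that a situation counts as historical if at least one of its alarms falls inside the look-back window. The function name and the representation are hypothetical.

```python
from datetime import datetime, timedelta


def select_historical_situations(situations, now, lookback=timedelta(days=7)):
    """Keep situations with at least one alarm inside [now - lookback, now)."""
    start = now - lookback
    return [s for s in situations if any(start <= t < now for t in s)]


now = datetime(2019, 5, 21, 9, 0)
situations = [
    [datetime(2019, 5, 20, 23, 15), datetime(2019, 5, 20, 23, 17)],  # previous day
    [datetime(2019, 5, 1, 8, 0)],                                    # older than 7 days
]
recent = select_historical_situations(situations, now)
```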

In practical applications, step S210 may specifically include the following steps:

S211, calculating the co-occurrence graph distance, the locality-sensitive hash distance and the alarm time difference between the current alarm and each historical alarm in each historical situation according to the alarm information;

Because a historical situation may contain multiple alarms, the similarity between the current alarm and a historical situation is calculated in two stages: first the similarity between the current alarm and each alarm in the historical situation is calculated, and then the similarity between the current alarm and the historical situation is derived from these values. For example, the average of the similarities between the current alarm and each alarm in the historical situation is used as the similarity between the current alarm and the historical situation.

When calculating the similarity between the current alarm and an alarm in a historical situation, various indexes may be considered, such as the co-occurrence graph distance, the alarm time difference and the locality-sensitive hash distance (for example, a simhash distance) between the two alarms.

The co-occurrence graph distance is determined according to the alarm source of the alarm and the alarm co-occurrence graph. The basic assumption of the alarm co-occurrence graph is that alarms which frequently occur at the same time often have some causal relationship, so counting the co-occurrence of alarms over a period of time can effectively help situation clustering. The alarm co-occurrence graph consists of nodes and edges connecting the nodes. A node may be an alarm source, an alarm type, a fault described in the abnormal description information, or a combination of these. The weight of an edge is the reciprocal of the number of times its two nodes co-occur, so the smaller the distance between two nodes, the stronger the association between the alarms they represent. The alarm co-occurrence graph is constructed as follows: a time window is set for buffering the alarms within that period. For example, if the time window is set to one month, only the alarms within the month are considered, and the alarm co-occurrence graph is constructed from all of them. When a new alarm is received, an edge is added in the alarm co-occurrence graph between the new alarm and every alarm in the current window, and the weights of all edges are updated. The co-occurrence graph distance between the current alarm and an alarm in a historical situation is therefore the length of the edge connecting the nodes of the two alarms.
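The construction just described can be sketched as follows. This is a simplified illustration under stated assumptions: nodes are alarm-source names, an unconnected pair is given infinite distance, and co-occurrence counts are not decremented when alarms leave the window. The class name `CooccurrenceGraph` and its methods are hypothetical.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta
from math import inf


class CooccurrenceGraph:
    """Edge weight = 1 / (number of times two nodes co-occurred in a window)."""

    def __init__(self, window=timedelta(days=30)):
        self.window = window
        self.buffer = deque()            # (time, node) pairs still inside the window
        self.counts = defaultdict(int)   # frozenset({a, b}) -> co-occurrence count

    def add_alarm(self, node, time):
        # Drop buffered alarms that have fallen out of the time window.
        while self.buffer and time - self.buffer[0][0] > self.window:
            self.buffer.popleft()
        # Add or strengthen an edge to every distinct node still in the window.
        for other in {n for _, n in self.buffer if n != node}:
            self.counts[frozenset((node, other))] += 1
        self.buffer.append((time, node))

    def distance(self, a, b):
        if a == b:
            return 0.0
        count = self.counts.get(frozenset((a, b)), 0)
        return 1.0 / count if count else inf


graph = CooccurrenceGraph(window=timedelta(hours=1))
t0 = datetime(2019, 5, 21, 9, 0)
graph.add_alarm("db-01", t0)
graph.add_alarm("web-01", t0 + timedelta(minutes=1))
graph.add_alarm("db-01", t0 + timedelta(minutes=2))
graph.add_alarm("cache-01", t0 + timedelta(hours=3))  # arrives after the window expired
```

In this example `db-01` and `web-01` co-occur twice, so their edge length is 0.5, while `cache-01` shares no window with either and stays unconnected.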

The locality-sensitive hash distance is determined according to the abnormal description information in the alarm information. First, entity extraction is performed on the abnormal description information to obtain the entities in the current alarm; then a locality-sensitive hashing algorithm is used to calculate the locality-sensitive hash distance between the entities in the current alarm and the entities in an alarm in the historical situation. An entity is subject information appearing in the abnormal description information. The abnormal description information is often a mixture of Chinese and English, whereas a traditional word segmentation tool can extract only Chinese or only English. In mixed Chinese-English text the English words are usually verbs and nouns, and the nouns are usually the important entities. English words are therefore extracted, including combinations of English letters, digits and/or special symbols, and the nouns among them are kept as entities. The locality-sensitive hash distance thus measures the difference between the entities in the abnormal description information of the current alarm and the entities described by an alarm in the historical situation. Locality-sensitive hashing is commonly used to detect small modifications to text, and the alarms caused by one fault usually differ only slightly, so locality-sensitive hashing can find such related alarms effectively.
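One possible way to realize this step is a simhash over the extracted tokens, compared by Hamming distance. The sketch below makes several assumptions not stated in the source: the regular expression used for English-like tokens, the use of MD5 to obtain 64-bit token hashes, and the omission of the noun filter (all English-like tokens are kept). Function names are hypothetical.

```python
import hashlib
import re

# English-like tokens: a letter followed by letters, digits or a few symbols.
TOKEN = re.compile(r"[A-Za-z][A-Za-z0-9_.\-/]*")


def extract_entities(description):
    """Pull English-like tokens out of mixed Chinese-English text."""
    return TOKEN.findall(description)


def simhash(tokens, bits=64):
    """Classic simhash: per-bit vote over the hashes of all tokens."""
    votes = [0] * bits
    for tok in tokens:
        h = int.from_bytes(hashlib.md5(tok.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, v in enumerate(votes) if v > 0)


def simhash_distance(desc_a, desc_b):
    """Hamming distance between the simhashes of two descriptions."""
    a = simhash(extract_entities(desc_a))
    b = simhash(extract_entities(desc_b))
    return bin(a ^ b).count("1")


entities = extract_entities("服务 order-service 在 host-12 上超时 timeout")
```

Here `entities` is `["order-service", "host-12", "timeout"]`; two descriptions with identical entities have distance 0.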

In this way, the locality-sensitive hashing algorithm measures the similarity among text entities, while the alarm co-occurrence graph makes full use of the co-occurrence relation of the alarms and compensates for the fact that locality-sensitive hashing uses only text information, which facilitates situation clustering from multiple angles.

S212, calculating the similarity between the current alarm and each historical situation according to the co-occurrence graph distance, the locality-sensitive hash distance and the alarm time difference between the current alarm and each historical alarm in that historical situation.

The similarity can be calculated in various ways, one of which is as follows. A threshold and a weight are set for each index: a co-occurrence graph distance threshold a1 and weight a2, a simhash distance threshold b1 and weight b2, and an alarm time difference threshold c1 and weight c2. The difference between the co-occurrence graph distance (between the current alarm and an alarm in the historical situation) and the threshold a1 is multiplied by the weight a2; the difference between the simhash distance and the threshold b1 is multiplied by the weight b2; and the difference between the alarm time difference and the threshold c1 is multiplied by the weight c2. Finally, the three products are summed to give the similarity between the current alarm and the alarm in the historical situation. This is only one way of calculating the similarity between alarms; other ways exist and are not listed here.
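The weighted-difference calculation can be sketched as below. The source does not state the sign of each difference; this sketch assumes "threshold minus distance", so that a smaller distance yields a larger similarity. The averaging over a situation follows the example given for S211. All names and the numeric parameter values are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class SimilarityParams:
    a1: float  # co-occurrence graph distance threshold
    a2: float  # co-occurrence graph distance weight
    b1: float  # simhash distance threshold
    b2: float  # simhash distance weight
    c1: float  # alarm time difference threshold (seconds, by assumption)
    c2: float  # alarm time difference weight


def alarm_similarity(graph_dist, simhash_dist, time_diff, p):
    """Sum of weighted (threshold - distance) terms; the sign is an assumption."""
    return (p.a2 * (p.a1 - graph_dist)
            + p.b2 * (p.b1 - simhash_dist)
            + p.c2 * (p.c1 - time_diff))


def situation_similarity(per_alarm_distances, p):
    """Average the similarity to each historical alarm in one situation."""
    return mean(alarm_similarity(g, s, t, p) for g, s, t in per_alarm_distances)


params = SimilarityParams(a1=1.0, a2=0.5, b1=10.0, b2=0.03, c1=600.0, c2=0.001)
one = alarm_similarity(0.5, 4, 120, params)
whole = situation_similarity([(0.5, 4, 120), (1.0, 10, 600)], params)
```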

S220, determining a situation for searching a fault root cause of the current alarm according to the similarity between the current alarm and a plurality of historical situations;

It can be understood that the fault root cause is the underlying cause of a fault. The fault root cause of the current alarm is not necessarily an alarm source of the current alarm and may be an alarm source of another alarm: a fault may set off a chain reaction that causes a series of abnormalities and therefore multiple alarms. When searching for the fault root cause of the current alarm, the search therefore covers not only the current alarm but also other alarms. Here a situation is determined first and the fault root cause is then searched for within it, that is, among a group of associated alarms.

In practical applications, as shown in FIG. 2, step S220 may include the following steps:

S221, judging whether there is a historical situation whose similarity to the current alarm is greater than a first threshold: if not, creating a new situation, adding the alarm information corresponding to the current alarm to the new situation, and taking the new situation as the situation for searching for the fault root cause of the current alarm.

It can be understood that if the similarity between a historical situation and the current alarm is high, the historical situation closely resembles the current alarm, and searching for the fault root cause in that historical situation can be considered.

The first threshold is determined during training of the machine learning model and can be adjusted according to feedback on whether the fault root cause finally determined by the method provided in this application is the real fault root cause.

It can be understood that a new situation is created, and the alarm information corresponding to the alarm is added to the new situation, that is, only the alarm is currently provided in the new situation.

It can be understood that a historical situation whose similarity to the current alarm is greater than the first threshold may or may not exist. When none exists, all historical situations differ greatly from the current alarm, so the fault root cause is searched for among the alarm sources of the current alarm.

In practical applications, when historical situations with a similarity to the current alarm greater than the first threshold exist, there may be one or several of them. When there is only one, the fault root cause can be searched for in the current alarm together with that historical situation. That is, as shown in FIG. 2, step S220 may further include the following step:

S222, if a historical situation whose similarity is greater than the first threshold exists, judging whether the number of such historical situations is 1: if so, adding the alarm information of the current alarm to the historical situation whose similarity to the current alarm is greater than the first threshold, and taking the historical situation to which the alarm information has been added as the situation for searching for the fault root cause of the current alarm.

It can be understood that when only one historical situation has a similarity to the current alarm greater than the first threshold, that historical situation is updated: the alarm information of the current alarm is added to it, and the updated historical situation is then used as the situation for searching for the fault root cause of the current alarm. Updating a historical situation in this way is in effect a form of clustering of alarms or situations.

However, if several historical situations have a similarity to the current alarm greater than the first threshold, the historical situations with the higher similarities can be selected for fusion, and the fault root cause is searched for in the fused situation. That is, as shown in FIG. 2, step S220 may further include the following step:

S223, if the number of historical situations whose similarity to the current alarm is greater than the first threshold is greater than 1, adding the alarm information of the current alarm to the historical situation with the highest similarity to the current alarm, and judging whether the similarity between the historical situation to which the alarm information has been added and the historical situation with the second highest similarity to the current alarm is greater than a second threshold: if so, fusing the historical situation to which the alarm information has been added with the historical situation with the second highest similarity to the current alarm, and taking the fused situation as the situation for searching for the fault root cause of the current alarm.

It can be understood that after the alarm information of the current alarm is added to the history situation with the highest similarity to the current alarm, the similarity between the history situation to which the alarm information of the current alarm is added and the history situation with the second highest similarity to the current alarm is calculated, and then whether the similarity between the two situations is greater than a second threshold value is judged.

The historical situation to which the alarm information of the current alarm is added is the one that, before the addition, had the highest similarity to the current alarm. The second highest similarity is lower only than the highest. That is, only two historical situations are considered here: the one with the highest similarity to the current alarm and the one with the second highest similarity to it.

It will be appreciated that the main role of situation fusion is to merge similar situations into one. The alarms generated at the start of some situations differ considerably and may be clustered into several situations, but the contents of these situations tend to become similar over time. Situation fusion mainly merges such situations, thereby reducing the troubleshooting workload of operation and maintenance personnel.

If the similarity between the historical situation to which the alarm information of the current alarm has been added and the historical situation with the second highest similarity to the current alarm is high, the two situations are very similar, so they are fused and the fused situation is used as the situation for searching for the fault root cause of the current alarm. If the similarity between the two situations is low, however, they are not suitable for fusion, and it is more appropriate to search for the fault root cause only in the historical situation to which the alarm information has been added. That is, as shown in FIG. 2, step S220 may further include the following step:

S224, if the similarity between the historical situation to which the alarm information of the current alarm has been added and the historical situation with the second highest similarity to the current alarm is less than or equal to the second threshold, taking the historical situation to which the alarm information has been added as the situation for searching for the fault root cause of the current alarm.

The greater the similarity between two situations, the closer the two situations are. The similarity between two situations can be calculated from the similarities between their alarms. Because a situation may contain multiple alarms, the lowest similarity between an alarm in one situation and an alarm in the other can be used as the similarity between the two situations.
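Steps S221 to S224 can be sketched as one routine. This is an illustrative reading under stated assumptions: a situation is a plain list of alarms, the pairwise alarm similarity is supplied by the caller as `alarm_sim`, alarm-to-situation similarity is the average mentioned for S211, and situation-to-situation similarity is the minimum pairwise similarity described above. All function names are hypothetical.

```python
def alarm_to_situation(alarm, situation, alarm_sim):
    """Average similarity between one alarm and the alarms of a situation."""
    return sum(alarm_sim(alarm, h) for h in situation) / len(situation)


def situation_to_situation(s1, s2, alarm_sim):
    """Lowest pairwise alarm similarity between two situations."""
    return min(alarm_sim(a, b) for a in s1 for b in s2)


def assign_situation(alarm, situations, alarm_sim, first_threshold, second_threshold):
    """Update `situations` in place; return the situation to search for the root cause."""
    scored = sorted(
        ((alarm_to_situation(alarm, s, alarm_sim), i) for i, s in enumerate(situations)),
        reverse=True,
    )
    matches = [(score, i) for score, i in scored if score > first_threshold]
    if not matches:                        # S221: no similar situation, start a new one
        new_situation = [alarm]
        situations.append(new_situation)
        return new_situation
    best = situations[matches[0][1]]
    best.append(alarm)                     # S222/S223: join the most similar situation
    if len(matches) == 1:                  # S222: exactly one candidate
        return best
    second_index = matches[1][1]
    second = situations[second_index]
    if situation_to_situation(best, second, alarm_sim) > second_threshold:
        best.extend(second)                # S223: fuse the two situations
        del situations[second_index]
    return best                            # S224 when no fusion took place


# Toy example: alarms are numbers, similarity is the negative absolute difference.
sim = lambda a, b: -abs(a - b)
pool = [[0.0], [2.0]]
fused = assign_situation(1.0, pool, sim, first_threshold=-2, second_threshold=-5)
```

In the toy example both situations pass the first threshold and are similar enough to each other, so they are fused into a single situation containing all three alarms.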

S230, calculating the importance of each alarm source in the situation, and determining the fault root cause of the current alarm according to the importance of each alarm source in the situation.

It will be appreciated that the more important an alarm source is in a situation, the greater its probability of being the fault root cause.

In practical applications, a plurality of factors may be considered in calculating the importance of the alarm source, such as the PageRank value of the alarm source on the system call graph (which may also be referred to as the rank value of the degree of dependence of the alarm source on the system call graph), the alarm time, the frequency of generating the alarm by the alarm source, and the like. That is, the step S230 may specifically include:

calculating the importance of each alarm source in the situation according to the frequency with which the alarm source generates alarms, the alarm time, and the depended-degree ranking value of the alarm source in the system call graph.

The system call graph is the graph of real call relations among the actual machines on which the service is deployed. Alarms generated by nodes of high importance in the system call graph are often more important, so the system call graph represents the importance of an alarm more effectively. Few existing algorithms make full use of the system call graph for root cause recommendation. The higher an alarm source ranks in the system call graph, that is, the larger its PageRank value, the more important it is and the more likely the alarm source corresponding to that node is to be the fault root cause.
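A depended-degree ranking of this kind can be illustrated with a plain power-iteration PageRank. The sketch assumes edges point from caller to callee, so that a machine many others depend on accumulates rank; the damping factor, the iteration count and the example call graph are hypothetical choices, not values from the source.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over (caller, callee) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out = {n: [] for n in nodes}
    for caller, callee in edges:
        out[caller].append(callee)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing calls is spread over all nodes.
        dangling = sum(rank[n] for n in nodes if not out[n])
        base = (1 - damping) / len(nodes) + damping * dangling / len(nodes)
        new_rank = {n: base for n in nodes}
        for n in nodes:
            for callee in out[n]:
                new_rank[callee] += damping * rank[n] / len(out[n])
        rank = new_rank
    return rank


# Hypothetical call graph: the web tier calls two services, both call the database.
calls = [("web", "order"), ("web", "user"), ("order", "db"), ("user", "db")]
ranks = pagerank(calls)
most_depended = max(ranks, key=ranks.get)
```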

The alarm time is considered because the earlier an alarm is generated, the more likely it is to be the root cause. The frequency with which an alarm source generates alarms is considered because the lower the alarm frequency, the more likely the alarm source corresponding to the node is to be the fault root cause.

Because three factors are considered when calculating the importance of an alarm source, each factor has a weight. The weights can be determined during training of the machine learning model and adjusted while the model is in use. The weights of the three factors should sum to 1. When new experience is added, that is, when new alarm information is input, the sum of the weights may change, in which case the weights are normalized so that they again sum to 1.
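The weighted combination and the normalization of the weights might look like the sketch below. The source does not say how the three factors are brought onto a common scale; min-max scaling to [0, 1] is an assumption of this sketch, as are the function names, the dictionary keys and the example values.

```python
def normalize_weights(weights):
    """Rescale the factor weights so that they sum to 1."""
    total = sum(weights.values())
    return {name: value / total for name, value in weights.items()}


def _scale(values, higher_is_better):
    """Min-max scale to [0, 1]; flipped when a lower raw value is more suspicious."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        return {k: 1.0 for k in values}
    if higher_is_better:
        return {k: (v - lo) / (hi - lo) for k, v in values.items()}
    return {k: (hi - v) / (hi - lo) for k, v in values.items()}


def rank_root_causes(frequency, first_alarm_time, pagerank_value, weights):
    """Return alarm sources ordered from most to least likely root cause."""
    w = normalize_weights(weights)
    f = _scale(frequency, higher_is_better=False)         # rarer alarms score higher
    t = _scale(first_alarm_time, higher_is_better=False)  # earlier alarms score higher
    p = _scale(pagerank_value, higher_is_better=True)     # more depended-on scores higher
    score = {s: w["frequency"] * f[s] + w["time"] * t[s] + w["pagerank"] * p[s]
             for s in frequency}
    return sorted(score, key=score.get, reverse=True)


ordered = rank_root_causes(
    frequency={"db": 1, "web": 5, "cache": 3},
    first_alarm_time={"db": 0, "web": 60, "cache": 30},   # seconds since first alarm
    pagerank_value={"db": 0.5, "web": 0.2, "cache": 0.3},
    weights={"frequency": 2.0, "time": 1.0, "pagerank": 1.0},
)
```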

In actual application scenarios, different service programs and different operation and maintenance personnel divide faults at very different granularities, so the same set of parameters cannot be applied to all cases. While the program is running, the invention can also adjust the parameters dynamically through feedback on the results, thereby achieving a better effect. The feedback indicates whether the alarm source output by the machine learning model is the real fault root cause, how similar or different the output alarm source is from the real fault root cause, and the like. The parameters involved include, for example, the first threshold, the second threshold, the time window of the co-occurrence graph, the co-occurrence graph distance threshold, the simhash distance threshold, the alarm time difference threshold, and the normalization parameters or weights applied to the different experiences during recommendation. These parameters may be obtained by means of a grid search. During operation, the parameters can be adjusted promptly according to root cause feedback, so that the model achieves a better effect.
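A grid search over such parameters can be sketched generically. The scoring function `evaluate` stands in for whatever feedback measure is available (for example, how often the recommended root cause matches the confirmed one); it, the parameter names and the candidate values are hypothetical.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Try every combination in `param_grid`; return the best parameters and score."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Stand-in scoring function; a real one would replay labelled historical alarms.
def evaluate(params):
    return -(params["first_threshold"] - 0.6) ** 2 - (params["second_threshold"] - 0.4) ** 2


best_params, best_score = grid_search(
    {"first_threshold": [0.4, 0.6, 0.8], "second_threshold": [0.2, 0.4]},
    evaluate,
)
```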

In the operation and maintenance fault root cause positioning method described above, the alarm information of the current alarm is first obtained and then input into the machine learning model, which determines the fault root cause of the current alarm from that information. The process does not require heavy involvement of operation and maintenance personnel, so it places no high demands on them and saves time and labor. To determine the fault root cause of the current alarm, the machine learning model first calculates the similarity between the current alarm and each of a plurality of historical situations, then uses these similarities to determine the situation in which to search for the fault root cause, and finally determines the fault root cause within that situation. The fault root cause of the current alarm is not necessarily the alarm source of the current alarm; it may be the alarm source of another alarm, because a fault triggers a chain reaction that can produce a series of abnormalities and therefore multiple alarms. The fault root cause is accordingly searched for not only in the current alarm but also in other alarms. Because the situation is determined according to similarity, the alarms within it are related to one another, which makes the finally determined fault root cause more accurate. The present application thus takes the linkage effect of faults into account, that is, the correlation among multiple alarm sources, and analyzes root cause positioning from a global perspective, improving its accuracy.
That is, the present application considers multiple machines together rather than a single machine in isolation. This can greatly reduce erroneous root cause recommendations or early warnings and the cost of manually troubleshooting errors, and the application is suitable for fault root cause positioning in a variety of application scenarios.

In a second aspect, an embodiment of the present application provides a situation-based operation and maintenance fault root cause locating apparatus, as shown in fig. 3, the apparatus 300 includes:

an information obtaining module 310, configured to obtain alarm information corresponding to the alarm, where the alarm information includes an alarm source, alarm time, and exception description information generated in a service operation process;

a root cause determining module 320, configured to input the alarm information into a pre-trained machine learning model to obtain a corresponding fault root cause; wherein the process by which the machine learning model determines the fault root cause according to the alarm information comprises: calculating, according to the alarm information, the similarity between the alarm and each of a plurality of historical situations, each historical situation comprising the alarm information corresponding to the historical alarms in a corresponding historical time period; determining, according to the similarity between the alarm and the plurality of historical situations, a situation in which to search for the fault root cause of the alarm; and calculating the importance of each alarm source in the situation, and determining the fault root cause of the alarm according to the importance of each alarm source in the situation.

That is, the machine learning model includes the following units:

the similarity calculation unit is used for calculating the similarity between the alarm and a plurality of historical situations respectively according to the alarm information; each history situation comprises alarm information corresponding to the history alarm in the corresponding history time period;

the situation determining unit is used for determining the situation of searching the fault root cause of the current alarm according to the similarity between the current alarm and a plurality of historical situations;

and the root cause determining unit is used for calculating the importance of each alarm source in the situation and determining the fault root cause of the alarm according to the importance of each alarm source in the situation.

In some embodiments, the similarity calculation unit is specifically configured to: calculate, according to the alarm information, a co-occurrence graph distance, a locality-sensitive hash distance and an alarm time difference between the current alarm and each historical alarm in each historical situation; and calculate the similarity between the current alarm and each historical situation according to the co-occurrence graph distance, the locality-sensitive hash distance and the alarm time difference between the current alarm and each historical alarm in that historical situation.
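One way to turn the three distances into a similarity is sketched below. SimHash with a Hamming distance is used as the locality-sensitive hash, in line with the simHash threshold mentioned earlier. The rule for combining the three terms, the normalization bounds, the aggregation over historical alarms, and the sample alarm texts are all assumptions made for illustration; the co-occurrence graph distance is taken as a given input.

```python
import hashlib

BITS = 64  # fingerprint width (assumption)


def simhash(text):
    """SimHash fingerprint of whitespace-separated tokens, as a 64-bit integer."""
    counts = [0] * BITS
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")
        for i in range(BITS):
            counts[i] += 1 if (token_hash >> i) & 1 else -1
    return sum(1 << i for i in range(BITS) if counts[i] > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def closeness(distance, max_distance):
    """Map a non-negative distance to [0, 1], where 1 means identical."""
    return 1.0 - min(distance, max_distance) / max_distance


def alarm_similarity(graph_dist, hash_dist, time_diff_s,
                     max_graph_dist=5, max_time_diff_s=600.0):
    """Unweighted mean of the three closeness terms (combination rule assumed)."""
    return (closeness(graph_dist, max_graph_dist)
            + closeness(hash_dist, BITS)
            + closeness(time_diff_s, max_time_diff_s)) / 3


def situation_similarity(pairwise_similarities):
    """Aggregate alarm-to-alarm similarities into one score (mean, by assumption)."""
    return sum(pairwise_similarities) / len(pairwise_similarities)


# Hypothetical exception descriptions from the current and a historical alarm.
a = simhash("disk usage above 95% on /var")
b = simhash("disk usage above 97% on /var")
score = alarm_similarity(graph_dist=1, hash_dist=hamming(a, b), time_diff_s=30.0)
```

A weighted combination, or taking the maximum rather than the mean over historical alarms, would fit the same description equally well.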

In some embodiments, the situation determining unit is specifically configured to: determine whether there is a historical situation whose similarity to the current alarm is greater than a first threshold; and if not, establish a new situation, add the alarm information corresponding to the current alarm to the new situation, and take the new situation as the situation in which to search for the fault root cause of the current alarm.

In some embodiments, the situation determining unit is further specifically configured to: if there are historical situations whose similarity to the current alarm is greater than the first threshold, determine whether the number of such historical situations is 1; and if so, add the alarm information of the current alarm to that historical situation, and take the historical situation with the added alarm information as the situation in which to search for the fault root cause of the current alarm.

In some embodiments, the situation determining unit is further specifically configured to: if the number of historical situations whose similarity to the current alarm is greater than the first threshold is greater than 1, add the alarm information of the current alarm to the historical situation with the highest similarity to the current alarm, and determine whether the similarity between the historical situation to which the alarm information of the current alarm has been added and the historical situation with the highest similarity to the current alarm is greater than a second threshold; and if so, fuse the two historical situations and take the situation obtained by fusion as the situation in which to search for the fault root cause of the current alarm.

In some embodiments, the situation determining unit is further specifically configured to: if the similarity between the historical situation to which the alarm information of the current alarm has been added and the historical situation with the highest similarity to the current alarm is less than or equal to the second threshold, take the historical situation with the added alarm information as the situation in which to search for the fault root cause of the current alarm.
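The branching performed by the situation determining unit can be sketched as a single function. Situations are modeled as plain lists of alarms, and the two similarity functions are passed in as parameters; the toy similarity functions and host names are hypothetical. One point is an interpretation rather than something the text states: when several historical situations pass the first threshold, the sketch compares the updated best situation with the next most similar candidate before deciding whether to fuse.

```python
def locate_situation(alarm, situations, alarm_sim, situation_sim,
                     first_threshold, second_threshold):
    """Return the situation in which to search for the root cause.

    `situations` is modified in place: the alarm is added to a new or existing
    situation, and two situations may be fused.
    """
    scored = [(alarm_sim(alarm, s), s) for s in situations]
    candidates = [pair for pair in scored if pair[0] > first_threshold]
    candidates.sort(key=lambda pair: pair[0], reverse=True)

    if not candidates:                      # no match: start a new situation
        new_situation = [alarm]
        situations.append(new_situation)
        return new_situation

    best = candidates[0][1]                 # most similar historical situation
    best.append(alarm)
    if len(candidates) == 1:                # exactly one match: use it
        return best

    # Assumption: compare the updated situation with the runner-up candidate.
    runner_up = candidates[1][1]
    if situation_sim(best, runner_up) > second_threshold:
        best.extend(runner_up)              # fuse the two situations
        situations[:] = [s for s in situations if s is not runner_up]
    return best


def group(alarm):
    """Toy helper: host group is the part of the name before the dash."""
    return alarm.split("-")[0]


def toy_alarm_sim(alarm, situation):
    """Toy similarity: share of alarms in the situation from the same host group."""
    return sum(group(a) == group(alarm) for a in situation) / len(situation)


def toy_situation_sim(a, b):
    """Toy similarity: Jaccard overlap of the host groups in two situations."""
    ga, gb = {group(x) for x in a}, {group(x) for x in b}
    return len(ga & gb) / len(ga | gb)


history = [["db-1", "db-2"], ["db-3", "web-1"], ["cache-1"]]
chosen = locate_situation("db-9", history, toy_alarm_sim, toy_situation_sim, 0.4, 0.4)
```

With these toy inputs two situations pass the first threshold and their mutual similarity passes the second, so they are fused and `history` shrinks from three situations to two.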

In some embodiments, the root cause determining unit is specifically configured to: calculate the importance of each alarm source in the situation according to the frequency with which the alarm source generates alarms, the time at which its alarm was generated, and the ranking value of the degree to which the alarm source is depended upon in the system call graph.

In a third aspect, an embodiment of the present application provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the steps of the method provided in the first aspect when executing the computer program.

FIG. 4 is a diagram illustrating the internal structure of a computer device in one embodiment. As shown in fig. 4, the computer device includes a processor, a memory, a network interface, an input device, a display screen, and the like, which are connected through a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and also stores a computer program which, when executed by the processor, enables the processor to implement the situation-based operation and maintenance fault root cause positioning method. The internal memory may also store a computer program which, when executed by the processor, enables the processor to execute the situation-based operation and maintenance fault root cause positioning method. The display screen of the computer device may be a liquid crystal display screen or an electronic ink display screen. The input device of the computer device may be a touch layer covering the display screen; a key, a trackball or a touch pad arranged on the housing of the computer device; or an external keyboard, touch pad or mouse, and the like.

Those skilled in the art will appreciate that the architecture shown in fig. 4 is merely a block diagram of some of the structures associated with the disclosed aspects and does not limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.

In one embodiment, the situation-based operation and maintenance fault root cause positioning apparatus provided in the present application may be implemented in the form of a computer program that runs on a computer device as shown in fig. 4. The memory of the computer device may store the program modules constituting the positioning apparatus, and the computer program constituted by these program modules causes the processor to execute the steps of the operation and maintenance fault root cause positioning method of the embodiments of the present application described in this specification.

In a fourth aspect, embodiments of the present application provide a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the steps of the method provided in the first aspect.

It is understood that the apparatus provided in the second aspect, the computer device provided in the third aspect, and the storage medium provided in the fourth aspect all correspond to the method provided in the first aspect, and for the explanation, examples, and beneficial effects of the related contents, etc., reference may be made to corresponding parts in the first aspect, and details are not described here.

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A situation-based operation and maintenance fault root cause positioning method is characterized by comprising the following steps:

inputting the alarm information into a machine learning model trained in advance to obtain a corresponding fault root cause;

according to the alarm information, calculating the similarity between the alarm and a plurality of historical situations; each history situation comprises alarm information corresponding to the history alarm in the corresponding history time period;

determining a situation for searching a fault root cause of the alarm according to the similarity between the alarm and a plurality of historical situations;

and calculating the importance of each warning source in the situation, and determining the fault root cause of the warning according to the importance of each warning source in the situation.

2. The method according to claim 1, wherein the calculating the similarity between the current alarm and a plurality of historical situations according to the alarm information comprises:

calculating a co-occurrence graph distance, a locality-sensitive hash distance and an alarm time difference between the alarm and each historical alarm in each historical situation according to the alarm information;

and calculating the similarity between the alarm and each historical situation according to the co-occurrence graph distance, the locality-sensitive hash distance and the alarm time difference between the alarm and each historical alarm in each historical situation.

3. The method according to claim 1, wherein the determining a situation for finding a fault root cause of the current alarm according to the similarities between the current alarm and the plurality of historical situations respectively comprises:

judging whether a history situation with the similarity greater than a first threshold exists or not:

and if not, establishing a new situation, adding the alarm information corresponding to the alarm into the new situation, and taking the new situation as the situation for searching the fault root cause of the alarm.

4. The method according to claim 3, wherein the determining a situation for finding a fault root cause of the current alarm according to the similarities between the current alarm and the plurality of historical situations respectively further comprises:

if the historical situations with the similarity larger than the first threshold value with the current alarm exist, judging whether the quantity of the historical situations with the similarity larger than the first threshold value with the current alarm is 1:

if so, adding the alarm information of the current alarm to the historical situation with the similarity greater than the first threshold value with the current alarm, and taking the historical situation with the added alarm information of the current alarm as the situation for searching the fault root cause of the current alarm.

5. The method according to claim 4, wherein the determining a situation for finding a fault root cause of the current alarm according to the similarities between the current alarm and the plurality of historical situations respectively further comprises:

if the number of the history situations with the similarity greater than the first threshold value is greater than 1, adding the alarm information of the alarm to the history situation with the highest similarity to the alarm, and judging whether the similarity between the history situation with the alarm information of the alarm and the history situation with the highest similarity to the alarm is greater than a second threshold value:

if yes, the historical situation of the alarm information added with the alarm and the historical situation with the highest similarity of the alarm are fused, and the situation obtained through fusion is used as the situation for searching the fault root cause of the alarm.

6. The method according to claim 5, wherein the determining a situation for finding a fault root cause of the current alarm according to the similarities between the current alarm and the plurality of historical situations respectively further comprises:

and if the similarity between the history situation added with the alarm information of the current alarm and the history situation with the highest similarity with the current alarm is less than or equal to a second threshold value, taking the history situation added with the alarm information of the current alarm as a situation for searching the fault root cause of the current alarm.

7. The method according to any one of claims 1 to 6, wherein the calculating the importance of each warning source in the situation comprises:

8. A situation-based operation and maintenance fault root cause positioning device is characterized by comprising:

9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any of claims 1 to 7 are implemented when the computer program is executed by the processor.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.