
CN110309009B - Situation-based operation and maintenance fault root cause positioning method, device, equipment and medium - Google Patents

Info

Publication number
CN110309009B
CN110309009B CN201910421407.4A CN201910421407A CN110309009B CN 110309009 B CN110309009 B CN 110309009B CN 201910421407 A CN201910421407 A CN 201910421407A CN 110309009 B CN110309009 B CN 110309009B
Authority
CN
China
Prior art keywords
alarm
situation
root cause
similarity
historical
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910421407.4A
Other languages
Chinese (zh)
Other versions
CN110309009A (en)
Inventor
姚斯宇
朱品燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Yunji Zhizao Technology Co ltd
Original Assignee
Beijing Yunji Zhizao Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Yunji Zhizao Technology Co ltd
Priority to CN201910421407.4A
Publication of CN110309009A
Application granted
Publication of CN110309009B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/32 Monitoring with visual or acoustical indication of the functioning of the machine
    • G06F11/324 Display of status information
    • G06F11/327 Alarm or error message display
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/20 Administration of product repair or maintenance

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Business, Economics & Management (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Human Resources & Organizations (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • General Business, Economics & Management (AREA)
  • Economics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Alarm Systems (AREA)

Abstract

The invention provides a situation-based method and device for locating the root cause of an operation and maintenance fault, together with computer equipment and a storage medium. The method comprises: acquiring the alarm information corresponding to an alarm; and inputting the alarm information into a pre-trained machine learning model to obtain the corresponding fault root cause. The machine learning model determines the fault root cause from the alarm information by: calculating, from the alarm information, the similarity between the alarm and each of a plurality of historical situations; determining, from those similarities, the situation in which to search for the fault root cause of the alarm; and calculating the importance of each alarm source in that situation and determining the fault root cause of the alarm from those importance values. The method places no high demands on operation and maintenance personnel, saves time and labor, takes the chain reaction of faults into account, analyzes root cause location from a global perspective, and improves the accuracy of root cause location.

Description

Situation-based operation and maintenance fault root cause positioning method, device, equipment and medium
Technical Field
The invention relates to the technical field of fault location, in particular to a situation-based operation and maintenance fault root cause location method and device, computer equipment and a storage medium.
Background
Various faults can occur while a service is running. Experienced operation and maintenance personnel often have to read a large amount of system alarm information and rely on domain knowledge to analyze and infer the fault root cause. This approach places high demands on the operation and maintenance personnel and consumes considerable manpower and material resources. In addition, because the personnel's domain knowledge is limited, the located fault root cause often deviates from the true one.
For this reason, practitioners have proposed finding fault root causes through sequence-analysis anomaly detection. A sequence prediction model, such as a moving average model or an LSTM sequence prediction model, is trained on historical data, and anomalies are detected from the difference between the value predicted by the model and the machine's actual metric: when the difference exceeds a threshold, the machine is considered faulty. This approach targets anomaly detection on one specific machine metric. When there are many machines, different models must be trained and run simultaneously, which consumes a large amount of computing power; the many models also generate a large number of false alarms, which costs operation and maintenance staff considerable effort. Moreover, sequence-analysis anomaly detection does not consider the chain reaction of faults: it models the associations among machines insufficiently and cannot detect associations between anomalies, so it cannot analyze root cause location from a global perspective.
Disclosure of Invention
In order to solve the above technical problems, or at least partially solve them, the invention provides a situation-based method and device for locating the root cause of an operation and maintenance fault, a computer device and a storage medium.
In a first aspect, the present invention provides a method for locating an operation and maintenance fault root cause based on a situation, including:
acquiring alarm information corresponding to the alarm, wherein the alarm information comprises an alarm source, alarm time and abnormal description information which are generated in the service operation process;
inputting the alarm information into a pre-trained machine learning model to obtain a corresponding fault root cause; wherein the process by which the machine learning model determines the fault root cause according to the alarm information comprises: calculating, according to the alarm information, the similarity between the alarm and a plurality of historical situations, each historical situation comprising the alarm information of the historical alarms in the corresponding historical time period; determining, according to the similarity between the alarm and the plurality of historical situations, the situation in which to search for the fault root cause of the alarm; and calculating the importance of each alarm source in the situation, and determining the fault root cause of the alarm according to the importance of each alarm source in the situation.
In a second aspect, the present invention provides a situation-based operation and maintenance fault root cause positioning apparatus, including:
the information acquisition module is used for acquiring alarm information corresponding to the alarm, wherein the alarm information comprises an alarm source, alarm time and abnormal description information which are generated in the service operation process;
the root cause determining module is used for inputting the alarm information into a pre-trained machine learning model to obtain a corresponding fault root cause; wherein the process by which the machine learning model determines the fault root cause according to the alarm information comprises: calculating, according to the alarm information, the similarity between the alarm and a plurality of historical situations, each historical situation comprising the alarm information of the historical alarms in the corresponding historical time period; determining, according to the similarity between the alarm and the plurality of historical situations, the situation in which to search for the fault root cause of the alarm; and calculating the importance of each alarm source in the situation, and determining the fault root cause of the alarm according to the importance of each alarm source in the situation.
In a third aspect, the present invention provides a computer device comprising a processor and a computer program stored on a memory and executable on the processor, the processor implementing the steps of the method when executing the computer program.
In a fourth aspect, the invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the above-described method.
The invention provides a situation-based operation and maintenance fault root cause positioning method and device, computer equipment and a storage medium. The whole process needs little involvement from operation and maintenance personnel, so it places no high demands on them and saves time and labor. Further, when determining the fault root cause of the current alarm, the machine learning model first calculates the similarity between the current alarm and a plurality of historical situations, then uses those similarities to determine the situation in which to search for the fault root cause, and finally determines the fault root cause within that situation. The fault root cause behind the current alarm may not be an alarm source of the current alarm; it may be an alarm source of another alarm, because a fault sets off a chain reaction that causes a series of abnormalities and therefore multiple alarms. The fault root cause is thus searched for not only in the current alarm but also in other alarms. Because the situation to search is chosen by similarity, the alarms in it are related to one another, which makes the finally determined fault root cause more accurate. The invention therefore takes the chain reaction of faults into account, that is, the correlation among multiple alarm sources, analyzes root cause location from a global perspective, and improves the accuracy of root cause location.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a situation-based operation and maintenance fault root cause positioning method in an embodiment of the present application;
FIG. 2 is a schematic flowchart of the process by which the machine learning model determines the fault root cause according to the alarm information in an embodiment of the present application;
FIG. 3 is a block diagram of a situation-based operation and maintenance fault root cause positioning device in an embodiment of the present application;
FIG. 4 is a block diagram of a computer device in an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In a first aspect, an embodiment of the present application provides a situation-based method for locating the root cause of an operation and maintenance fault. As shown in FIG. 1, the method includes:
S100, acquiring alarm information corresponding to the current alarm, wherein the alarm information comprises an alarm source, an alarm time and abnormal description information generated during service operation;
It can be understood that the alarm information is the abnormal log information generated while the service program is running. It generally includes an alarm source, an alarm time and abnormal description information, and may also include information such as an alarm level and a monitoring program type.
The alarm source is the machine on which the abnormality occurred, for example a server in the background server cluster of a shopping website. The alarm information of one alarm may contain several alarm sources, that is, abnormalities may occur on several machines.
The abnormal description information describes the abnormal condition, for example abnormal data of certain modules (such as processors) in the machine.
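As a minimal sketch, the alarm information described above could be held in a small record type. The class name `Alarm` and its field names are illustrative assumptions; the text does not prescribe any data layout.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Tuple


@dataclass(frozen=True)
class Alarm:
    """One alarm as described in S100 (all field names are hypothetical)."""
    sources: Tuple[str, ...]            # machines on which the abnormality occurred
    time: datetime                      # alarm time
    description: str                    # abnormal description information
    level: Optional[str] = None         # optional alarm level
    monitor_type: Optional[str] = None  # optional monitoring program type


alarm = Alarm(
    sources=("server-01", "server-02"),
    time=datetime(2019, 5, 21, 9, 0),
    description="processor usage abnormal on server-01",
)
```

Keeping `sources` as a tuple reflects the point that a single alarm may name several machines.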
S200, inputting the alarm information into a pre-trained machine learning model to obtain a corresponding fault root cause;
It can be understood that the alarm information of the current alarm is input into the pre-trained machine learning model, and the machine learning model outputs the fault root cause of the current alarm.
Wherein the process of the machine learning model determining the fault root cause according to the alarm information comprises:
S210, calculating, according to the alarm information, the similarity between the current alarm and each of a plurality of historical situations; each historical situation comprises the alarm information of the historical alarms in the corresponding historical time period;
It can be understood that a fault generally sets off a chain reaction in which several tasks or machines become abnormal and issue a series of alarms. Such a chain reaction caused by one fault is called a situation, so a situation is a way of describing the series of alarms produced by a fault. In other words, one situation may contain the alarm information of several alarms.
For example, if the current alarm occurs at 9:00 am on a certain day, all situations from the preceding seven days can be selected as historical situations; that is, all alarms, and hence all faults, in the preceding seven days are considered when analyzing the current alarm.
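The seven-day example can be sketched as a simple time filter. The function name, the representation of a situation as a list of alarm times, and the rule that a situation qualifies when at least one of its alarms falls inside the window are all assumptions made for illustration.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def select_historical_situations(
    situations: Dict[str, List[datetime]],
    now: datetime,
    lookback: timedelta = timedelta(days=7),
) -> List[str]:
    """Return ids of situations with at least one alarm in [now - lookback, now)."""
    start = now - lookback
    return [
        sid for sid, alarm_times in situations.items()
        if any(start <= t < now for t in alarm_times)
    ]


now = datetime(2019, 5, 21, 9, 0)
situations = {
    "s1": [datetime(2019, 5, 20, 8, 30)],  # yesterday: inside the window
    "s2": [datetime(2019, 5, 1, 12, 0)],   # three weeks earlier: outside the window
}
recent = select_historical_situations(situations, now)
```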
In practical applications, step S210 may specifically include the following steps:
S211, calculating, according to the alarm information, the co-occurrence graph distance, the locality-sensitive hash distance and the alarm time difference between the current alarm and each historical alarm in each historical situation;
Because a historical situation may contain several alarms, the similarity between the current alarm and a historical situation is calculated in two stages: first the similarity between the current alarm and each alarm in the historical situation is calculated, and then these values are combined into the similarity between the current alarm and the historical situation. For example, the average of the similarities between the current alarm and each alarm in the historical situation can be used as the similarity between the current alarm and the historical situation.
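The averaging rule given as an example can be sketched as follows. The function name and the toy pairwise similarity are assumptions; any alarm-to-alarm similarity can be passed in.

```python
from statistics import mean
from typing import Callable, Sequence, TypeVar

A = TypeVar("A")


def alarm_to_situation_similarity(
    alarm: A,
    situation: Sequence[A],
    alarm_similarity: Callable[[A, A], float],
) -> float:
    """Average the similarity between `alarm` and every alarm in `situation`."""
    if not situation:
        raise ValueError("a situation must contain at least one alarm")
    return mean(alarm_similarity(alarm, past) for past in situation)


# Toy similarity on plain numbers, for illustration only.
toy_similarity = lambda a, b: 1.0 / (1.0 + abs(a - b))
score = alarm_to_situation_similarity(10, [10, 11], toy_similarity)
```

With the toy inputs the two pairwise similarities are 1.0 and 0.5, so `score` is their mean.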
When calculating the similarity between the current alarm and an alarm in a historical situation, several indexes may be considered, such as the co-occurrence graph distance, the alarm time difference and the locality-sensitive hash distance (e.g., a simhash distance) between the two alarms.
The co-occurrence graph distance is determined from the alarm sources and the alarm co-occurrence graph. The basic assumption behind the alarm co-occurrence graph is that alarms which frequently occur together often have some causal relationship, so counting the co-occurrence of alarms over a period of time can effectively help situation clustering. The alarm co-occurrence graph consists of nodes and the edges connecting them. A node can be an alarm source, an alarm type, a fault described in the abnormal description information, or a combination of these. The weight of an edge is the reciprocal of the number of times its two nodes have co-occurred, so the smaller the distance between two nodes, the stronger the association between the alarms they represent. The graph is constructed as follows: a time window is set to buffer the alarms within that period; for example, if the window is one month, only the alarms in that month are considered and the co-occurrence graph is built from them. When a new alarm is received, an edge is added in the co-occurrence graph between the new alarm and every alarm in the current window, and the weights of all edges are updated. The co-occurrence graph distance between the current alarm and an alarm in a historical situation is therefore the length of the edge between the nodes that represent the two alarms.
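The construction described above can be sketched with a sliding buffer and a counter of co-occurrences. The class and method names are hypothetical, nodes are plain strings, and two simplifications are assumed: counts contributed by alarms that later leave the window are not decremented, and nodes with no shared edge are given an infinite distance.

```python
from collections import Counter, deque
from datetime import datetime, timedelta
from typing import Deque, Tuple


class CooccurrenceGraph:
    """Alarm co-occurrence graph over a sliding time window (illustrative sketch)."""

    def __init__(self, window: timedelta = timedelta(days=30)) -> None:
        self.window = window
        self.buffer: Deque[Tuple[datetime, str]] = deque()  # (alarm time, node)
        self.counts: Counter = Counter()                    # co-occurrences per edge

    def add_alarm(self, time: datetime, node: str) -> None:
        # Drop buffered alarms that have fallen out of the time window.
        while self.buffer and time - self.buffer[0][0] > self.window:
            self.buffer.popleft()
        # Add or strengthen an edge between the new alarm and each buffered alarm.
        for _, other in self.buffer:
            if other != node:
                self.counts[frozenset((node, other))] += 1
        self.buffer.append((time, node))

    def distance(self, a: str, b: str) -> float:
        """Edge weight is the reciprocal of the co-occurrence count."""
        if a == b:
            return 0.0
        count = self.counts.get(frozenset((a, b)), 0)
        return 1.0 / count if count else float("inf")


g = CooccurrenceGraph()
t0 = datetime(2019, 5, 1)
for minutes, node in [(0, "db"), (1, "api"), (2, "db"), (3, "api")]:
    g.add_alarm(t0 + timedelta(minutes=minutes), node)
```

In this toy run the `db` and `api` nodes co-occur four times, so the edge between them has weight 1/4.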
The locality-sensitive hash distance is determined from the abnormal description information in the alarm information. First, entity extraction is performed on the abnormal description information to obtain the entities in the current alarm; then a locality-sensitive hashing algorithm is used to calculate the locality-sensitive hash distance between the entities in the current alarm and the entities in an alarm in the historical situation. An entity is subject information appearing in the abnormal description information. Abnormal description information is often a mixture of Chinese and English, whereas traditional word segmentation tools can extract only Chinese or only English. In such mixed text the English words are usually verbs and nouns, and the nouns are usually the important entities. The English tokens, which may combine English letters, digits and/or special symbols, are therefore extracted, and the nouns among them are kept as entities. The locality-sensitive hash distance measures the difference between the entities in the abnormal description information of the current alarm and those of an alarm in the historical situation. Locality-sensitive hashing is commonly used to detect small modifications in text, and the alarms caused by one fault usually differ only slightly, so locality-sensitive hashing can find such alarms effectively.
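A minimal sketch of this step, assuming SimHash as the locality-sensitive hash (the text names simhash only as an example). The regular expression used for entity extraction is a stand-in: it keeps runs of ASCII letters, digits and a few symbols, and does not separate nouns from verbs as the text describes. All function names are hypothetical.

```python
import hashlib
import re
from typing import List


def extract_entities(description: str) -> List[str]:
    """Keep runs of ASCII letters, digits and a few symbols as candidate entities."""
    return re.findall(r"[A-Za-z0-9_./:-]+", description)


def simhash(tokens: List[str], bits: int = 64) -> int:
    """SimHash fingerprint: sum per-bit votes over token hashes, keep the sign."""
    votes = [0] * bits
    for token in tokens:
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if votes[i] > 0)


def simhash_distance(a: str, b: str) -> int:
    """Hamming distance between the fingerprints of two descriptions."""
    return bin(simhash(extract_entities(a)) ^ simhash(extract_entities(b))).count("1")


entities = extract_entities("服务器 server-01 CPU 异常")
```

Descriptions that share most of their entities yield fingerprints that differ in only a few bits, which is the property the text relies on.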
Thus the locality-sensitive hashing algorithm calculates the similarity between text entities, while the alarm co-occurrence graph makes full use of the co-occurrence relationship between alarms and compensates for the fact that locality-sensitive hashing uses only text information. Together they allow situations to be clustered from multiple angles.
S212, calculating the similarity between the current alarm and each historical situation according to the co-occurrence graph distance, the locality-sensitive hash distance and the alarm time difference between the current alarm and each historical alarm in that historical situation.
The similarity can be calculated in various ways. One of them is as follows. A threshold and a weight are set for each index: a co-occurrence graph distance threshold a1 and weight a2, a simhash distance threshold b1 and weight b2, and an alarm time difference threshold c1 and weight c2. Then, for the current alarm and an alarm in the historical situation, the difference between their co-occurrence graph distance and the threshold a1 is multiplied by the weight a2; the difference between their simhash distance and the threshold b1 is multiplied by the weight b2; and the difference between their alarm time difference and the threshold c1 is multiplied by the weight c2. Finally, the three products are summed to give the similarity between the current alarm and the alarm in the historical situation. This is only one way of calculating the similarity between alarms; other ways are not listed here.
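The weighted sum can be sketched as below. The text does not state in which direction each difference is taken; this sketch assumes threshold minus measured value, so that smaller distances give a higher similarity with positive weights. The class name, function name and parameter values are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class SimilarityParams:
    a1: float  # co-occurrence graph distance threshold
    a2: float  # co-occurrence graph distance weight
    b1: float  # simhash distance threshold
    b2: float  # simhash distance weight
    c1: float  # alarm time difference threshold
    c2: float  # alarm time difference weight


def alarm_similarity(graph_dist: float, simhash_dist: float,
                     time_diff: float, p: SimilarityParams) -> float:
    """Weighted sum over the three indexes.

    Assumption: each difference is (threshold - measured value).
    """
    return (p.a2 * (p.a1 - graph_dist)
            + p.b2 * (p.b1 - simhash_dist)
            + p.c2 * (p.c1 - time_diff))


params = SimilarityParams(a1=1.0, a2=2.0, b1=10.0, b2=0.1, c1=600.0, c2=0.001)
score = alarm_similarity(graph_dist=0.25, simhash_dist=4.0, time_diff=100.0, p=params)
```

With these made-up values the three products are 1.5, 0.6 and 0.5.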
S220, determining a situation for searching a fault root cause of the current alarm according to the similarity between the current alarm and a plurality of historical situations;
It can be understood that the fault root cause is the underlying cause of a fault. The fault root cause of the current alarm may not be an alarm source of the current alarm; it may be an alarm source of another alarm, because a fault can set off a chain reaction that causes a series of abnormalities and therefore multiple alarms. When looking for the fault root cause of the current alarm, the search therefore covers not only the current alarm but also other alarms. Here a situation is determined first and the fault root cause is then searched for within it, which amounts to searching among a set of associated alarms.
In practical applications, as shown in FIG. 2, step S220 may include the following steps:
S221, judging whether there is a historical situation whose similarity to the current alarm is greater than a first threshold; if not, creating a new situation, adding the alarm information of the current alarm to the new situation, and taking the new situation as the situation in which to search for the fault root cause of the current alarm.
It can be understood that if the similarity between a historical situation and the current alarm is high, the historical situation closely resembles the current alarm, and it may then be considered whether to search for the fault root cause in that historical situation.
The first threshold is determined during training of the machine learning model, and can be adjusted according to feedback on whether the fault root cause finally determined by the method provided in the present application is the real fault root cause.
It can be understood that after a new situation is created and the alarm information of the current alarm is added to it, the new situation contains only the current alarm.
It can be understood that there may or may not be a historical situation whose similarity to the current alarm is greater than the first threshold. When there is none, all historical situations differ greatly from the current alarm, so the fault root cause is searched for among the alarm sources of the current alarm.
In practical applications, there may be one or several historical situations whose similarity to the current alarm is greater than the first threshold. When there is only one, the fault root cause may be searched for in the current alarm together with that historical situation. That is, as shown in FIG. 2, step S220 may further include the following steps:
S222, if there is a historical situation whose similarity is greater than the first threshold, judging whether the number of such historical situations is 1; if so, adding the alarm information of the current alarm to that historical situation, and taking the historical situation with the added alarm information as the situation in which to search for the fault root cause of the current alarm.
It can be understood that when only one historical situation has a similarity to the current alarm greater than the first threshold, that historical situation is updated, that is, the alarm information of the current alarm is added to it, and the updated historical situation is then used as the situation in which to search for the fault root cause of the current alarm. Updating a historical situation in this way is in effect a clustering of alarms into situations.
However, if several historical situations have a similarity to the current alarm greater than the first threshold, the most similar ones can be selected for fusion and the fault root cause searched for in the fused situation. That is, as shown in FIG. 2, step S220 may further include the following steps:
S223, if the number of historical situations whose similarity is greater than the first threshold is greater than 1, adding the alarm information of the current alarm to the historical situation with the highest similarity to the current alarm, and judging whether the similarity between the historical situation to which the alarm information has been added and the historical situation with the second highest similarity to the current alarm is greater than a second threshold; if so, fusing the historical situation to which the alarm information has been added with the historical situation with the second highest similarity to the current alarm, and taking the fused situation as the situation in which to search for the fault root cause of the current alarm.
It can be understood that after the alarm information of the current alarm is added to the historical situation with the highest similarity to the current alarm, the similarity between that updated historical situation and the historical situation with the second highest similarity to the current alarm is calculated, and it is then judged whether this similarity is greater than the second threshold.
The historical situation to which the alarm information of the current alarm has been added is the one that had the highest similarity to the current alarm before the addition. The second highest similarity is lower only than the highest. Only two historical situations are therefore considered here: the one with the highest similarity to the current alarm and the one with the second highest similarity to the current alarm.
It will be appreciated that the primary role of situation fusion is to merge similar situations into one. The alarms produced at the start of some faults differ considerably and may be clustered into several situations; over time the contents of these situations tend to become similar. Situation fusion merges such situations, reducing the troubleshooting workload of operation and maintenance personnel.
If the similarity between the historical situation to which the alarm information of the current alarm has been added and the historical situation with the second highest similarity to the current alarm is high, the two situations are very similar; they are therefore fused, and the fused situation is used as the situation in which to search for the fault root cause of the current alarm. If the similarity between the two situations is low, they are not suitable for fusion, and it is more appropriate to search for the fault root cause only in the historical situation to which the alarm information of the current alarm has been added. That is, as shown in FIG. 2, step S220 may further include the following steps:
S224, if the similarity between the historical situation to which the alarm information of the current alarm has been added and the historical situation with the second highest similarity to the current alarm is less than or equal to the second threshold, taking the historical situation to which the alarm information of the current alarm has been added as the situation in which to search for the fault root cause of the current alarm.
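Steps S221 to S224 can be sketched as one decision function. Situations are modelled as plain lists of alarms, and the two similarity functions are passed in as parameters; the function name, the parameter names and the toy similarity functions are all assumptions for illustration.

```python
from typing import Callable, List, TypeVar

A = TypeVar("A")
Situation = List[A]


def choose_situation(
    alarm: A,
    history: List[Situation],
    alarm_to_situation: Callable[[A, Situation], float],
    situation_to_situation: Callable[[Situation, Situation], float],
    first_threshold: float,
    second_threshold: float,
) -> Situation:
    """Return the situation in which to search for the root cause (S221 to S224)."""
    scored = [(alarm_to_situation(alarm, s), s) for s in history]
    candidates = sorted(
        (pair for pair in scored if pair[0] > first_threshold),
        key=lambda pair: pair[0],
        reverse=True,
    )
    if not candidates:                 # S221: no similar situation, open a new one
        new_situation = [alarm]
        history.append(new_situation)
        return new_situation
    best = candidates[0][1]
    best.append(alarm)                 # S222 / S223: join the most similar situation
    if len(candidates) == 1:
        return best
    second = candidates[1][1]
    if situation_to_situation(best, second) > second_threshold:
        best.extend(second)            # S223: fuse with the second most similar
        history[:] = [s for s in history if s is not second]
    return best                        # S224 when no fusion took place


# Toy stand-ins on plain numbers, for illustration only.
near = lambda a, s: max(1.0 / (1.0 + abs(a - x)) for x in s)
link = lambda s, t: min(1.0 / (1.0 + abs(x - y)) for x in s for y in t)
history = [[10, 11], [12], [100]]
chosen = choose_situation(11, history, near, link, 0.4, 0.3)
```

In the toy run two situations pass the first threshold, the alarm joins `[10, 11]`, and that situation is then fused with `[12]` because their similarity (1/3) exceeds the second threshold.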
The greater the similarity between two situations, the smaller the distance between them. The similarity between two situations can be calculated from the similarity between the alarms they contain. Because a situation may contain multiple alarms, the lowest similarity between an alarm in one situation and an alarm in the other can be used as the similarity between the two situations.
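A short sketch of this rule, which takes the minimum over all cross-situation alarm pairs. The function name and the toy alarm similarity are assumptions.

```python
from typing import Callable, Sequence, TypeVar

A = TypeVar("A")


def situation_similarity(
    first: Sequence[A],
    second: Sequence[A],
    alarm_similarity: Callable[[A, A], float],
) -> float:
    """Lowest pairwise alarm similarity between two non-empty situations."""
    return min(alarm_similarity(x, y) for x in first for y in second)


# Toy alarm similarity on plain numbers, for illustration only.
toy = lambda a, b: 1.0 / (1.0 + abs(a - b))
value = situation_similarity([10, 11], [12, 14], toy)
```

Using the minimum is conservative: two situations count as similar only if even their least alike alarms are similar.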
S230: calculating the importance of each alarm source in the situation, and determining the fault root cause of the current alarm according to the importance of each alarm source in the situation.
It will be appreciated that the more important an alarm source is in a situation, the more likely it is to be the root cause of the fault.
In practical applications, several factors may be considered when calculating the importance of an alarm source, such as the PageRank value of the alarm source on the system call graph (which may also be called the depended-degree ranking value of the alarm source on the system call graph), the alarm time, and the frequency with which the alarm source generates alarms. That is, step S230 may specifically include:
calculating the importance of each alarm source in the situation according to the frequency with which the alarm source generates alarms, the alarm time, and the depended-degree ranking value of the alarm source in the system call graph.
The system call graph is the real call-relationship graph of the actual machines on which the service is deployed. Alarms generated by nodes of high importance in the system call graph are often more important, so the system call graph represents the importance of an alarm more effectively. Few algorithms in existing applications take full advantage of the system call graph for root cause recommendation. The higher the ranking value of an alarm source in the system call graph, the more important the alarm source is; that is, the alarm source corresponding to a node with a higher PageRank value is more likely to be the root cause of the fault.
The reason for considering the alarm time is that the earlier an alarm is generated, the more likely it is to be the root cause. The reason for considering the frequency with which an alarm source generates alarms is that the lower the alarm frequency, the more likely the alarm source corresponding to the node is to be the root cause of the fault.
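A minimal Python sketch of how the three factors might be combined is given below. The text does not specify the scoring formula, so the weighted sum, the way each factor is scaled into the range 0 to 1, the equal default weights, and the names `pagerank` and `rank_alarm_sources` are all assumptions made for illustration. The call graph is assumed to map each caller to the nodes it calls, so that heavily depended-upon nodes receive a high PageRank value.

```python
from typing import Dict, List, Tuple


def pagerank(call_graph: Dict[str, List[str]], damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Plain power-iteration PageRank; edges point from caller to callee."""
    nodes = sorted(set(call_graph) | {t for ts in call_graph.values() for t in ts})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = call_graph.get(v, [])
            # A node that calls nothing spreads its rank over all nodes.
            receivers = targets if targets else nodes
            share = damping * rank[v] / len(receivers)
            for t in receivers:
                new[t] += share
        rank = new
    return rank


def rank_alarm_sources(alarms: List[Tuple[str, float]],
                       call_graph: Dict[str, List[str]],
                       weights: Tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3),
                       ) -> List[Tuple[str, float]]:
    """Score each alarm source in a situation; alarms are (source, time) pairs."""
    rank = pagerank(call_graph)
    first_time: Dict[str, float] = {}
    count: Dict[str, int] = {}
    for source, time in alarms:
        first_time[source] = min(time, first_time.get(source, time))
        count[source] = count.get(source, 0) + 1
    t_min, t_max = min(first_time.values()), max(first_time.values())
    span = t_max - t_min
    max_rank = max(rank.get(s, 0.0) for s in count)
    min_count = min(count.values())
    w_rank, w_time, w_freq = weights
    scores = []
    for s in count:
        rank_score = rank.get(s, 0.0) / max_rank if max_rank > 0 else 0.0
        time_score = 1.0 - (first_time[s] - t_min) / span if span > 0 else 1.0
        freq_score = min_count / count[s]  # fewer alarms gives a higher score
        scores.append((s, w_rank * rank_score + w_time * time_score + w_freq * freq_score))
    return sorted(scores, key=lambda item: item[1], reverse=True)
```

The source at the head of the returned list is the candidate fault root cause; in a call chain such as web to api to db, a database node that alarms first and only once would score highest under these assumptions.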
Because three factors are considered when calculating the importance of an alarm source, each factor has a weight value; the weight values can be determined during training of the machine learning model and adjusted while the model is in use. The weight values of the three factors should sum to 1. When new experience is added, that is, when new alarm information is input, the sum of the weight values may change; the weight values then need to be normalized so that they again sum to 1.
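The normalization step can be illustrated with a short Python sketch. Dividing each weight by the sum is one common way to normalize and is assumed here; the factor names used as dictionary keys are hypothetical.

```python
from typing import Dict


def normalize_weights(weights: Dict[str, float]) -> Dict[str, float]:
    """Rescale the factor weights so that they sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("the weights must have a positive sum")
    return {name: value / total for name, value in weights.items()}
```

For example, if an update leaves the weights at 0.5, 0.3, and 0.4 (sum 1.2), each weight is divided by 1.2 so that the three again sum to 1 while keeping their relative proportions.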
In actual application scenarios, different service programs and different operation and maintenance personnel divide faults at very different granularities, so a single set of parameters cannot fit all cases. Meanwhile, while the program is running, the invention can dynamically adjust the parameters based on feedback on the results, thereby achieving a better effect. Feedback on the results indicates, for example, whether the alarm source output by the machine learning model is the true fault root cause, and how similar to or different from the true fault root cause it is. The parameters involved include, for example, the first threshold, the second threshold, the time window of the co-occurrence graph, the co-occurrence graph distance threshold, the simHash threshold, the alarm time difference threshold, and the normalization parameters or weight values applied to different experiences during recommendation. These parameters may be obtained by grid search, and during operation they can be adjusted promptly according to feedback on the root cause, so that the model achieves a better effect.
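A generic grid search over such parameters might look like the following Python sketch. The text does not describe how candidate parameter sets are scored, so the `score` callable is an assumption: in practice it could be, for instance, the fraction of recommended root causes that the feedback confirmed as correct. The parameter names in the usage note are hypothetical.

```python
from itertools import product
from typing import Callable, Dict, Sequence, Tuple


def grid_search(
    param_grid: Dict[str, Sequence[float]],
    score: Callable[[Dict[str, float]], float],
) -> Tuple[Dict[str, float], float]:
    """Try every combination in the grid and keep the best-scoring one."""
    names = list(param_grid)
    best_params: Dict[str, float] = {}
    best_score = float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        current = score(params)
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score
```

A call such as `grid_search({"first_threshold": [0.5, 0.6, 0.7], "second_threshold": [0.4, 0.5]}, score)` evaluates all six combinations. Because the cost grows with the product of the grid sizes, coarse grids are usually tried first.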
In the operation and maintenance fault root cause positioning method described above, the alarm information of the current alarm is first obtained and then input into the machine learning model, which determines the fault root cause of the current alarm from that information. The process requires little involvement from operation and maintenance personnel, so it places no high demands on them and saves time and labor. Further, in determining the fault root cause of the current alarm, the machine learning model first calculates the similarity between the current alarm and each of a plurality of historical situations, then determines, according to those similarities, the situation in which to search for the fault root cause, and finally determines the fault root cause within that situation. The root cause of the current alarm is not necessarily the alarm source of the current alarm; it may be the alarm source of another alarm, because a fault sets off a chain reaction that can cause a series of abnormalities and therefore multiple alarms. The fault root cause is thus searched for not only in the current alarm but also in other alarms: a situation for the search is determined according to similarity, and because the alarms in that situation are related to one another, the fault root cause finally determined is more accurate. The present application therefore takes the chain effect of faults into account, that is, the correlation among multiple alarm sources, and analyzes root cause positioning from a global perspective, which improves its accuracy.
That is to say, the present application considers multiple machines together rather than being confined to a single machine. This can greatly reduce incorrect root cause recommendations or early warnings and lower the cost of manual troubleshooting. Moreover, the present application is applicable to fault root cause positioning in a variety of application scenarios.
In a second aspect, an embodiment of the present application provides a situation-based operation and maintenance fault root cause locating apparatus, as shown in fig. 3, the apparatus 300 includes:
an information obtaining module 310, configured to obtain alarm information corresponding to the alarm, where the alarm information includes an alarm source, alarm time, and exception description information generated in a service operation process;
a root cause determining module 320, configured to input the alarm information into a pre-trained machine learning model to obtain a corresponding fault root cause; wherein the process by which the machine learning model determines the fault root cause according to the alarm information comprises: calculating, according to the alarm information, the similarity between the current alarm and each of a plurality of historical situations, each historical situation comprising the alarm information corresponding to the historical alarms in the corresponding historical time period; determining a situation in which to search for the fault root cause of the current alarm according to the similarities between the current alarm and the plurality of historical situations; and calculating the importance of each alarm source in the situation, and determining the fault root cause of the current alarm according to the importance of each alarm source in the situation.
That is, the machine learning model includes the following units:
the similarity calculation unit is used for calculating the similarity between the alarm and a plurality of historical situations respectively according to the alarm information; each history situation comprises alarm information corresponding to the history alarm in the corresponding history time period;
the situation determining unit is used for determining the situation of searching the fault root cause of the current alarm according to the similarity between the current alarm and a plurality of historical situations;
and the root cause determining unit is used for calculating the importance of each alarm source in the situation and determining the fault root cause of the current alarm according to the importance of each alarm source in the situation.
In some embodiments, the similarity calculation unit is specifically configured to: calculate, according to the alarm information, a co-occurrence graph distance, a locality-sensitive hash distance, and an alarm time difference between the current alarm and each historical alarm in each historical situation; and calculate the similarity between the current alarm and each historical situation according to the co-occurrence graph distance, the locality-sensitive hash distance, and the alarm time difference between the current alarm and each historical alarm in that historical situation.
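The following Python sketch shows one way the three distances could be turned into a similarity. It is only an illustration under stated assumptions: the text does not give the combining formula, so the equal-weight average, the 300-second `time_scale`, the use of the minimum over the historical alarms, and the word-level SimHash built on SHA-256 are all choices made for this example. The co-occurrence graph distance is assumed to be supplied by a callable that returns a value between 0 and 1 for a pair of alarm sources.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Sequence

HASH_BITS = 64


@dataclass
class Alarm:
    source: str
    time: float        # seconds
    description: str   # exception description text


def simhash(text: str) -> int:
    """64-bit SimHash over whitespace-separated tokens."""
    counts = [0] * HASH_BITS
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        value = int.from_bytes(digest[:8], "big")
        for bit in range(HASH_BITS):
            counts[bit] += 1 if (value >> bit) & 1 else -1
    return sum(1 << bit for bit in range(HASH_BITS) if counts[bit] > 0)


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def alarm_similarity(a: Alarm, b: Alarm,
                     graph_distance: Callable[[str, str], float],
                     time_scale: float = 300.0) -> float:
    """1 minus the average of three distances, each scaled into [0, 1]."""
    d_graph = min(max(graph_distance(a.source, b.source), 0.0), 1.0)
    d_hash = hamming_distance(simhash(a.description), simhash(b.description)) / HASH_BITS
    d_time = min(abs(a.time - b.time) / time_scale, 1.0)
    return 1.0 - (d_graph + d_hash + d_time) / 3.0


def alarm_to_situation_similarity(alarm: Alarm, situation: Sequence[Alarm],
                                  graph_distance: Callable[[str, str], float]) -> float:
    """Assumed aggregation: the lowest similarity to any historical alarm."""
    return min(alarm_similarity(alarm, past, graph_distance) for past in situation)
```

SimHash is used here because near-identical descriptions produce hashes that differ in only a few bits, so the Hamming distance serves as a cheap text distance.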
In some embodiments, the situation determining unit is specifically configured to: judge whether there is a historical situation whose similarity to the current alarm is greater than a first threshold; and if not, establish a new situation, add the alarm information corresponding to the current alarm to the new situation, and take the new situation as the situation in which to search for the fault root cause of the current alarm.
In some embodiments, the situation determining unit is further specifically configured to: if there are historical situations whose similarity to the current alarm is greater than the first threshold, judge whether the number of such historical situations is 1; and if so, add the alarm information of the current alarm to that historical situation, and take the historical situation to which the alarm information of the current alarm has been added as the situation in which to search for the fault root cause of the current alarm.
In some embodiments, the situation determining unit is further specifically configured to: if the number of historical situations whose similarity to the current alarm is greater than the first threshold is greater than 1, add the alarm information of the current alarm to the historical situation with the highest similarity to the current alarm, and judge whether the similarity between the historical situation to which the alarm information of the current alarm has been added and the historical situation with the highest similarity to the current alarm is greater than a second threshold; and if so, fuse the historical situation to which the alarm information of the current alarm has been added with the historical situation with the highest similarity to the current alarm, and take the situation obtained by the fusion as the situation in which to search for the fault root cause of the current alarm.
In some embodiments, the situation determining unit is further specifically configured to: if the similarity between the historical situation to which the alarm information of the current alarm has been added and the historical situation with the highest similarity to the current alarm is less than or equal to the second threshold, take the historical situation to which the alarm information of the current alarm has been added as the situation in which to search for the fault root cause of the current alarm.
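The branching performed by the situation determining unit can be sketched in Python as follows. This is an illustrative reading rather than the definitive logic: the function name `choose_situation`, the representation of a situation as a list of alarms, and in particular the assumption that the fusion candidate is the historical situation with the second-highest similarity to the current alarm (following the description's remark that only the highest and second-highest situations are considered) are assumptions of this example.

```python
from typing import Callable, List, TypeVar

A = TypeVar("A")  # whatever type represents one alarm


def choose_situation(
    alarm: A,
    situations: List[List[A]],
    similarities: List[float],
    first_threshold: float,
    second_threshold: float,
    situation_similarity: Callable[[List[A], List[A]], float],
) -> List[A]:
    """Return the situation in which to search for the root cause.

    similarities[i] is the similarity of the alarm to situations[i].
    The list of situations is updated in place.
    """
    candidates = [i for i, s in enumerate(similarities) if s > first_threshold]
    if not candidates:
        # No sufficiently similar historical situation: start a new one.
        new_situation = [alarm]
        situations.append(new_situation)
        return new_situation
    candidates.sort(key=lambda i: similarities[i], reverse=True)
    best = candidates[0]
    situations[best].append(alarm)
    if len(candidates) == 1:
        return situations[best]
    # Assumption: compare with the second most similar historical situation.
    second = candidates[1]
    if situation_similarity(situations[best], situations[second]) > second_threshold:
        merged = situations[best] + situations[second]
        situations[best] = merged
        del situations[second]
        return merged
    return situations[best]
```

Updating the list in place keeps later alarms consistent with the fused state, since a merged situation replaces the two situations it was built from.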
In some embodiments, the root cause determining unit is specifically configured to: calculate the importance of each alarm source in the situation according to the frequency with which the alarm source generates alarms, the alarm time, and the depended-degree ranking value of the alarm source in the system call graph.
In a third aspect, an embodiment of the present application provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the steps of the method provided in the first aspect when executing the computer program.
FIG. 4 is a diagram illustrating the internal structure of a computer device in one embodiment. As shown in FIG. 4, the computer device includes a processor, a memory, a network interface, an input device, a display screen, and the like, which are connected through a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and also stores a computer program which, when executed by the processor, causes the processor to implement the situation-based operation and maintenance fault root cause positioning method. The internal memory may also store a computer program which, when executed by the processor, causes the processor to execute the situation-based operation and maintenance fault root cause positioning method. The display screen of the computer device may be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer device may be a touch layer covering the display screen; a key, trackball, or touch pad arranged on the housing of the computer device; or an external keyboard, touch pad, mouse, or the like.
Those skilled in the art will appreciate that the architecture shown in FIG. 4 is merely a block diagram of part of the structure related to the disclosed solution and does not limit the computer devices to which the disclosed solution applies; a particular computer device may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.
In one embodiment, the situation-based operation and maintenance fault root cause locating apparatus provided in the present application may be implemented in the form of a computer program that can run on a computer device as shown in FIG. 4. The memory of the computer device may store the program modules constituting the locating apparatus, and the computer program composed of these program modules causes the processor to execute the steps of the operation and maintenance fault root cause positioning method of the embodiments of the present application described in this specification.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the steps of the method provided in the first aspect.
It is understood that the apparatus provided in the second aspect, the computer device provided in the third aspect, and the storage medium provided in the fourth aspect all correspond to the method provided in the first aspect, and for the explanation, examples, and beneficial effects of the related contents, etc., reference may be made to corresponding parts in the first aspect, and details are not described here.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A situation-based operation and maintenance fault root cause positioning method is characterized by comprising the following steps:
acquiring alarm information corresponding to the current alarm, wherein the alarm information comprises an alarm source, an alarm time, and exception description information generated during service operation;
inputting the alarm information into a pre-trained machine learning model to obtain a corresponding fault root cause;
wherein the process of the machine learning model determining the fault root cause according to the alarm information comprises:
according to the alarm information, calculating the similarity between the alarm and a plurality of historical situations; each history situation comprises alarm information corresponding to the history alarm in the corresponding history time period;
determining a situation for searching a fault root cause of the alarm according to the similarity between the alarm and a plurality of historical situations;
and calculating the importance of each alarm source in the situation, and determining the fault root cause of the current alarm according to the importance of each alarm source in the situation.
2. The method according to claim 1, wherein the calculating the similarity between the current alarm and a plurality of historical situations according to the alarm information comprises:
calculating, according to the alarm information, a co-occurrence graph distance, a locality-sensitive hash distance, and an alarm time difference between the current alarm and each historical alarm in each historical situation;
and calculating the similarity between the current alarm and each historical situation according to the co-occurrence graph distance, the locality-sensitive hash distance, and the alarm time difference between the current alarm and each historical alarm in that historical situation.
3. The method according to claim 1, wherein the determining a situation for finding a fault root cause of the current alarm according to the similarities between the current alarm and the plurality of historical situations respectively comprises:
judging whether there is a historical situation whose similarity to the current alarm is greater than a first threshold:
and if not, establishing a new situation, adding the alarm information corresponding to the alarm into the new situation, and taking the new situation as the situation for searching the fault root cause of the alarm.
4. The method according to claim 3, wherein the determining a situation for finding a fault root cause of the current alarm according to the similarities between the current alarm and the plurality of historical situations respectively further comprises:
if the historical situations with the similarity larger than the first threshold value with the current alarm exist, judging whether the quantity of the historical situations with the similarity larger than the first threshold value with the current alarm is 1:
if so, adding the alarm information of the current alarm to the historical situation with the similarity greater than the first threshold value with the current alarm, and taking the historical situation with the added alarm information of the current alarm as the situation for searching the fault root cause of the current alarm.
5. The method according to claim 4, wherein the determining a situation for finding a fault root cause of the current alarm according to the similarities between the current alarm and the plurality of historical situations respectively further comprises:
if the number of the history situations with the similarity greater than the first threshold value is greater than 1, adding the alarm information of the alarm to the history situation with the highest similarity to the alarm, and judging whether the similarity between the history situation with the alarm information of the alarm and the history situation with the highest similarity to the alarm is greater than a second threshold value:
if yes, the historical situation of the alarm information added with the alarm and the historical situation with the highest similarity of the alarm are fused, and the situation obtained through fusion is used as the situation for searching the fault root cause of the alarm.
6. The method according to claim 5, wherein the determining a situation for finding a fault root cause of the current alarm according to the similarities between the current alarm and the plurality of historical situations respectively further comprises:
and if the similarity between the history situation added with the alarm information of the current alarm and the history situation with the highest similarity with the current alarm is less than or equal to a second threshold value, taking the history situation added with the alarm information of the current alarm as a situation for searching the fault root cause of the current alarm.
7. The method according to any one of claims 1 to 6, wherein the calculating the importance of each alarm source in the situation comprises:
calculating the importance of each alarm source in the situation according to the frequency with which the alarm source generates alarms, the alarm time, and the depended-degree ranking value of the alarm source in the system call graph.
8. A situation-based operation and maintenance fault root cause positioning device is characterized by comprising:
the information acquisition module is used for acquiring alarm information corresponding to the current alarm, wherein the alarm information comprises an alarm source, an alarm time, and exception description information generated during service operation;
the root cause determining module is used for inputting the alarm information into a pre-trained machine learning model to obtain a corresponding fault root cause; wherein the process by which the machine learning model determines the fault root cause according to the alarm information comprises: calculating, according to the alarm information, the similarity between the current alarm and each of a plurality of historical situations, each historical situation comprising the alarm information corresponding to the historical alarms in the corresponding historical time period; determining a situation for finding the fault root cause of the current alarm according to the similarities between the current alarm and the plurality of historical situations; and calculating the importance of each alarm source in the situation, and determining the fault root cause of the current alarm according to the importance of each alarm source in the situation.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any of claims 1 to 7 are implemented when the computer program is executed by the processor.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN201910421407.4A 2019-05-21 2019-05-21 Situation-based operation and maintenance fault root cause positioning method, device, equipment and medium Active CN110309009B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910421407.4A CN110309009B (en) 2019-05-21 2019-05-21 Situation-based operation and maintenance fault root cause positioning method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910421407.4A CN110309009B (en) 2019-05-21 2019-05-21 Situation-based operation and maintenance fault root cause positioning method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN110309009A CN110309009A (en) 2019-10-08
CN110309009B true CN110309009B (en) 2022-05-13

Family

ID=68075535

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910421407.4A Active CN110309009B (en) 2019-05-21 2019-05-21 Situation-based operation and maintenance fault root cause positioning method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN110309009B (en)

Families Citing this family (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112953738B (en) * 2019-11-26 2022-06-10 中国移动通信集团山东有限公司 Root cause alarm positioning system, method, device and computer equipment
CN111158977B (en) * 2019-12-12 2023-07-11 深圳前海微众银行股份有限公司 Abnormal event root cause positioning method and device
CN111459695B (en) * 2020-03-12 2024-09-27 平安科技(深圳)有限公司 Root cause positioning method, root cause positioning device, computer equipment and storage medium
CN113407370B (en) * 2020-03-16 2024-07-19 中国移动通信有限公司研究院 Root cause error clustering method, root cause error clustering device, root cause error clustering equipment and computer readable storage medium
CN113497716B (en) 2020-03-18 2023-03-10 华为技术有限公司 Recommended methods and related equipment for similar faults
CN111641519B (en) * 2020-04-30 2022-10-11 平安科技(深圳)有限公司 Abnormal root cause positioning method, device and storage medium
CN113872780A (en) * 2020-06-30 2021-12-31 大唐移动通信设备有限公司 Fault root cause analysis method, device and storage medium
CN112003718B (en) * 2020-09-25 2021-07-27 南京邮电大学 A network alarm location method based on deep learning
CN112181758B (en) * 2020-08-19 2023-07-28 南京邮电大学 A fault root cause location method based on network topology and real-time alarm
CN112087334B (en) * 2020-09-09 2022-10-18 中移(杭州)信息技术有限公司 Alarm root cause analysis method, electronic device and storage medium
CN114285730A (en) * 2020-09-18 2022-04-05 华为技术有限公司 Method, apparatus and related equipment for determining the root cause of failure
CN112866010B (en) * 2021-01-04 2023-01-20 聚好看科技股份有限公司 Fault positioning method and device
CN113740666B (en) * 2021-08-27 2022-12-09 西安交通大学 A method for locating root faults of alarm storm in data center power system
CN115981897A (en) * 2021-10-14 2023-04-18 北京字节跳动网络技术有限公司 Stack processing method and device
CN114036826B (en) * 2021-10-29 2025-08-22 深圳前海微众银行股份有限公司 Model training method, root cause determination method, device, equipment and storage medium
CN114237962B (en) * 2021-12-21 2024-05-14 中国电信股份有限公司 Alarm root cause judging method, model training method, device, equipment and medium
CN114325232B (en) * 2021-12-28 2023-07-25 微梦创科网络科技(中国)有限公司 A fault location method and device
CN114513802B (en) * 2022-01-04 2023-06-09 武汉烽火技术服务有限公司 Method and device for analyzing bearing network faults based on event stream
CN114090326B (en) * 2022-01-14 2022-06-03 云智慧(北京)科技有限公司 Alarm root cause determination method, device and equipment
CN114564580A (en) * 2022-02-15 2022-05-31 北京云集智造科技有限公司 An Adaptive Alarm Aggregation Method Based on Knowledge Graph
CN114742247B (en) * 2022-04-08 2024-10-22 广东电网有限责任公司 Feature extraction method and device based on distribution network distribution variation normal alarm information
CN114944956B (en) * 2022-05-27 2024-07-09 深信服科技股份有限公司 Attack link detection method and device, electronic equipment and storage medium
CN115174251B (en) * 2022-07-19 2023-09-05 深信服科技股份有限公司 False alarm identification method and device for safety alarm and storage medium
CN116582410B (en) * 2023-05-24 2023-10-27 青岛海信信息科技股份有限公司 Intelligent operation and maintenance service method and device based on ITSM system
CN116681968A (en) * 2023-05-29 2023-09-01 平安科技(深圳)有限公司 Root Cause Analysis Method, Device, Equipment and Medium Based on Graph Comparison
CN120256186A (en) * 2025-06-03 2025-07-04 苏州元脑智能科技有限公司 Server downtime analysis method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102638100A (en) * 2012-04-05 2012-08-15 华北电力大学 District power network equipment abnormal alarm signal association analysis and diagnosis method
WO2015051638A1 (en) * 2013-10-08 2015-04-16 华为技术有限公司 Fault location method and device
CN107770797A (en) * 2016-08-17 2018-03-06 中国移动通信集团内蒙古有限公司 Correlation analysis method and system for wireless network alarm management
CN108170702A (en) * 2017-11-15 2018-06-15 国网河北省电力有限公司信息通信分公司 A kind of power communication alarm association model based on statistical analysis

Also Published As

Publication number Publication date
CN110309009A (en) 2019-10-08

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant