CN108446184B

CN108446184B - Method and system for analyzing fault root cause

Info

Publication number: CN108446184B
Application number: CN201810155161.6A
Authority: CN
Inventors: 张银霞; 付铁山
Original assignee: Beijing Tianyuan Innovation Technology Co ltd
Current assignee: Beijing Tianyuan Innovation Technology Co ltd
Priority date: 2018-02-23
Filing date: 2018-02-23
Publication date: 2021-09-07
Anticipated expiration: 2038-02-23
Also published as: CN108446184A

Abstract

The invention provides a method and a system for analyzing a fault root cause. The method comprises the following steps: sorting the data in a data set according to the time attribute of each data item, and segmenting the data set according to a preset time window to obtain a plurality of sub data sets; acquiring frequent item sets and association rules in the data set according to the Apriori algorithm, wherein a frequent item set comprises a certain number of strongly associated data items; and sorting the data in the frequent item set according to the time attribute, sequentially matching the earliest-ranked data against the accompanying alarm reason data prestored in the alarm reason database, removing the data from the data set if the matching succeeds and continuing with the next data item, until the matching fails, whereupon the earliest-ranked unmatched data is taken as the root cause of the data that is last in time order in the frequent item set. The invention is suitable for full-dimensional monitoring of an IT system, relieves the pressure on operation and maintenance personnel, and lowers the skill requirements placed on them.

Description

Method and system for analyzing fault root cause

Technical Field

The present invention relates to the field of data mining technologies, and in particular, to a method and a system for analyzing a root cause of a fault.

Background

Under the development trend of informatization and digitization, the complexity of an IT system is increasingly improved, an IT framework is increasingly complex, and the current IT framework at least comprises: once a service fails, a large amount of even massive alarm/event information can be reported, so that maintenance personnel cannot accurately perform fault location in the presence of a large amount of alarm/event data, and the fault location has high quality requirements on the maintenance personnel and needs to be participated by personnel with abundant operation, maintenance and development experiences.

Root cause analysis based on association rule mining is an important method for diagnosing and locating faults in an IT system. Most existing fault analysis relies on threshold judgment: a threshold or threshold range is set on each KPI, and an alarm is reported when the threshold is exceeded. This kind of judgment is poorly suited to analyzing performance problems in an IT system. When user responses are slow because of the network or other reasons, a large number of response-time threshold alarms are reported. Because most of these faults do not affect users' use of the IT system, users regard the system as normal and neither pay attention nor complain. Once users do complain, however, there are too many performance alarms and the analysis is difficult.

Disclosure of Invention

The present invention provides a method and system for analyzing root causes of faults that overcomes or at least partially addresses the above-mentioned problems.

According to an aspect of the present invention, there is provided a method of analyzing a root cause of a fault, comprising:

S1, sorting the data in the data set according to the time attribute of each data item, and segmenting the data set according to a preset time window to obtain a plurality of sub data sets;

S2, acquiring frequent item sets and association rules in the data set according to the Apriori algorithm, wherein a frequent item set comprises a certain number of strongly associated data items;

S3, sorting the data in the frequent item set according to the time attribute, sequentially matching the earliest-ranked data against the accompanying alarm reason data prestored in the alarm reason database, removing the data if the matching succeeds and continuing to match the next item, and finally taking the earliest-ranked data that fails to match as the root cause of the data that is last in time order in the frequent item set;

the data set comprises alarm data of each domain in the IT system in a preset time range, error data in a log and abnormal performance data in the performance data set.

Preferably, the step S1 is preceded by:

acquiring alarm data and performance data of each domain in the IT system within the preset time range through an APM probe, and acquiring error data of log data in the IT system through a log acquisition party;

screening abnormal performance data in the performance data by adopting a mean value and a multiple variance;

and forming the alarm data of each domain in the IT system, the error data in the log and the abnormal performance data in the preset time range into the data set.

Preferably, the step of screening abnormal performance data in the performance data by using the mean and the multiple variance specifically includes:

sorting the performance data in the performance data set according to a preset rule, and taking the performance data of the median as a mean value;

and calculating the variance of the performance data, filtering out the performance data with the numerical value in the range from the mean value to 3 times of the variance, and taking the residual performance data as the abnormal performance data.

Preferably, the step S2 specifically includes:

counting the support degrees of all data in the data set, then sorting the data from high to low to obtain a candidate 1-item set, removing the data which is less than the minimum support degree in the candidate 1-item set, and obtaining a frequent 1-item set;

using the layer-by-layer search technique of the Apriori algorithm until a frequent m-item set is obtained that satisfies the following conditions: the frequent m-item set is not empty and its (m-1)-subsets are frequent, m is no greater than the number of data items in the largest sub data set, and the frequent (m+1)-item set is empty;

listing all items of the frequent m-item set, and generating association rules according to the Apriori algorithm.

Preferably, the step S3 is followed by: and displaying the association rule and the root reason.

Preferably, the IT system comprises one or more of the following domains: services, networks, applications, databases, external interfaces, containers, virtual machines, and physical storage.

According to another aspect of the present invention, there is also provided a system for analyzing a root cause of a fault, including:

the segmentation module is used for sequencing according to the time attribute of each data in the data set and segmenting the data set according to a preset time window to obtain a plurality of groups of sub data sets;

the association module is used for acquiring a frequent item set and an association rule in a data set according to an Apriori algorithm, wherein the frequent item set comprises a certain amount of data with strong association;

the root cause searching module is used for sequencing according to the time attribute of the data in the frequent item set, sequentially matching the data in the front of the sequence with the accompanying alarm cause data prestored in the alarm cause database, removing the data from the data set if the matching is successful, continuing to match the next item, and finally taking the data which is unsuccessfully matched and is sequenced in the front as the root cause of the data at the end of the time sequence in the frequent item set;

Preferably, the system further includes a data set obtaining module, where the data set obtaining module specifically includes:

the collection unit is used for acquiring alarm data and performance data of each domain in the IT system within the preset time range through the APM probe and acquiring error data of log data in the IT system through a log acquisition party;

the screening unit is used for screening abnormal performance data in the performance data by adopting a mean value and a multiple variance;

and the aggregation unit is used for forming the alarm data of each domain in the IT system, the error data in the log and the abnormal performance data in the performance data into the data set within the preset time range.

Preferably, the screening unit is specifically configured to:

sorting the performance data in the performance data set according to a preset rule, and taking a median as a mean value;

Preferably, the association module is specifically configured to:

Drawings

FIG. 1 is a flow chart of a method for analyzing a root cause of a fault according to an embodiment of the present invention;

FIG. 2 is a functional block diagram of a system for analyzing root causes of faults according to an embodiment of the present invention.

Detailed Description

The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.

In order to overcome the above problems in the prior art, an embodiment of the present invention provides a method for analyzing a root cause of a fault. The design concept of the method is as follows: alarms are caused by some root fault, and the root fault causes other alarms, called accompanying alarms, to be generated together, producing an alarm storm. The earliest fault data therefore needs to be found from the time sequence. According to the association rule mining method, a plurality of fault data items that meet the minimum support and are associated with one another are obtained as a frequent item set. The earliest-ranked fault data in the frequent item set is compared with a preset alarm reason database; if the comparison succeeds, that fault data is the root cause data, and if it fails, the next-ranked fault data is compared with the alarm reason database in turn until the matching succeeds. Practical tests show that the method provided by the embodiment of the invention can find valuable alarm association rules and root causes accurately and quickly, and provides decision support for system maintenance personnel.

Specifically, fig. 1 shows a flowchart of a method for analyzing a root cause of a fault according to an embodiment of the present invention, where as shown in the figure, the method includes:

101. sorting according to the time attribute of each data in the data set, and segmenting the data set according to a preset time window to obtain a plurality of groups of sub data sets; the data set comprises alarm data of each domain in the IT system in a preset time range, error data in a log and abnormal performance data in the performance data set.

It should be noted that the method first collects fault data within a certain time range, including alarm data of each domain, error data in logs, and abnormal performance data in a performance data set. In the IT field, an error is a fault that has occurred, for example, a network error is raised when the network is disconnected; an alarm is an unprocessed error, that is, an alarm means that an error has occurred but the IT system has not yet handled it. After data cleaning, the fault data are stored in a database to form a data set for subsequent problem tracing and root cause location.

As is well known to those skilled in the art, log data has levels, such as the info level, the debug level, and the error level, and in the embodiment of the present invention the log category is determined by the level. Debug-level data is the lowest-level log data and is generally not output during actual operation of the system. Info-level log data feeds the current state of the system back to the end user, so the information output at this level should have practical meaning to the end user, i.e., the end user should be able to understand what it means. In a sense, the information output at the info level can be regarded as part of the software product (as is the text on the interactive interfaces). Error-level data, i.e., error data, records problems that may be repairable but after which the system cannot be guaranteed to work normally: at a later stage the system may suffer an unrecoverable error (e.g., downtime) because of the current problem, or it may keep working until it is shut down without any serious problem.

In a computer system, each data item has a time attribute, which indicates the start time, end time, etc. of the data. The data set is sorted according to the start time of each data item, which gives the order in which the data (i.e., the faults) were generated; the data set is then segmented according to time windows, so that the data are assigned to sub data sets corresponding to different time windows. For example, to analyze the 120 minutes of data from 10:00 to 12:00 on a given day, the data are divided into 12 groups with a window granularity of 10 minutes, and each group contains a number of abnormal performance data and alarm/error data. The concept of the embodiment of the invention is as follows: when performance problems occur, many problems may appear at the same time, but most of them stem from a few core problems, and the core problems are found by performing probability analysis on the segmented sets. The data mining method provided by the embodiment of the invention helps to find the data with a higher occurrence probability among a large amount of data.
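The sorting and windowing of step 101 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `split_into_windows`, the `(start_time, item)` record layout, and the sample records are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def split_into_windows(records, start, window=timedelta(minutes=10)):
    """Sort records by start time and group them into fixed-size time windows.

    `records` is a list of (start_time, item) pairs (a hypothetical layout).
    Windows that contain no record are simply omitted from the result.
    """
    windows = defaultdict(set)
    for ts, item in sorted(records, key=lambda r: r[0]):
        index = (ts - start) // window  # integer index of the window
        windows[index].add(item)
    # Return the sub data sets in chronological window order.
    return [windows[i] for i in sorted(windows)]

# Hypothetical fault records between 10:00 and 12:00.
start = datetime(2018, 2, 23, 10, 0)
records = [
    (datetime(2018, 2, 23, 10, 7), "CPU occupancy high"),
    (datetime(2018, 2, 23, 10, 3), "memory usage high"),
    (datetime(2018, 2, 23, 10, 12), "response duration high"),
    (datetime(2018, 2, 23, 11, 59), "CPU occupancy high"),
]
sub_data_sets = split_into_windows(records, start)
```

Each returned set plays the role of one transaction in the association analysis that follows; here the first window holds the memory and CPU items, the second the response-duration item, and the last window the final CPU item.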

102. Acquiring a frequent item set and an association rule in a data set according to an Apriori algorithm, wherein the frequent item set comprises a certain amount of data with strong association;

It should be noted that the Apriori algorithm is a representative algorithm for association rule mining and is used to mine the frequent item sets of Boolean association rules. A frequent item set, as the name suggests, is a set of items that frequently appear together in a data set. The design concept of the embodiment of the invention is that data with a strong association are more likely to stand in the relationship between a root alarm and its accompanying alarms.

103. And sequencing according to the time attribute of the data in the frequent item set, sequentially matching the data in the front sequence with the accompanying alarm reason data prestored in the alarm reason database, if the matching is successful, removing the data, continuing to match the next item, and finally taking the data which is unsuccessfully matched and is in the front sequence as the root reason of the data with the last time sequence in the frequent item set.

It should be noted that a frequent item set is a set of several strongly associated data items, for example, the frequent item set (high memory usage, high CPU usage). Sorting the data in this frequent item set by time attribute shows that the high memory usage occurs before the high CPU usage, and the high memory usage is then matched against the possible root cause and accompanying alarm reason data prestored in the alarm reason database. The alarm reason database is an operation and maintenance management database created by the embodiment of the invention; it stores alarm data and the related accompanying alarms, and can be updated by operation and maintenance personnel according to actual requirements. In the embodiment of the invention, the data are sorted from earliest to latest according to the time attribute, so the first-ranked data is the earliest. If this data is successfully matched with the accompanying alarm reason data, it is removed from the frequent item set, and the data that is now first-ranked is matched next, and so on until data that cannot be matched is found.
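One possible reading of the matching loop in step 103 is sketched below. The function name `find_root_cause`, the `(start_time, name)` pairs, the use of a plain set in place of the alarm reason database, and the sample items (including "disk latency high") are hypothetical and serve only to illustrate the control flow.

```python
def find_root_cause(frequent_items, accompanying_causes):
    """Return the earliest item that does not match a known accompanying alarm.

    `frequent_items` is a list of (start_time, name) pairs taken from one
    frequent item set; `accompanying_causes` stands in for the accompanying
    alarm reason data of the alarm reason database. Both are assumptions.
    """
    ordered = sorted(frequent_items, key=lambda item: item[0])
    for _, name in ordered:
        if name in accompanying_causes:
            continue  # matched an accompanying alarm: drop it, try the next item
        return name   # first unmatched item is taken as the root cause
    return None       # every item matched, so no root cause was identified

# Hypothetical frequent item set; start times are minutes into the window.
items = [
    (2, "CPU occupancy high"),
    (1, "memory usage high"),
    (0, "disk latency high"),
]
accompanying = {"disk latency high"}
root = find_root_cause(items, accompanying)  # "memory usage high"
```

In this sketch the earliest item matches the accompanying alarm data and is skipped, so the next item in time order is reported as the root cause.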

On the basis of the above embodiment, step 101 further includes a process of acquiring a data set, specifically, the process includes:

001. Alarm data and performance data of each domain in the IT system within a preset time range are acquired through the APM probe, and error data in the log data of the IT system are acquired through a log collection method.

Table 1 shows the various domains in the IT system and the corresponding performance data that needs to be collected according to an embodiment of the present invention.

TABLE 1 Domain and Performance data sheet for IT systems

002. And acquiring a performance data set in the IT system within a preset time range, and screening abnormal performance data in the performance data by adopting the mean value and multiple variance.

It should be noted that, because the performance data is linear, under normal conditions, the performance data steadily tends to a straight line, and when an abnormality occurs, the performance data may have an uneven curve, so the embodiment of the present invention screens the abnormal performance data by using the median and the multiple variance.

003. And forming a data set by alarm data of each domain in the IT system, error data in the log and abnormal performance data in the performance data within a preset time range.

On the basis of the above embodiment, the step of screening abnormal performance data in the performance data by using the mean and the multiple variance specifically includes:

sorting the performance data in the performance data set according to a preset rule (for example, the sequence from big to small), and taking a median as a mean value;

It should be noted that the embodiments of the present invention use the median in place of the mean because the median is less sensitive to abnormal values. For example, for the values 1, 2, 3, 4, 100, the mean is 22 but the median is 3 (the middle value), so the median is clearly the more reasonable baseline for finding abnormal values.

And calculating the variance of the performance data, filtering out the performance data with the numerical value in the range from the mean value to 3 times of the variance, and taking the residual performance data as abnormal performance data.
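The screening step can be sketched as follows, under one stated assumption: "the range from the mean value to 3 times of the variance" is read here as "within 3 times the variance of the median". The function name `screen_abnormal` and the sample values are hypothetical. The text specifies the variance, and the sketch follows it literally; a band based on the standard deviation is a common alternative that keeps the threshold in the same unit as the data.

```python
from statistics import median, pvariance

def screen_abnormal(values, multiple=3):
    """Keep only values lying outside median +/- multiple * variance."""
    center = median(values)                # median used in place of the mean
    band = multiple * pvariance(values)    # population variance of the data
    return [v for v in values if abs(v - center) > band]

# Hypothetical CPU utilisation samples expressed as fractions of 1.
cpu_usage = [0.50, 0.52, 0.49, 0.51, 0.50, 0.95]
abnormal = screen_abnormal(cpu_usage)  # [0.95]
```

Values inside the band are filtered out as normal, and the remainder is passed on as abnormal performance data.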

On the basis of the foregoing embodiment, step 102 specifically includes:

all items of the m-frequent item set are listed, and association rules are generated according to Apriori algorithm.

It should be noted that abnormal performance data are themselves discrete, non-linear data points. In general, the data in most applications are generated by one or more programs that reflect the functions of the system. When an underlying application runs abnormally, abnormal performance data are generated, and being able to find such data quickly and efficiently is very valuable. In IT systems, problems cascade: the root cause occurs first and cannot heal by itself, so other problems arise together with it and form an alarm storm. Therefore, when performing association analysis on abnormal performance data, the embodiment of the present invention carries out the following association rule process:

finding all frequent item sets, i.e., item sets whose occurrence frequency is not less than the minimum support; and generating strong association rules from the frequent item sets, where the rules must satisfy the minimum support and the minimum confidence.

Specifically, the probability that memory usage and CPU occupancy are both higher than their preset thresholds is calculated, that is, the number of times the two problems occur together in a data set is divided by the total number of abnormal performance data records in the data set. For example: support({memory high} → {CPU occupancy high}) = number of times memory high and CPU occupancy high co-occur / number of data records = 3/5 = 60%.

Finding strong association rules

It should be noted that the previous step has already separated the data with higher probability from the massive performance data and abnormal performance data; the association rules are then strengthened through probability analysis. The embodiment of the present invention uses conditional probability analysis, for example, calculating the probability that CPU occupancy is high given that memory is high and, conversely, the probability that memory is high given that CPU occupancy is high. For example: number of times memory high and CPU occupancy high co-occur / number of times memory high occurs = 3/3 = 100%; number of times memory high and CPU occupancy high co-occur / number of times CPU occupancy high occurs = 3/4 = 75%.
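The support and conditional-probability figures quoted above can be reproduced with a short sketch. The five records below are a hypothetical data set chosen only so that memory high occurs 3 times, CPU occupancy high occurs 4 times, and the two co-occur 3 times; the helper names are likewise assumptions.

```python
# Hypothetical records (one set of abnormal items per time window).
records = [
    {"memory high", "CPU occupancy high"},
    {"memory high", "CPU occupancy high"},
    {"memory high", "CPU occupancy high"},
    {"CPU occupancy high"},
    {"response duration high"},
]

def support_count(itemset):
    """Number of records that contain every item of `itemset`."""
    return sum(1 for record in records if itemset <= record)

def support(antecedent, consequent):
    """Fraction of all records containing both sides of the rule."""
    return support_count(antecedent | consequent) / len(records)

def confidence(antecedent, consequent):
    """Conditional probability of the consequent given the antecedent."""
    return support_count(antecedent | consequent) / support_count(antecedent)

mem, cpu = {"memory high"}, {"CPU occupancy high"}
print(support(mem, cpu))     # 0.6
print(confidence(mem, cpu))  # 1.0
print(confidence(cpu, mem))  # 0.75
```

The asymmetry between the last two values is what gives a rule its direction: memory high always comes with CPU occupancy high in this data, but not the other way round.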

To better understand the Apriori algorithm employed by embodiments of the present invention, the basic concept of the Apriori algorithm is first explained:

1. Item set and K-item set

Let I = {i1, i2, i3, …, id} be the set of all items (i.e., data) in the data set, and T = {t1, t2, t3, …, tN} be the set of all transactions (i.e., time windows), where each transaction ti contains a set of items that is a subset of I. In association analysis, a set containing 0 or more items is called an item set. If an item set contains K items, it is called a K-item set. The empty set is the item set that contains no items. For example, {CPU occupancy high, response duration high, memory usage high} is a 3-item set in one example of the invention. Table 2 shows a data set of the embodiment of the present invention, where TID1 denotes the sub data set corresponding to the first time window; as can be seen from Table 2, TID1 contains two items: CPU high and response duration high.

TABLE 2 data set Table

2. Support count

An important property of an item set is its support count, i.e., the number of transactions that contain the item set. Mathematically, the support count σ(X) of an item set X can be expressed as:

σ(X) = |{ti | X ⊆ ti, ti ∈ T}|

where the symbol |·| denotes the number of elements in a set. In the embodiment described in Table 2, the support count of the item set {latency high, memory usage high, response duration high} is 2, since only transactions 3 and 4 contain all 3 items.

3. Association rules

An association rule is an implication expression of the form X → Y, where X and Y are disjoint item sets, i.e., X ∩ Y = ∅.

The strength of an association rule can be measured in terms of its support and confidence. Support determines how often a rule applies to a given data set, while confidence determines how frequently the items in Y appear in transactions that contain X.

The two measures, support(s) and confidence (c), are formally defined as follows:

s(X→Y) = σ(X∪Y)/N

c(X→Y) = σ(X∪Y)/σ(X)

where σ(X∪Y) is the support count of X∪Y, N is the total number of transactions, and σ(X) is the support count of X.

Example

In the embodiment described in Table 2, consider the rule {response duration high, memory usage high} → {latency high}. Since the support count of the item set {response duration high, memory usage high, latency high} is 2 and the total number of transactions is 5, the support of the rule is 2/5 = 0.4.

The confidence of the rule is the quotient of the support count of the item set {response duration high, memory usage high, latency high} and the support count of the item set {response duration high, memory usage high}. Since 3 transactions contain both response duration high and memory usage high, the confidence of the rule is 2/3 ≈ 0.67.

Association rule discovery

Given a set of transactions T, the association rule discovery refers to finding all rules with a support degree greater than or equal to minsup (minimum support degree) and a confidence degree greater than or equal to minconf (minimum confidence degree), where minsup and minconf are corresponding support degree and confidence degree thresholds.

The mining of association rules is a two-step process:

(1) Frequent item set generation: the goal is to find all item sets that meet the minimum support threshold (i.e., that occur at least as often as the predefined minimum support count); these are called frequent item sets.

(2) Rule generation: the goal is to extract all high-confidence rules, called strong rules, from the frequent item sets found in the previous step (the rules must meet both the minimum support and the minimum confidence).

In essence, the Apriori algorithm uses candidate sets to find frequent item sets. It is the most influential algorithm for mining the frequent item sets of Boolean association rules. The name of the algorithm reflects the fact that it uses prior knowledge of the properties of frequent item sets, as described below. Apriori uses an iterative approach called layer-by-layer search, in which k-item sets are used to explore (k+1)-item sets. First, the set of frequent 1-item sets is found and denoted L1. L1 is used to find the set of frequent 2-item sets L2, L2 is used to find L3, and so on, until no more frequent k-item sets can be found. Finding each Lk requires one scan of the database.

Apriori property: all non-empty subsets of a frequent item set must also be frequent. The Apriori property is based on the following observation: by definition, if an item set I does not meet the minimum support threshold s, then I is not frequent, i.e., P(I) < s. If an item A is added to I, the resulting item set (i.e., I ∪ A) cannot occur more frequently than I. Thus I ∪ A is not frequent either, i.e., P(I ∪ A) < s. This property belongs to a special class of properties called anti-monotone, meaning that if a set fails a test, all of its supersets fail the same test. It is called anti-monotone because the property is monotonic in the sense of failing the test.

For the Apriori algorithm, if a set is a frequent item set, then all of its subsets are frequent item sets. For example, assume the set {memory high, CPU high} is a frequent item set, i.e., the number of records in which memory high and CPU high occur together is greater than or equal to the minimum support min_support; then the occurrence counts of its subsets {memory high} and {CPU high} must also be greater than or equal to min_support, i.e., its subsets are frequent item sets. Conversely, if a set is not a frequent item set, then none of its supersets is a frequent item set. For example, assume the set {memory high} is not a frequent item set, i.e., memory high occurs fewer than min_support times; then any superset of it, e.g., {memory high, CPU occupancy high}, must also occur fewer than min_support times, and therefore no superset of it is a frequent item set.

The key to the Apriori algorithm is how to find Lk with Lk-1, which consists of the following two-step process:

Join step: to find Lk, a set of candidate k-item sets is generated by joining Lk-1 with itself. This set of candidates is denoted Ck. Let l1 and l2 be item sets in Lk-1. The notation li[j] denotes the j-th item of li (e.g., l1[k-2] denotes the second-to-last item of l1). For convenience, it is assumed that the items within a transaction or item set are sorted in lexicographic order. The join Lk-1 ⋈ Lk-1 is performed, where elements of Lk-1 are joinable if their first (k-2) items are the same; that is, elements l1 and l2 of Lk-1 are joinable if (l1[1] = l2[1]) ∧ (l1[2] = l2[2]) ∧ … ∧ (l1[k-2] = l2[k-2]) ∧ (l1[k-1] < l2[k-1]). The condition (l1[k-1] < l2[k-1]) simply ensures that no duplicates are generated. The item set resulting from joining l1 and l2 is l1[1] l1[2] … l1[k-1] l2[k-1].

Prune step: Ck is a superset of Lk; that is, its members may or may not be frequent, but all frequent k-item sets are contained in Ck. The database is scanned to determine the count of each candidate in Ck, and thereby Lk (by definition, all candidates whose count is not less than the minimum support count are frequent and thus belong to Lk). However, Ck can be very large, so the amount of computation involved can be large. To reduce the size of Ck, the Apriori property is used as follows: no infrequent (k-1)-item set can be a subset of a frequent k-item set. Thus, if any (k-1)-subset of a candidate k-item set is not in Lk-1, the candidate cannot be frequent and can be deleted from Ck. This subset test can be done quickly using a hash tree of all the frequent item sets.
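The join and prune steps can be sketched together as a single candidate-generation function. The name `apriori_gen`, the representation of item sets as lexicographically sorted tuples, and the sample item sets are assumptions for illustration; a Python set lookup is used for the subset test in place of the hash tree mentioned above.

```python
from itertools import combinations

def apriori_gen(prev_frequent):
    """Build candidate k-item sets from the frequent (k-1)-item sets.

    Each item set is a tuple whose items are in lexicographic order, so the
    join condition can be checked by comparing tuple prefixes.
    """
    prev = sorted(prev_frequent)
    prev_set = set(prev)
    candidates = []
    for i, l1 in enumerate(prev):
        for l2 in prev[i + 1:]:
            # Join step: first k-2 items equal, last item of l1 < last item of l2.
            if l1[:-1] == l2[:-1] and l1[-1] < l2[-1]:
                candidate = l1 + (l2[-1],)
                # Prune step: every (k-1)-subset must itself be frequent.
                if all(sub in prev_set
                       for sub in combinations(candidate, len(l1))):
                    candidates.append(candidate)
    return candidates

# Hypothetical frequent 2-item sets over items named I1..I5.
l2_sets = [("I1", "I2"), ("I1", "I3"), ("I1", "I5"),
           ("I2", "I3"), ("I2", "I4"), ("I2", "I5")]
c3 = apriori_gen(l2_sets)
# c3 == [("I1", "I2", "I3"), ("I1", "I2", "I5")]
```

Pruning before the database scan is the design choice that matters here: candidates such as (I1, I3, I5) are discarded without ever being counted, because (I3, I5) is not among the frequent 2-item sets.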

Generating association rules from a frequent set of items

Once the frequent item sets have been found from the transactions in database D, generating strong association rules from them is straightforward (strong association rules satisfy the minimum support and the minimum confidence). The confidence can be computed with the following equation, in which the conditional probability is expressed in terms of item set support counts: confidence(A → B) = P(B|A) = support(A∪B)/support(A), where support(A∪B) is the support count of A∪B and support(A) is the support count of A. Based on this equation, association rules can be generated as follows:

for each frequent item set l, generate all non-empty subsets of l;

for each non-empty subset s of l, if support(l)/support(s) ≥ min_conf, output the rule s → (l - s),

where min_conf is the minimum confidence threshold. Since the rules are generated from frequent item sets, each rule automatically satisfies the minimum support. The frequent item sets, together with their support counts, are stored in a hash table in advance so that they can be accessed quickly.
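The rule-generation procedure can be sketched as below. The function name `generate_rules`, the dictionary that stands in for the hash table of support counts, and the sample counts are assumptions; the sketch enumerates only proper non-empty subsets so that the consequent is never empty.

```python
from itertools import combinations

def generate_rules(support_counts, min_conf):
    """Return (antecedent, consequent, confidence) for every strong rule.

    `support_counts` maps each frequent item set (a frozenset) to its support
    count and must contain every non-empty subset of each frequent item set.
    """
    rules = []
    for itemset, count in support_counts.items():
        if len(itemset) < 2:
            continue  # a rule needs a non-empty antecedent and consequent
        for size in range(1, len(itemset)):
            for subset in combinations(sorted(itemset), size):
                s = frozenset(subset)
                conf = count / support_counts[s]  # support(l) / support(s)
                if conf >= min_conf:
                    rules.append((s, itemset - s, conf))
    return rules

# Hypothetical support counts for two items and their pair.
counts = {
    frozenset({"memory high"}): 3,
    frozenset({"CPU occupancy high"}): 4,
    frozenset({"memory high", "CPU occupancy high"}): 3,
}
rules = generate_rules(counts, min_conf=0.8)
```

With a minimum confidence of 0.8, only the rule {memory high} → {CPU occupancy high} (confidence 3/3) is returned; the reverse rule has confidence 3/4 and is dropped.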

The Apriori algorithm is described below with an example in which a data set has 9 time windows, i.e., 9 sub data sets, |D| = 9. The sub data set T1 contains data I1, I2 and I5; the sub data set T2 contains data I2 and I4; the sub data set T3 contains data I2 and I3; the sub data set T4 contains data I1, I2 and I4; the sub data set T5 contains data I1 and I3; the sub data set T6 contains data I2 and I3; the sub data set T7 contains data I1 and I3; the sub data set T8 contains data I1, I2, I3 and I5; the sub data set T9 contains data I1, I2 and I3.

I. Mining frequent item sets

1. In the first iteration of the algorithm, each item is a member of the set of candidate 1-item sets C1. The algorithm simply scans all transactions and counts the number of occurrences of each item.

2. Assume that the minimum support count is 2 (i.e., min_sup = 2/9 ≈ 22%). The set of frequent 1-item sets L1 can then be determined; it consists of the candidate 1-item sets that satisfy the minimum support.

3. To find the set of frequent 2-item sets L2, the algorithm uses L1 ⋈ L1 to generate the set of candidate 2-item sets C2.

4. The transactions in D are scanned and the support count of each candidate item set in C2 is calculated.

5. The set of frequent 2-item sets L2 is determined; it consists of the candidate 2-item sets in C2 that satisfy the minimum support.

6. The generation of the set of candidate 3-item sets C3 is detailed below. First, let C3 = L2 ⋈ L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}. According to the Apriori property, all non-empty subsets of a frequent item set must also be frequent, so we can determine that the last 4 candidates cannot be frequent. They are therefore deleted from C3, so that they need not be counted later when D is scanned to determine L3. Note that because the Apriori algorithm uses a layer-by-layer search technique, given a candidate k-item set we only need to check whether its (k-1)-subsets are frequent.

[Join process of L2 ⋈ L2 to generate C3]

1. Join: C3 = L2 ⋈ L2 = {{I1, I2}, {I1, I3}, {I1, I5}, {I2, I3}, {I2, I4}, {I2, I5}} ⋈ {{I1, I2}, {I1, I3}, {I1, I5}, {I2, I3}, {I2, I4}, {I2, I5}} = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}.

2. Pruning using the Apriori property: all non-empty subsets of a frequent item set must also be frequent.

The 2-item subsets of {I1, I2, I3} are {I1, I2}, {I1, I3} and {I2, I3}. All 2-item subsets of {I1, I2, I3} are elements of L2. Thus, {I1, I2, I3} is retained in C3.

The 2-item subsets of {I1, I2, I5} are {I1, I2}, {I1, I5} and {I2, I5}. All 2-item subsets of {I1, I2, I5} are elements of L2. Thus, {I1, I2, I5} is retained in C3.

The 2-item subsets of {I1, I3, I5} are {I1, I3}, {I1, I5} and {I3, I5}. {I3, I5} is not an element of L2 and is therefore infrequent. Thus, {I1, I3, I5} is deleted from C3.

The 2-item subsets of {I2, I3, I4} are {I2, I3}, {I2, I4} and {I3, I4}. {I3, I4} is not an element of L2 and is therefore infrequent. Thus, {I2, I3, I4} is deleted from C3.

The 2-item subsets of {I2, I3, I5} are {I2, I3}, {I2, I5} and {I3, I5}. {I3, I5} is not an element of L2 and is therefore infrequent. Thus, {I2, I3, I5} is deleted from C3.

The 2-item subsets of {I2, I4, I5} are {I2, I4}, {I2, I5} and {I4, I5}. {I4, I5} is not an element of L2 and is therefore infrequent. Thus, {I2, I4, I5} is deleted from C3.

3. After pruning, C3 = {{I1, I2, I3}, {I1, I2, I5}}.

7. The transactions in D are scanned to determine L3, which consists of the candidate 3-item sets in C3 that satisfy minimum support.

8. The algorithm uses L3 ⋈ L3 to generate the set of candidate 4-item sets C4. Although the join yields {{I1, I2, I3, I5}}, this item set is pruned because its subset {I1, I3, I5} is infrequent. Thus C4 = ∅, and the algorithm terminates, having found all the frequent item sets.
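The layer-by-layer procedure walked through in steps 1 to 8 can be sketched as below. This is a simplified illustration under stated assumptions, not the patented implementation: the function name apriori and the parameter min_count are hypothetical, and the join is written as a plain union of pairs of frequent (k-1)-item sets, which after pruning yields the same candidates as the classical prefix-based join.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {frozenset: support_count} for all frequent item sets."""
    transactions = [frozenset(t) for t in transactions]

    # First pass: count each item and keep the frequent 1-item sets (L1).
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(current)

    k = 2
    while current:
        prev = set(current)
        # Join: unions of two frequent (k-1)-item sets that have exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # Scan: count support of the surviving candidates.
        current = {}
        for c in candidates:
            n = sum(1 for t in transactions if c <= t)
            if n >= min_count:
                current[c] = n
        frequent.update(current)
        k += 1
    return frequent

# The nine sub data sets T1..T9 from the example, minimum support count 2.
D = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
frequent = apriori(D, min_count=2)
largest = [sorted(s) for s in frequent if len(s) == 3]
print(sorted(largest))
```

On the example data the loop stops at k = 4, because the only candidate 4-item set is pruned, leaving {I1, I2, I3} and {I1, I2, I5} as the largest frequent item sets.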

On the basis of the above embodiments, the IT system comprises one or more of the following domains: services, networks, applications, databases, external interfaces, containers, virtual machines, and physical storage.

On the basis of the above embodiment, step 103 is further followed by: displaying the association rules and the root cause. It should be noted that displaying the association rules and the root cause makes it convenient to provide decision support to the operation and maintenance personnel.

Fig. 2 shows a functional block diagram of a system for analyzing a root cause of a fault according to an embodiment of the present invention. As shown in the figure, the system includes:

the segmentation module 201 is configured to sort the data sets according to the time attributes of the data in the data sets, and segment the data sets according to a preset time window to obtain a plurality of groups of sub data sets; the data set comprises alarm data of each domain in the IT system in a preset time range, error data in a log and abnormal performance data in the performance data set.

It should be noted that the segmentation module of the system first collects fault data within a certain time range, including alarm data of each domain, error data in the logs, and abnormal performance data in the performance data set. (In the IT field, an error is a fault event; for example, a network error is reported when the network is disconnected. An alarm is an unhandled error; that is, an alarm indicates that an error has occurred and that the IT system has not yet handled it.) After data cleaning, these fault data are stored in a database to form a data set for subsequent problem tracing and root cause localization.

In a computer system, each piece of data has time attributes, which indicate its start time, end time, and so on. The data set is sorted according to the start time of each piece of data, which yields the order in which the data (i.e., the faults) were generated. The data set is then segmented according to the time windows, so that each piece of data is assigned to the sub data set corresponding to its time window. For example, to analyze the 120 minutes of data from 10:00 to 12:00 on a given day, the data are divided into 12 groups with a window granularity of 10 minutes, where each group contains several items of abnormal performance data and alarm data/error data. The concept behind the embodiment of the invention is as follows: when a performance problem occurs, many problems may appear at the same time, but most of them stem from a few core problems, and these core problems are found by performing probability analysis on the segmented sets. The data mining system provided by the embodiment of the invention helps to find the data with a higher probability of occurrence in a large amount of data.
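The sorting and windowing step described above might look like the following sketch. The function name segment_by_window, the record layout (start time, item name), and the sample alarm names are all assumptions made for illustration; the text does not prescribe a data format. Windows that contain no data are simply omitted here.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def segment_by_window(records, start, window):
    """Sort (start_time, item) records by time and group the item names into
    consecutive windows of length `window`, beginning at `start`."""
    buckets = defaultdict(set)
    for ts, item in sorted(records):
        # Integer index of the window this record falls into.
        index = (ts - start) // window
        buckets[index].add(item)
    # Return the non-empty windows in chronological order.
    return [buckets[i] for i in sorted(buckets)]

# Hypothetical fault records (names are illustrative only).
records = [
    (datetime(2018, 2, 23, 10, 12), "cpu_high"),
    (datetime(2018, 2, 23, 10, 1), "cpu_high"),
    (datetime(2018, 2, 23, 10, 15), "net_error"),
    (datetime(2018, 2, 23, 10, 3), "db_timeout"),
]
windows = segment_by_window(
    records,
    start=datetime(2018, 2, 23, 10, 0),
    window=timedelta(minutes=10),
)
print(windows)
```

Each resulting set plays the role of one sub data set (transaction) for the subsequent frequent item set mining.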

The association module 202 is configured to obtain a frequent item set and an association rule in a data set according to an Apriori algorithm, where the frequent item set includes a certain amount of data with a strong association relationship.

The root cause searching module 203 is configured to sort the data in the frequent item set according to their time attributes, sequentially match the earliest-ranked data against the accompanying alarm cause data pre-stored in the alarm cause database, remove the data if the matching succeeds and continue matching the next item, and finally take the earliest-ranked data that fails to match as the root cause of the data that is last in time order in the frequent item set.
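One possible reading of the root cause search is sketched below. It rests on several assumptions that the text does not spell out: the alarm cause database is modeled as a plain set of item names known to be accompanying alarms, a successful match means the item is such an accompanying alarm and is skipped, and the last item in time order is treated as the symptom rather than as a candidate. The function name find_root_cause and the alarm names are hypothetical.

```python
def find_root_cause(items, accompanying_alarms):
    """items: list of (start_time, name) pairs from one frequent item set.
    accompanying_alarms: names assumed to be stored in the alarm cause
    database as accompanying alarms.
    Returns (root_cause, symptom); root_cause is None if every earlier
    item matched the database."""
    if not items:
        raise ValueError("frequent item set is empty")
    # Sort by time attribute; the last item is the symptom to explain.
    ordered = [name for _, name in sorted(items)]
    symptom = ordered[-1]
    for name in ordered[:-1]:
        if name in accompanying_alarms:
            continue  # match succeeded: drop it and try the next item
        return name, symptom  # first item that fails to match
    return None, symptom

# Hypothetical frequent item set with integer timestamps for brevity.
items = [(3, "app_timeout"), (1, "link_flap"), (2, "switch_port_down")]
result = find_root_cause(items, accompanying_alarms={"link_flap"})
print(result)
```

In this illustration link_flap is skipped because it matches the assumed database, so switch_port_down is reported as the root cause of app_timeout.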

On the basis of the above embodiment, the system of the embodiment of the present invention further includes a data set acquisition module, where the data set acquisition module specifically includes:

On the basis of the above embodiments, the screening unit is specifically configured to:

On the basis of the foregoing embodiments, the association module is specifically configured to:

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods of the various embodiments or some parts of the embodiments.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A method of analyzing a root cause of a fault, comprising:

2. The method of claim 1, wherein the step S1 is preceded by:

3. The method of claim 2, wherein the step of screening the performance data for abnormal performance data using the mean and the multiple variance comprises:

4. The method according to claim 1, wherein the step S2 specifically includes:

using a layer-by-layer search technique according to the Apriori algorithm until a frequent m-item set is obtained that satisfies the following conditions: the frequent m-item set is not empty and its (m-1)-subsets are frequent, m is no greater than the number of data in the sub data set with the most data, and the (m+1)-item set is empty;

5. The method of claim 1, wherein the step S3 is further followed by: displaying the association rule and the root cause.

6. The method of claim 1, wherein the IT system comprises one or more of the following domains: services, networks, applications, databases, external interfaces, containers, virtual machines, and physical storage.

7. A system for analyzing a root cause of a fault, comprising:

the root cause searching module is used for sequencing according to the time attribute of the data in the frequent item set, sequentially matching the data which is sequenced at the front with the accompanying alarm cause data which is prestored in the alarm cause database, removing the data if the matching is successful, continuously matching the next data, and finally taking the data which is unsuccessfully matched and is sequenced at the front as the root cause of the data with the last time sequence in the frequent item set;

8. The system of claim 7, further comprising a dataset acquisition module, the dataset acquisition module specifically comprising:

9. The system of claim 8, wherein the screening unit is specifically configured to:

10. The system of claim 7, wherein the association module is specifically configured to: