GB2465860A - A directed graph behaviour model for monitoring a computer system in which each node of the graph represents an event generated by an application - Google Patents
- Publication number
- GB2465860A GB2465860A GB0913712A GB0913712A GB2465860A GB 2465860 A GB2465860 A GB 2465860A GB 0913712 A GB0913712 A GB 0913712A GB 0913712 A GB0913712 A GB 0913712A GB 2465860 A GB2465860 A GB 2465860A
- Authority
- GB
- United Kingdom
- Prior art keywords
- behaviour
- application
- occurrence
- work
- events
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0751—Error or fault detection not based on redundancy
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0706—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
- G06F11/0709—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/34—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
- G06F11/3447—Performance evaluation by modeling
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/34—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
- G06F11/3466—Performance evaluation by tracing or monitoring
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0681—Configuration of triggering conditions
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/069—Management of faults, events, alarms or notifications using logs of notifications; Post-processing of notifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/14—Network analysis or design
- H04L41/147—Network analysis or design for predicting network behaviour
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/16—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks using machine learning or artificial intelligence
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/34—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
- G06F11/3452—Performance evaluation by statistical analysis
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/34—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
- G06F11/3466—Performance evaluation by tracing or monitoring
- G06F11/3476—Data logging
Abstract
A set of at least one event generated by the application is received and stored in a database. The set of events corresponds to a path in the directed graph and is associated with an attribute, e.g. a process name, an instance ID or a customer type. The frequency of occurrence of such sets in the database is determined and a weight, being a function of the determined frequency, is assigned to the corresponding path 1140-1152 in the graph. The probability of occurrence of a node in the path is determined as a function of the weight. A computer system comprising an application may be monitored according to the behaviour model; a message generated by the application corresponds to an event in the graph, and the likelihood of occurrence of the message is determined and compared with a predefined threshold. The state of the computer system, anomalous or not, is determined as a function of the comparison.
Description
METHOD AND SYSTEM FOR DETECTING AND PREDICTING ANOMALOUS
SITUATIONS IN A COMPUTER SYSTEM
Field of the Invention
The present invention relates to a method and system for monitoring components in an IT system, and more particularly for detecting, as early as possible, IT systems which exhibit anomalous behaviour.
Background of the Invention
Today IT systems consist of a huge number of components that interact with each other to carry out the desired business. These components may be applications, services or devices.
During system operation, one or more components may exhibit a behaviour that deviates from what is expected. This behaviour is an anomalous behaviour that needs to be detected and handled. The anomalous behaviour is also called a situation or an exception. This process of detecting and handling anomalous behaviour is called anomalous behaviour management, situation management or exception management.
Traditional systems use two approaches to detect and handle any deviation from the expected behaviour. In a first approach, users manually employ sophisticated analysis techniques to detect significant situations, investigate their root causes and then take the appropriate corrective action. The main problem with this approach is that situations are detected after their occurrence, and not while the business is performing. A second approach for situation management depends on domain experts to define criteria for the detection of the anomalous behaviour. These criteria are usually encoded as condition-action rules which are used by the monitoring system to automatically detect and handle significant situations. The main problem with this approach is that it assumes a priori knowledge of the anomalous behaviour and therefore does not detect hidden, potentially more critical situations. For example, US patent application 2005/0267765A1, filed by Jun-Jang Jeng et al, "Apparatus and method for policy driven business process exception handling", provides an exception management framework in the business process domain. The framework allows users to define exception policies in a declarative manner. US patent 6604093, filed by Opher Etzion et al, "Situation awareness system", provides a method for situation management that allows users to define complex events using event composition operators. The main problem with this approach is that it only covers obvious situations, and not hidden, potentially more critical situations. US patent application 2003/0149604A1, filed by Fabio Casati, "Exception analysis, prediction and prevention method and system", proposes a method that uses data mining techniques to generate classification rules that distinguish normal from exceptional process instances. But this method depends on labelled process instances to train the classifier, and therefore can only detect previously known exceptions.
Moreover, the classification rules do not encode the dynamic behaviour of the process instances, and thus do not detect process instances which exhibit an exceptional sequence of events.
Summary of the Invention
According to a first aspect of the present invention, there is provided a method for generating a behaviour model, said behaviour model comprising a directed graph, each of the nodes of said directed graph representing an event generated by an application, said method comprising the steps of:
- receiving a set of at least one event generated by said application, said set corresponding to a path in said directed graph and being associated with an attribute;
- storing said set associated with said attribute in a database;
- determining the frequency of occurrence in said database of said set associated with said attribute;
- assigning a weight to said corresponding path in said directed graph, said weight being a function of the frequency so determined; and
- determining a probability of occurrence of a node in said path as a function of said weight.
An advantage of this aspect is that the obtained behaviour model accurately represents the likelihood of occurrence of a particular set of events, including those which do not occur often.
This model can be further updated while in use, taking into account the latest received events.
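For illustration only, the frequency counting, path weighting and node probabilities of this first aspect could be sketched as follows in Python. This is a simplified dictionary-based stand-in, not the WFST-based implementation described later in the embodiment, and all names are hypothetical:

```python
from collections import Counter

def learn_behaviour_model(event_sets):
    """Count how often each event sequence (path) occurs, derive a weight
    per path from its relative frequency, and derive a probability per
    node as the total weight of the paths that contain it."""
    freq = Counter(tuple(s) for s in event_sets)        # frequency of each path
    total = sum(freq.values())
    # Weight of a path = its relative frequency in the stored sets.
    path_weight = {path: n / total for path, n in freq.items()}
    # Probability of a node = sum of weights of all paths containing it.
    node_prob = Counter()
    for path, w in path_weight.items():
        for event in set(path):
            node_prob[event] += w
    return path_weight, dict(node_prob)
```

For example, if the sequence ("a", "b", "c") appears in three of four stored sets, its path weight is 0.75, and a node appearing in every path gets probability 1.0.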
In a first development of the first aspect, the method further comprises the steps of:
- identifying a first group of at least one said set of events in said database associated with said attribute, the total probability of occurrence of said first group being greater than a first predefined value;
- identifying a second group of at least one said set of events in said database associated with said attribute, the total probability of occurrence of said second group being smaller than a second predefined value; and
- determining the difference between the probability of occurrence of the least likely set of events of the first group and the probability of occurrence of the most likely set of events of the second group.
An advantage is that the two groups defined by this method represent the normal and the anomalous events or situations in a distinct manner. This representation depends on the actual behaviour of the component involved and is thus tailored to each situation. It is much more accurate than predefined thresholds arbitrarily set for any situation.
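A minimal sketch of how the two groups and the gap separating them might be computed from per-path probabilities. The function name and the two predefined values (`p_normal`, `p_rare`) are illustrative assumptions, not part of the claimed method:

```python
def split_groups(path_prob, p_normal=0.9, p_rare=0.05):
    """First group: the most likely paths whose total probability exceeds
    p_normal. Second group: the least likely paths whose total probability
    stays below p_rare. Also returns the gap between the probability of the
    least likely normal path and that of the most likely rare path."""
    ranked = sorted(path_prob.items(), key=lambda kv: kv[1], reverse=True)
    normal, acc = [], 0.0
    for path, w in ranked:                 # take likely paths until total > p_normal
        normal.append((path, w))
        acc += w
        if acc > p_normal:
            break
    rare, acc = [], 0.0
    for path, w in reversed(ranked):       # take unlikely paths while total < p_rare
        if acc + w >= p_rare:
            break
        rare.append((path, w))
        acc += w
    gap = normal[-1][1] - rare[-1][1] if rare else None
    return normal, rare, gap
```

The gap can then serve as the basis of the monitoring threshold mentioned in the third development of the second aspect below.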
In a second aspect of the present invention, there is provided a method for monitoring a computer system, said computer system comprising an application, said application being configured to generate a message, said message being compliant with a predefined message structure, said method comprising the steps of:
- receiving a message generated by said application;
- identifying a behaviour model, wherein one event in the directed graph of said behaviour model corresponds to said received message generated by said application;
- determining a likelihood of occurrence of said message, as a function of said behaviour model so identified; and
- deciding whether said computer system is in an anomalous state, as a function of a comparison between said likelihood of occurrence and a predefined threshold.
An advantage of this second aspect is that problems in the behaviour of an IT system can be detected early. Furthermore, the method leverages lessons learned from past behaviours which have evolved into problematic situations.
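A simplified sketch of the run-time decision, assuming the learnt model is available as a dictionary mapping event paths to probabilities; the function name and the threshold value are illustrative assumptions:

```python
def is_anomalous(observed, path_weight, threshold=0.02):
    """Look up the likelihood of the observed event sequence in the learnt
    model; a likelihood below the threshold is judged anomalous."""
    likelihood = path_weight.get(tuple(observed), 0.0)
    return likelihood < threshold, likelihood
```

A sequence never seen during learning gets likelihood 0.0 and is therefore always flagged, while a rarely seen sequence is flagged only if its likelihood falls below the chosen threshold.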
In a first development of the second aspect, the predefined message structure is the Common Base Event schema.
An advantage is that it leverages a well-known and open standard for defining log events.
In a second development of the second aspect, said computer system comprises a second application, configured to generate a second message, compliant with said message structure, and the method comprises the further step of:
- receiving a set of at least a first and a second event generated respectively by said application and said second application, said set corresponding to a path in said directed graph and being associated with an attribute.
An advantage is that, in modern complex computer systems, situations can arise not only within one computer system but also in how different components of an IT system cooperate.
The ability to monitor behaviour comprising events generated by different components is then critical to detect and predict these situations correctly.
In a third development of the second aspect, the predefined threshold is determined as a function of the difference of occurrence of the two groups determined during the learning phase.
An advantage is that this takes into account the specifics of the behaviour of the IT system being monitored.
In a fourth development of the second aspect, the method for learning a behaviour model is further provided.
An advantage is that both methods can cooperate to provide the best results.
According to a third aspect of the present invention, there is provided an apparatus comprising means adapted for carrying out each step of the method according to the first aspect of the invention.
An advantage is that this apparatus can be obtained very easily, thus making the method easy to execute.
According to a fourth aspect of the present invention, there is provided a computer program comprising instructions for carrying out the steps of the method according to the first aspect of the invention when said computer program is executed on a computer.
An advantage is that the invention can easily be reproduced and run on different computer systems.
According to a fifth aspect of the present invention, there is provided a computer readable medium having encoded thereon a computer program according to the fourth aspect of the invention.
An advantage is that this medium can be used to easily install the method on various apparatus.
Further advantages of the present invention will become clear to the skilled person upon examination of the drawings and detailed description. It is intended that any additional advantages be incorporated therein.
Brief Description of the Drawings
Embodiments of the present invention will now be described by way of example with reference to the accompanying drawings, in which like references denote similar elements, and in which:
Figure 1 shows a system in which the present invention can be implemented;
Figure 2 shows a high level process with the main steps in an implementation of the invention;
Figure 3 shows a high level process of the learning phase in an implementation of the invention;
Figure 4 shows a hierarchical clustering of units of work;
Figure 5 shows the search space for frequent patterns in an implementation of the invention;
Figure 6 shows a sample Weighted Finite State Transducer (WFST) that represents the behaviour of a unit of work;
Figure 7 shows an overview of the data model in an implementation of the present invention;
Figure 8 shows a method for learning a WFST associated with a cluster, in an implementation of the present invention;
Figure 9 shows a finite state transducer representing all the legal sequences of events in a behaviour template, in an implementation of the present invention;
Figure 10 shows a finite state transducer representing the behaviour of a unit of work, in an implementation of the present invention;
Figure 11a shows the finite state transducer representing all the legal sequences of events, each legal sequence having an arbitrarily small weight, in an implementation of the present invention;
Figure 11b shows the union of all the transducers representing a possible sequence of events in the behaviour template, in an implementation of the present invention;
Figure 11c shows the resulting transducer with the probability of occurrence of each path, in an implementation of the present invention;
Figure 11d shows the WFST with the weights pushed toward the start, in an implementation of the present invention;
Figure 12 shows a method for monitoring executed units of work using WFST and measuring their normality, in an implementation of the present invention;
Figures 13a and 13b illustrate the computation of the measure of normality of a sequence of events, in an implementation of the present invention;
Figure 14 shows a method for monitoring running units of work using WFST and measuring their normality, in an implementation of the present invention; and
Figure 15 illustrates the computation of the threshold for the normality measure, in an implementation of the present invention.
Detailed Description of the Preferred Embodiment
Figure 1 shows a system in which the present invention can be implemented, comprising:
- a set of IT components (101, 105, 107) configured to generate common events (102, 106, 108);
- a common event infrastructure (110);
- a database of units of work (115);
- a database of behaviour templates (127);
- a clustering engine (120);
- a database of units of work clusters (125);
- a behaviour models learning engine (130);
- a database of learnt behaviour models (135); and
- a monitoring engine (140).
The common events (102, 106, 108) generated by the set of IT components (101, 105, 107) are collected by the common event infrastructure (110). Based on the behaviour templates stored in the database of behaviour templates (127), the common event infrastructure (110) organises the collected events (102, 106, 108) into units of work. These units of work are then stored in the database of units of work (115). The clustering engine (120) organises the units of work (115) into units of work clusters (125). The behaviour models learning engine (130) generates the behaviour models from the database of behaviour templates (127) and the database of units of work clusters (125) and stores them in the database of learnt behaviour models (135). The monitoring engine (140) collects sequences of events from the common event infrastructure (110) and analyzes them against the database of learnt behaviour models (135) to decide whether a monitored IT component (101, 105 or 107) is in an anomalous state or not.
In an embodiment of the present invention, a method for automatically detecting and predicting anomalous behaviour in IT systems is proposed. The method first allows system users to define a set of behaviour templates for a specific set of system components which encode the contexts during which the behaviour of these components needs to be monitored.
These behaviour templates are then stored in the database of behaviour templates (127). The method then automatically learns the normal behaviour of the components from the events log stored in the database of units of work (115). The clustering engine (120) and the behaviour models learning engine (130) are involved during that step. The clustering engine (120) analyzes the units of work logged in the database of units of work (115) and classifies them according to their characteristics. The behaviour models learning engine (130) then analyzes them to determine which behaviour appears frequently and which does not. Finally, the monitoring engine (140) classifies as anomalous any deviation in the behaviour of the running components (101, 105, 107) from the normal behaviour. The proposed method and system are based on the observation that "given a large log of the events emitted by the system components, the dominant behaviour in this log characterizes the normal behaviour of these components". The behaviour templates encode the contexts during which the behaviour of the system components needs to be monitored. Each context is a part of the components' behaviour that represents a unit of work in the business domain. The unit of work is the span of system or business events in a certain period of time that can be identified as a specific system or business task. Examples of units of work include business processes executed by a process execution server, Web services executed by a Web server, and payment transactions executed by a set of financial applications. The behaviour of a component or a set of components in a specific unit of work is defined to be the sequence of events (102, 106, 108) emitted by the associated components (101, 105, 107) within this unit of work. This behaviour depends on the attributes that describe the unit of work; e.g. the behaviour of an instance of a sales order process depends on the customer type, the product type, and the quantity of items.
This means that "units of work of the same type that have similar attributes have similar sequences of events" and "units of work of the same type that have attributes similar to the normal units of work but have different sequences of states/events are anomalous"; e.g., customers with credit lines usually get their sales orders accepted, and hence it is anomalous to see customers with credit lines who get their orders rejected.
Figure 2 shows a high level process with the main steps in an implementation of the invention. This process comprises three main phases: -a setup phase (203); -a learning phase (206); and -a run-time phase (209).
During the setup phase (203), a user defines (210) behaviour template specifications (212) which are stored in the database of behaviour templates (127). The behaviour templates (215) for a specific set of system components encode the contexts during which these components are required to be monitored. The definition of a behaviour template involves the specification of a set of attributes in the common base events (102, 106, 108).
Each context represents a unit of work in the business domain. Examples of units of work include business processes executed by a process execution server, Web services executed by a Web server, and payment transactions executed by a set of financial applications. The behaviour template is specified for a specific set of system components by defining the following set of attributes in the Common Base Events (CBE):
1. Attributes that describe the type of the unit of work (UoW Type Attributes);
2. Attributes that identify the unit of work (UoW Identification Attributes);
3. Attributes that describe the type of events associated with the unit of work (Event Type Attributes);
4. Attributes that describe the objects associated with the unit of work (UoW Data Attributes).
For example, in the business process monitoring domain, the component to be monitored can be a business process management system and the units of work to be monitored are the business process instances. The following table describes a specification of behaviour template attributes in the business process domain.
| Behaviour template specification attributes | Description | Example attribute name (business process domain) | Example attribute values |
|---|---|---|---|
| Unit of work type attributes | Describe the type of the unit of work | Process name | Sales order process |
| Unit of work identification attributes | Identify the unit of work | Process instance ID | Sales order process No 0001 |
| Event type attributes | Describe the type of the events associated with the unit of work | Activity name | Check credit rating |
| Unit of work data attributes | Describe the data associated with the unit of work | Business data | {customer type, product type, product quantity} |

A second example, in the web service domain:

| Behaviour template specification attributes | Example attribute name (web service domain) | Example attribute values |
|---|---|---|
| Unit of work type attributes | Web service name | Place order |
| Unit of work identification attributes | Web service instance ID | Place order No 0001 |
| Event type attributes | Condition | Object in stock |
| Unit of work data attributes | Business data | {customer type, product type, product quantity} |

In the setup phase, the method allows system users to define a set of behaviour templates for a specific set of system components which encode the contexts during which these components are required to be monitored. The definition of a behaviour template involves the specification of a set of attributes in the Common Base Events (CBE).
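For illustration only, the four CBE attribute categories of a behaviour template could be captured as a small data structure. The attribute names used here are hypothetical placeholders, not the CBE schema's actual field names:

```python
from dataclasses import dataclass

@dataclass
class BehaviourTemplate:
    """Names of the event attributes that define a unit of work."""
    uow_type_attrs: list    # e.g. ["processName"]
    uow_id_attrs: list      # e.g. ["processInstanceId"]
    event_type_attrs: list  # e.g. ["activityName"]
    uow_data_attrs: list    # e.g. ["customerType", "productType", "quantity"]

    def uow_key(self, event):
        """Group events into a unit of work by its type and identity attributes."""
        return tuple(event[a] for a in self.uow_type_attrs + self.uow_id_attrs)
```

Events sharing the same `uow_key` would belong to the same unit of work, while the data attributes drive the clustering described in the learning phase.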
During the learning phase (206), the activity 'learn behaviour models' (220) is executed. The clustering engine (120) uses hierarchical clustering techniques to organize the space of all units of work into a tree of clusters. Each level of the tree represents a degree of similarity.
Then the learning engine (130) receives the events log (222) from the database of units of work clusters (125). For each cluster, a dynamic probabilistic model is learnt that represents the sequences of events for the units of work that belong to that cluster. The learning engine (130) relies on the behaviour templates (215) defined in the setup phase (203); to model the normal behaviour of a component or a set of components in a specific context, the method has to learn a behaviour model for each set of units of work of a specific type that have similar attributes.
The learning phase involves two steps: grouping similar units of work into clusters, and learning a behaviour model for each cluster.
After the learning phase (206), the monitoring engine (140) monitors the behaviour of running units of work by analyzing the events emitted by the running components (232) against the learnt behaviour models (225) determined during the learning phase (206) to detect and predict anomalous behaviour (230). Each unit of work belongs to one or more clusters. The method identifies these clusters, and continuously calculates, for each cluster, the probability that a unit of work with the associated attributes exhibits the observed behaviour. This probability represents a measure of normality for the associated unit of work. If this measure goes below a predefined threshold, the system will alert the user that the associated unit of work is anomalous and that it has detected an anomalous behaviour pattern (235). During the run-time phase, the method uses the attributes of running units of work to identify the clusters to which they belong, and uses the learnt models for these clusters to judge the execution of these units of work. The method classifies as anomalous any deviation from the learnt behaviour in any of these clusters. The used model allows the detection of anomalous patterns in executed units of work and the prediction of anomalous patterns in the running units of work.

Figure 3 shows a high level process of the learning phase in an implementation of the invention, comprising:
- a pre-processing activity (300);
- a clustering of units of work activity (310);
- a learning the behaviour of clusters activity (320); and
- a storing in the repository of behaviour models activity (330).
The pre-processing activity (300) takes as inputs the events log (222) and the behaviour templates specifications (212). The clustering of units of work activity (310) takes as inputs the events of units of work (305) generated by the pre-processing activity (300). The learning the behaviour of clusters activity (320) takes as input the clusters of instances (315) and generates learnt behaviour models (225). These learnt behaviour models (225) are then stored in the repository of behaviour models (330).
During the learning phase, the method learns a set of behaviour models for each behaviour template. As a pre-processing step (300), the method first extracts from the events log the common base events associated with the components covered by the behaviour template.
The method then uses the specification of the behaviour template to extract the units of work and the sequences of events associated with each unit of work (310). The learning process then proceeds by learning the behaviour models for each type of unit of work (320). The learning process involves two steps: grouping the units of work of a specific type into clusters based on the values of the data attributes (unit of work data attributes), and learning a common behaviour for each cluster. During the learning phase the method monitors the different IT components and captures the events generated by these components. The behaviour models can combine events generated by different IT systems in one model. Events generated by a system can correspond to the detection by the system of an error, an update of a table in a database, or the execution of a particular activity in a business process. Events can also correspond to a security event generated by a system. The generation of a model correctly representing a sequence of events is the responsibility of the user. These models are organised into units of work. These units of work are then classified into clusters in an embodiment of the present invention. The clustering of units of work is described in more detail with respect to figure 4.
Figure 4 shows a hierarchical clustering of units of work based on their properties or attributes. The space of units of work can be organised in different overlapping clusters:
- a cluster can represent the set of all processes (400);
- it can be the set of units of work related to a particular customer (410, 430) or to a particular product (420);
- a cluster can also represent the set of units of work related to a particular customer and product (440); etc.
The method divides the past units of work instances into overlapping clusters based on their properties/attributes. Each cluster represents instances with similar properties/attributes. The method employs a hierarchical clustering technique that organizes the units of work instances into a tree of clusters as shown in figure 4.
The method defines multiple degrees of similarity between unit of work instance properties/attributes. Two instances are perfectly similar if they have the same values for all properties/attributes, and they are weakly similar if they have the same value for only one property/attribute. Each level of the tree represents a degree of similarity. The root node represents all the instances of the process. Each node on the ith level represents instances that have the same values for i attributes. If the instances have n attributes, there will be n similarity degrees and n levels of clusters.
Each cluster is assigned a weight that represents its support in the training instances. This weight affects the evaluation of executed and running instances that belong to that cluster.
Deviations in the behaviour of frequent clusters are more important than deviations in the behaviour of rare clusters.
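The overlapping clusters and their support weights described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the attribute names and the nested-tuple cluster keys are hypothetical.

```python
from itertools import combinations

def build_clusters(instances, attributes):
    """Group past unit of work instances into overlapping clusters;
    a cluster is keyed by an attribute subset and the shared values,
    and its weight is its support among the training instances."""
    total = len(instances)
    counts = {}
    for size in range(len(attributes) + 1):
        for subset in combinations(attributes, size):
            for inst in instances:
                key = (subset, tuple(inst[a] for a in subset))
                counts[key] = counts.get(key, 0) + 1
    return {key: c / total for key, c in counts.items()}

instances = [
    {"customer": "C1", "product": "p1"},
    {"customer": "C1", "product": "p2"},
    {"customer": "C2", "product": "p2"},
]
weights = build_clusters(instances, ["customer", "product"])
# The root cluster (empty subset) has weight 1; "customer=C1" has 2/3.
```

Level i of the resulting tree corresponds to clusters keyed by i attributes, and the weight of a frequent cluster is higher, matching the remark that deviations in frequent clusters matter more.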
The problem with the clustering process is the huge space that needs to be explored. This makes tree construction computationally expensive. The method reduces the clusters space as follows: a. The method groups the values of continuous attributes into ranges (e.g., quantity attribute values may be grouped into the following ranges 0-100, 100-10000, >10000). Each node represents a range of values instead of a single value. The ranges are extracted automatically from the histogram of all values seen in the past instances. It is also possible to group attributes with discrete values in the same cluster if they are not representative enough when taken in isolation.
b. The method prunes clusters with "low" support in the past units of work.
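Steps a and b above can be illustrated with the following sketch. The fixed range edges and the `prune` helper are hypothetical: the patent derives the ranges automatically from a histogram of past values rather than from hard-coded edges.

```python
def bin_quantity(value, edges=(100, 10000)):
    """Step a: map a continuous quantity attribute onto one of the
    ranges 0-100, 100-10000, >10000 used as cluster node values."""
    if value <= edges[0]:
        return "0-100"
    if value <= edges[1]:
        return "100-10000"
    return ">10000"

def prune(cluster_support, min_support):
    """Step b: drop clusters whose support in past units of work
    is below a minimum support threshold."""
    return {c: s for c, s in cluster_support.items() if s >= min_support}
```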
An implementation using a frequent pattern mining algorithm performing hierarchical clustering in an efficient way is described with respect to figure 5.
Figure 5 shows the search space for frequent patterns according to the main embodiment. The search space can be organized as a tree of pattern structures: -with a root node (500) representing all the patterns and all the past units of work; -with first level nodes (510, 520, 530) representing patterns and past units of work having one attribute or property at a certain value (for instance customer = C1); -with second level nodes (540) representing patterns and past unit of work instances having two attributes at certain values (for instance customer = C1 and product = p2); and so on for higher levels (550).
As the number of possible patterns is extremely large, it is desirable to optimize the search space to make the clustering process more efficient. To that end, a frequent pattern mining algorithm can be used to discover clusters whose support is greater than a predefined threshold in an efficient manner. An example of a frequent pattern mining algorithm is presented in the document by Haixun Wang, Chang-Shing Perng, Sheng Ma, Philip S. Yu, "Mining Associations by Pattern Structure in Large Relational Tables", Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM'02). Each cluster is represented by a pattern of instance properties/attributes (e.g., Customer = c1, Product = p2). The frequent pattern algorithm performs a level-wise search that organizes the search space as a tree of pattern structures. Each node (500, 510, 540, etc.) in the search space maintains a list of the patterns it represents along with the support of each pattern in past unit of work instances.
The algorithm reduces the search space based on an extended "downward closure property" between parent and child pattern structures in the search tree (e.g., if patterns of structure {Customer, Product} (540) are not frequent, patterns of structure {Customer, Product, Quantity} (550) are not frequent either). This improves the performance significantly, especially in handling high dimensional properties/attributes.
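The level-wise search with downward-closure pruning can be sketched as follows. This is an Apriori-style illustration under stated assumptions, not the algorithm of Wang et al.; all names are hypothetical.

```python
from itertools import combinations

def frequent_pattern_structures(instances, attributes, min_support):
    """Level-wise search over pattern structures (attribute subsets).
    A structure is kept only if at least one concrete pattern built on
    it is frequent; candidate structures whose sub-structures were all
    pruned are never expanded (downward closure)."""
    total = len(instances)
    result = {}
    candidates = [(a,) for a in attributes]   # level 1: single attributes
    while candidates:
        frequent = []
        for struct in candidates:
            counts = {}
            for inst in instances:
                key = tuple(inst[a] for a in struct)
                counts[key] = counts.get(key, 0) + 1
            patterns = {k: c / total for k, c in counts.items()
                        if c / total >= min_support}
            if patterns:                      # structure has frequent patterns
                result[struct] = patterns
                frequent.append(struct)
        # Build level k+1 candidates by joining frequent level-k structures,
        # then prune any candidate with a non-frequent sub-structure.
        merged = sorted({tuple(sorted(set(a) | set(b)))
                         for a in frequent for b in frequent
                         if len(set(a) | set(b)) == len(a) + 1})
        candidates = [c for c in merged
                      if all(s in result
                             for s in combinations(c, len(c) - 1))]
    return result
```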
The organization of the search space shown in figure 5 is similar to the tree structure used to represent the clusters as described with respect to figure 4. Each node in the search space is described by a subset of properties/attributes and represents all clusters described by the values of these properties/attributes (e.g., clusters described by Customer, Product attributes).
The problem of mining the search space is a common problem in data mining techniques, in particular in association rule learning. Other algorithms can be used to organize the search space, such as the K-optimal pattern discovery algorithm, sequence mining or itemset mining techniques, the Apriori algorithm, or the FP-Growth algorithm.
Figure 6 shows a sample Weighted Finite State Transducer that represents the behaviour of a unit of work, comprising: -at least one starting node (600); -at least one ending node (620), which are represented in figure 6 with two concentric circles; -any number of intermediate nodes (605); -links (610) between the nodes; -activities (635, 630) and probabilities (637, 633) associated with the links.
Each intermediate node (605) has at least one incoming link and at least one outgoing link. A starting node (600) has at least one outgoing link and an ending node (620) has at least one incoming link. The probabilities of the outgoing links (637, 633) of a node must sum to one, as the WFST represents all the possible sequences of events for a given unit of work. It is possible for a link to loop back to the same state or to a previous state.
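A minimal data structure honouring the sum-to-one constraint described above might look like the following sketch. The event labels and the dictionary encoding are hypothetical, for illustration only.

```python
class WFST:
    """Minimal sketch: each state keeps a list of outgoing arcs
    (event label, weight, destination state)."""

    def __init__(self):
        self.arcs = {}        # state -> [(label, weight, dest), ...]
        self.finals = set()   # ending nodes, e.g. (620)

    def add_arc(self, src, label, weight, dest):
        self.arcs.setdefault(src, []).append((label, weight, dest))

    def outgoing_weights_sum_to_one(self, tol=1e-9):
        # The outgoing probabilities of every state with arcs must sum
        # to one, since the WFST covers all possible event sequences.
        return all(abs(sum(w for _, w, _ in arcs) - 1.0) <= tol
                   for arcs in self.arcs.values())

t = WFST()
t.add_arc(0, "order_received", 1.0, 1)
t.add_arc(1, "check_stock", 0.45, 2)
t.add_arc(1, "reject_order", 0.55, 3)
t.finals.update({2, 3})
```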
Each event represented in figure 6 has been generated by a monitored IT system. Although not represented in figure 6, data can be exchanged between different monitored IT systems. An event corresponding to this data exchange can be generated or not. All events generated by the IT systems may not be represented in the weighted finite state transducer and can have been previously filtered by the method according to the invention. Weighted Finite State Transducers (WFSTs) are used to represent the behaviour of a cluster of units of work. The WFST shown in figure 6 is an example of a learnt behaviour model (225), output of the behaviour models learning engine (130). A WFST corresponds to a particular cluster node such as the node "customer=C1" (410) described with respect to figure 4. It is thus important to reduce the search space during the clustering phase, described with respect to figure 5, as it determines the number of WFSTs to manage. Each state in the WFST represents an event generated by an IT system and each transition represents a legal transition between two events. In the business domain a transition corresponds to an activity; in the web service domain, a transition corresponds to the execution of a web service. Each event is assigned a weight proportional to its frequency of occurrence in the past instances. The legal sequences of events are specified by the arc labels along each complete path, and their probabilities by the product of the corresponding transition probabilities. Figure 6 shows the WFST of a sample sales order process. The sales order process is a business process. Epsilons represent no-event transitions.
Whilst the use of WFSTs is preferred, other stochastic processes can be used to implement the main embodiment, such as Markov Chains, Hidden Markov Models, Dynamic Bayesian Networks, state-space models, Wiener processes, etc. Figure 7 shows an overview of the data model in an implementation of the main embodiment, comprising: -behaviour model definition (700); -completed units of work (710); -units of work clusters (720); -learnt WFSTs (730); and -ongoing units of work (740).
The behaviour models definition (700) is the starting point defining the set of legal state sequences. For a given behaviour models definition (700) there exists any number of completed or ongoing units of work (710, 740) that are generated from this behaviour models definition. These units of work (710, 740), based on some of their attributes, fall into some clusters (720). For a given cluster (720), there is at least one unit of work (710 or 740) with an attribute value corresponding to this cluster. As the clusters (720) can overlap, a unit of work can correspond to several clusters. Each cluster (720) is associated with a behaviour model, or WFST (730) in an implementation of the main embodiment, which maps onto the behaviour models definition (700) with extra information related to the probability that a transition occurs (for instance 637).
The data model shown in figure 7 can be implemented in various databases. Whilst a classic relational database can be used for storing this data, as events are often modelled in XML, a database with advanced XML functionalities can also be used. The exact list of fields for the data model can be derived easily by the person skilled in the art.
Figure 8 shows a method for learning a WFST associated with a cluster, in an implementation of the present invention. The method comprises the following sequence of steps: -for each unit of work, capture the sequence of events S (835); -construct a finite state transducer T for S (840); -assign weights proportional to sequence frequency (850); -union Ti, i = 1..N with Tall (860); -determinize (870); -push weights toward the start state (880); -add the WFSTs to the repository (890).
Learning the WFST associated with a cluster of units of work involves two steps: learning the sequences of events associated with each cluster and learning the weights associated with each event. The sequences of events in the WFST of a specific cluster can be learnt from the events log by taking the union of the event sequences of the units of work belonging to this cluster.
In some application domains, the set of possible sequences is known in advance. This domain specific knowledge can be useful in constructing the WFST. For example, in the business process domain, the method can use the process description to extract the legal sequences of events and use these sequences to construct a finite state transducer Tall. The finite state transducer extracted from the sample business process description is described in more detail with respect to figure 9.
The method uses Tall and the behaviours of completed units of work (710) that belong to some cluster (720) to learn a behaviour model (730) for that cluster. The method first captures (835) the flow of events executed by each instance in the cluster from the events log. The behaviour of each instance is represented (840) by a finite state transducer, Ti. A finite state transducer Ti that represents the sequence of events associated with a sample unit of work is described in more detail with respect to figure 10.
The method assigns (850) for each transducer Ti a weight proportional to the frequency of occurrence of the associated instances in the past. This weight represents a measure of normality for the transducer Ti. The method merges the finite state automata of past instances into a deterministic weighted finite state automaton as follows: -Assign for each legal sequence in Tall a small weight (a small measure of normality), an example of this step is described with respect to figure 11a; -Union (860) all transducers Ti, i = 1..N with Tall, an example of this step is described with respect to figure 11b; -Determinize (870) the resulting transducer, an example of this step is described with respect to figure 11c; and -Push the weights toward the start state (880), an example of this step is described with respect to figure 11d.
The above steps produce a WFST that contains all legal sequences that may be executed by the process. Each legal sequence is assigned a weight proportional to its frequency of occurrence in the past instances. The legal sequences in Tall that do not appear in the past instances are assigned a small normality measure compared to other sequences. The determinization step is required such that the resulting WFST has a single normality measure for each legal sequence. Pushing the weights towards the start state allows the method to predict anomalous behaviour patterns as discussed with respect to figure 12.
The resulting WFST represents the normal behaviour of the process learnt from information about past instances.
Not all the described steps are mandatory, and an implementation of the main embodiment can comprise fewer steps. Thus the step of pushing the weights toward the start state serves only to allow easier detection of anomalous processes, and this embodiment could be implemented without this step.
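The learning steps above can be sketched as follows. This is a simplified illustration, not a full WFST toolkit: a prefix tree stands in for the union and determinization steps (a trie is deterministic by construction), and weight pushing is rendered as the conditional probability of each event given its prefix, so the product of weights along a complete path equals the path's observed frequency.

```python
def learn_wfst(sequences):
    """Count each observed event sequence, merge all sequences into a
    deterministic prefix tree, and push weights toward the start state
    by storing P(event | prefix) on each arc."""
    counts = {}
    for seq in sequences:
        counts[tuple(seq)] = counts.get(tuple(seq), 0) + 1
    total = sum(counts.values())

    trie = {}
    for seq, c in counts.items():
        node = trie
        for ev in seq:
            entry = node.setdefault(ev, [0, {}])  # [count, children]
            entry[0] += c
            node = entry[1]

    def push(node, parent_count):
        # Arc weight = child count / parent count = P(event | prefix).
        return {ev: (cnt / parent_count, push(children, cnt))
                for ev, (cnt, children) in node.items()}

    return push(trie, total)
```

For instance, with three past instances of which two follow the path a, b and one follows a, c, the arc "a" gets weight 1.0 and the arcs "b" and "c" get weights 2/3 and 1/3.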
Figure 9 shows a finite state transducer representing all the legal sequences of events in a behaviour model definition, in an implementation of the main embodiment. The transducer shown in figure 9 corresponds to a behaviour template specification (212) and is the first input for learning the behaviour model (225) and eventually producing the WFST described with respect to figure 6.
Figure 10 shows a finite state transducer representing the behaviour of a unit of work, in an implementation of the main embodiment.
This sequence of states is a subset of all the possible sequences of steps, represented by a behaviour template description, with an example described with respect to figure 9.
Whilst figure 10 shows a linear sequence of states, it is possible to have loops in the process and hence to have one or several states appearing several times in a particular process instance.
Figure 11a shows the finite state transducer Tall representing all the legal sequences of events, each legal sequence being assigned an arbitrarily small weight, in an implementation of the main embodiment.
The goal is to determine the likelihood that a particular sequence of events is detected. The total number of different sequences is, if there are no loops, equal to the number of final states. In the first step of determining the behaviour model for a unit of work, all the sequences are a priori equally likely or unlikely. Hence all transitions or activities are assigned a small probability, without normalizing to 1 the transitions leaving from a given state.
Figure 11b shows the union of all the transducers representing a possible sequence of states in the unit of work, in an implementation of the main embodiment, comprising: -Tall (1101) with small probabilities for each sequence it represents, as described with respect to figure 11a; -the transducers Ti (1105, 1110, 1115) resulting from the analysis of the completed units of work.
Each transducer Ti has a weight proportional to the frequency of occurrence of the corresponding path. When a possible sequence of events doesn't occur, it has no corresponding Ti transducer and is given a small weight in the Tall transducer (1101).
For instance, the sequence of states (1105) was counted once out of a total of 80 process instances and is thus given the weight 0.0125 (1106). Similarly, the sequence of states (1110), counted 36 times, is given the weight 0.45 (1111); and the sequence of states (1115), counted 43 times, is given the weight 0.5375 (1116).
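The weight assignment above is plain frequency counting, as the following sketch shows; the sequence names are hypothetical labels for the transducers cited in the text.

```python
# Occurrence counts of the three observed sequences among 80 past
# process instances, matching transducers (1105), (1110) and (1115).
counts = {"seq_1105": 1, "seq_1110": 36, "seq_1115": 43}
total = sum(counts.values())          # 80 instances in total
weights = {name: c / total for name, c in counts.items()}
# weights["seq_1105"] == 0.0125, weights["seq_1110"] == 0.45,
# weights["seq_1115"] == 0.5375, and the three weights sum to 1.
```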
As the sequences in Tall (1101) are assigned an arbitrary weight, the sum of all weights is slightly above 1. This will be compensated in the step described with respect to figure 11d.
Figure 11c shows the resulting transducer with the probability of occurrence of each path, in an implementation of the main embodiment. With the unit of work exemplified with respect to figure 9, there are 5 possible paths: -a path corresponding to the sequence of events (1105) with state (1125) as final state with a probability of occurrence of 0.0125, similar to (1106); -a path corresponding to the sequence of events (1110) with state (1130) as final state with a probability of occurrence of 0.45, similar to (1111); -a path corresponding to the sequence of events (1115) with state (1135) as final state with a probability of occurrence of 0.5375, similar to (1116); and -two paths corresponding to sequences of events with no support in the completed units of work, with states (1121) and (1122) as final states, and with weights derived from the Tall weights.
Given the transducer shown in figure 11c, it is then possible to compute the likelihood of occurrence of any sequence of states a posteriori, i.e. once the complete sequence is executed.
Figure 11d shows the WFST with the weights pushed toward the start, in an implementation of the main embodiment. The transducer comprises: -a first transition (1140) which occurs in any legal sequence; -two transitions (1141 or 1142) which occur alternatively, with corresponding weights summing to 1; -after the state marked "2", following activity (1142), three activities are possible, two of which (1145, 1146) are not supported in past instances, hence with an arbitrary weight (0.00001), small in comparison with the activities supported in the past instances, whose weights add to 1; and a third activity (1147) with label epsilon to represent a no-event transition to the state marked "3"; -after the state marked "3", activities (1150) and (1152) have respectively weights 0.544304 and 0.455696, which sum to 1.
Thus, it is possible to know for each event the likelihood that it occurs, without waiting for the unit of work to finish. Detection and prediction of anomalous behaviour of an IT system using a WFST such as the one shown in figure 11d will be described in more detail with respect to figures 12 and 14.
The events weights are computed so that the probability that a path occurs, given by the product of the path activities weights, is equal to its occurrence frequency, as described with respect to figure 11b. Figure 12 shows a method for monitoring executed units of work using WFST and measuring their normality, in an implementation of the present invention. Given a completed unit of work named "i", the method comprises the steps of: -capturing (1200) the sequence of events of the next instance "i"; -constructing (1210) the corresponding finite state transducer Ti; -composing (1220) Ti with the learnt WFST to produce a transducer To; -calculating (1230) the likelihood as the sum of weights along the sequence of events in To; -checking (1240) whether the computed likelihood is above or below a threshold, and branching to the first step on a new unit of work instance if the likelihood is above, as the process instance is then normal; -reporting (1250) anomalous behaviour if the computed likelihood is below the threshold.
In the run-time phase, the method uses the learnt WFSTs to evaluate ongoing units of work and classifies any deviation from the learnt behaviour as anomalous. An implementation of the main embodiment allows the detection of anomalous patterns in completed units of work (off-line monitoring) and the prediction of anomalous patterns in ongoing units of work (on-line monitoring).
In off-line monitoring, the method shown in figure 12 uses the learnt WFSTs to evaluate executed unit of work instances. The method is repeated for each cluster.
To calculate the probability that an instance with the associated properties/attributes exhibits the observed behaviour using WFST operations, the observed behaviour of the executed units of work is represented as a finite state transducer, T, with input and output labels representing the observed events. The finite state transducer of the observed behaviour, T, is then composed with the learnt WFST. This step keeps only the sequence of events observed during the monitoring of the instance along with the weight of each event. The measure of normality is calculated as the product of weights along the sequence of events in the resulting WFST.
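The composition and likelihood computation can be sketched as a walk through the learnt model. This is a simplified stand-in for the WFST composition step, and the nested-dict model encoding (event mapped to an arc weight and a successor node) is hypothetical.

```python
def normality_measure(model, events):
    """Walk an executed instance's event sequence through the learnt
    model and return the product of the arc weights along the path."""
    likelihood = 1.0
    node = model
    for ev in events:
        if ev not in node:
            return 0.0          # sequence not representable at all
        weight, node = node[ev]
        likelihood *= weight
    return likelihood

def is_anomalous(model, events, threshold):
    """Off-line check (step 1240): anomalous if below the threshold."""
    return normality_measure(model, events) < threshold

# Hypothetical learnt model: event -> (arc weight, next node).
model = {"a": (1.0, {"b": (2 / 3, {}), "c": (1 / 3, {})})}
```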
Figures 13a and 13b show the result of the composition step for the above sample business process.
Computation of the threshold value used in step (1240) is described in more detail with respect to figure 15.
Figures 13a and 13b illustrate the computation of the measure of normality of a sequence of events, in an implementation of the main embodiment. Figure 13a shows an example of a transducer T associated with an executed process instance, as described with respect to figure 12. Figure 13b shows the result of the composition step (1220) described with respect to figure 12. The composed transducer comprises only the states of the executed instance with the weights of the learnt transducer (1141), described with respect to figure 11d.
Figure 14 shows a method for monitoring ongoing units of work using WFST and measuring their normality, in an implementation of the main embodiment. The method comprises the steps of: -capturing (1400) the next event from an ongoing unit of work; -checking (1410) whether it is the first event in the unit of work; -if it is, initializing (1420) the normality measure of the instance to zero; -searching (1430) the learnt WFSTs for the matched event; -updating (1440) the normality measure; -checking (1450) whether the normality measure is below a given threshold; -reporting (1460) an anomalous behaviour if it is, passing to the next event if not.
In on-line monitoring, the method predicts the anomalous behaviour before the completion of the unit of work instance. The computation of the normality measure is similar to that for the completed unit of work, the main difference being that the path doesn't stop at a final state.
Hence only the weights associated with already detected events will be taken into account to measure normality.
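The on-line variant can be sketched as follows, using the same hypothetical nested-dict model encoding as above. Because the weights were pushed toward the start state, the running product over already detected events already bounds the likelihood of any completion, which is what allows early reporting.

```python
def monitor_online(model, event_stream, threshold):
    """Prefix-based prediction: report an anomaly as soon as the
    running product of weights drops below the threshold, without
    waiting for the unit of work to reach a final state."""
    likelihood = 1.0
    node = model
    for i, ev in enumerate(event_stream):
        if ev not in node:
            return ("anomalous", i)   # event not legal after this prefix
        weight, node = node[ev]
        likelihood *= weight
        if likelihood < threshold:
            return ("anomalous", i)   # reported before the instance ends
    return ("normal", likelihood)

# Hypothetical learnt model: event -> (arc weight, next node).
model = {"a": (1.0, {"b": (0.9, {}), "c": (0.1, {})})}
```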
The method shown in figure 14 is repeated for each cluster.
The computation of the threshold value used in step (1450) is described in more detail with respect to figure 15.
Figure 15 illustrates the computation of the threshold for the normality measure, in an implementation of the main embodiment. Figure 15 shows a histogram comprising: -the probabilities of a path in a WFST as abscissa (1500); -the number of paths having each probability (1510); -a first group of paths (1520); -a second group of paths (1525); -a distance between the two groups (1530).
A WFST has a limited number of possible paths. For instance, the WFST shown in figure 11d has 5 possible paths. Each possible path has a probability of occurrence, determined by the analysis of the completed units of work. The computation of the likelihood of each path is described with respect to figure 11c.
In an implementation of this embodiment, the system uses the histogram of the probabilities of all paths to calculate a threshold value for each cluster as follows: 1. The system identifies two regions in the histogram such that -the sum of probabilities in one region (1520) is greater than a first threshold T1 (say 90%) (this means that the sum of probabilities in the other region (1525) is less than 1-T1); -the distance (1530) between the two regions is greater than a second threshold T2 (say 20%); where T1 and T2 are predefined values that can be tuned during the run-time phase based on feedback from users, adaptive analysis, number of paths, etc. 2. If these two regions do not exist, remove this cluster from the clusters space (this means that all paths in this cluster are normal).
3. Take the threshold value T between these two regions as follows: T = Max(Region2) + (Min(Region1) - Max(Region2)) * SUM(Prob of Region2). Choosing threshold values is a trade-off between early detection of anomalous behaviour and the probability of false alarms. High threshold values allow early detection of anomalous units of work but increase the probability of false alarms, while low threshold values decrease the probability of false alarms but cause late detection of anomalous units of work.
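The threshold formula of step 3 can be written down directly, as in the sketch below; the region values used for illustration are hypothetical.

```python
def threshold_from_regions(region1, region2):
    """T = Max(Region2) + (Min(Region1) - Max(Region2)) * SUM(Prob of Region2),
    where region1 holds the path probabilities of the frequent region
    (sum greater than T1) and region2 those of the rare region."""
    gap = min(region1) - max(region2)      # the distance (1530)
    return max(region2) + gap * sum(region2)

# Illustrative values loosely modelled on the 5-path sample WFST:
# three supported paths split across the two regions, two unsupported.
sample_t = threshold_from_regions([0.45, 0.5375], [0.0125, 1e-5, 1e-5])
```

The resulting T sits just above the rare region, so normal paths comfortably clear it while unsupported or rare paths fall below.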
Different scales, such as a logarithmic scale, can also be used for the abscissa, so as to compensate for too many paths having a probability value close to 0: when a transducer has a great number of possible paths, the likelihood of each path tends to be smaller as they must all sum to 1. The threshold values (T1, T2, T) can be adjusted taking into account the scale used for the probabilities axis (1500).
The method can also improve the learnt models by using feedback about normal and anomalous units of work with anomaly measures close to the threshold (i.e., units of work with low confidence). The method presents these units of work to domain experts and asks them to classify these instances as normal or anomalous. When the number of low confidence units of work exceeds a predefined threshold, the method repeats the learning process using the events logs of all completed units of work.
Another embodiment comprises a method for monitoring the components of an IT system, comprising the steps of detecting an event generated by a component, analyzing said event to identify the behaviour model to which it belongs, computing the likelihood of occurrence of said event based on said behaviour model and on one or more attributes associated with said event, determining whether a component of said IT system is in an anomalous state or not based on said likelihood of occurrence; and it provides a method for learning a behaviour model, comprising the steps of receiving a user defined behaviour template representing a sequence of events potentially generated by the components of said IT system, detecting events generated by a component of said IT system, determining the behaviour template to which said events belong, aggregating said events into a unit of work so as to reconstruct a path of said behaviour template, determining the frequency of occurrence of said path, taking into account one or more attributes associated with said events, organising the set of units of work into different clusters, and determining for said behaviour template the likelihood of occurrence of a particular path in said behaviour template as a function of the frequency of occurrence so determined, computing the likelihood of occurrence of a node in said behaviour template as a function of the likelihood of occurrence of all the paths in the behaviour template.
Yet another embodiment detects hidden potentially critical situations while the business is performing, and not after a particular unit of work has finished executing. Moreover the embodiment automatically learns the normal behaviour from the events log of the components, and does not depend on predefined rules nor on labelled data. The embodiment is not limited to a specific domain; it is a generic solution for anomaly detection that can be customised to detect anomalies in various domains such as business process monitoring, fraud detection, system and network management, autonomic computing, etc. To monitor the behaviour of heterogeneous components, it is required to access and understand log events of these components in a standardised way. An embodiment of the invention uses the Common Event Infrastructure technology and the Common Base Events model to provide such a way. The Common Base Events model is an open standard that defines an XML format for log messages. The Common Event Infrastructure is a framework for providing a consistent and unified interface for the creation, transmission, persistence and distribution of these Common Base Events.
The invention can take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment containing both hardware and software elements. In a preferred embodiment, the invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, etc. Furthermore, the invention can take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
The medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk. Current examples of optical disks include compact disk -read only memory (CD-ROM), compact disk -read/write (CD-R/W) and DVD.
A data processing system suitable for storing and/or executing program code will include at least one processor coupled directly or indirectly to memory elements through a system bus.
The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution.
Input/output or I/O devices (including but not limited to keyboards, displays, pointing devices, etc.) can be coupled to the system either directly or through intervening I/O controllers.
Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modem and Ethernet cards are just a few of the currently available types of network adapters.
Claims (10)
- CLAIMS1. A method for generating a behaviour model (225), said behaviour model (225) comprising a directed graph (212), each of the nodes of said directed graph representing an event generated by an application, said method comprising the steps of: -receiving a set of at least one event (222) generated by said application, said set corresponding to a path in said directed graph and being associated with an attribute; -storing said set associated with said attribute in a database; -determining the frequency of occurrence in said database of said set associated with said attribute; -assigning a weight to said corresponding path in said directed graph, said weight being a function of the frequency so determined (850); and -determining a probability of occurrence of a node in said path as a function of said weight (880).
- 2. The method of claim 1 further comprising the steps of: -identifying a first group of at least one said set of events in said database associated with said attribute, the total probability of occurrence of said first group being greater than a first predefined value; -identifying a second group of at least one said set of events in said database associated with said attribute, the total probability of occurrence of said second group being smaller than a second predefined value; and -determining the difference between the probability of occurrence of the least likely set of events of the first group and the probability of occurrence of the most likely set of events of the second group.
- 3. A method for monitoring a computer system, said computer system comprising an application, said application being configured to generate a message, said message being compliant with a predefined message structure, said method comprising the steps of: -receiving a message generated by said application (1400); -identifying a behaviour model according to claim 1 (1430), wherein one event in the directed graph of said behaviour model corresponds to said received message generated by said application; -determining a likelihood of occurrence of said message, as a function of said behaviour model so identified (1440); and -deciding whether said computer system is in an anomalous state, as a function of a comparison between said likelihood of occurrence and a predefined threshold (1450).
- 4. The method of claim 3 wherein said predefined message structure is the Common Base Event schema.
- 5. The method of claim 3 wherein said computer system comprises a second application, configured to generate a second message, compliant with said message structure, and comprising the further steps of: -receiving a set of at least a first and a second event generated respectively by said application and said second application, said set corresponding to a path in said directed graph and being associated with an attribute.
- 6. The method of claim 3 wherein said predefined threshold is determined as a function of said difference according to claim 2 and said first and second predefined values according to claim 2.
- 7. A method for monitoring a computer system, said computer system comprising an application, said application being configured to generate a message, said message being compliant with a predefined message structure, said method comprising the steps of: -generating a behaviour model (225), said behaviour model (225) comprising a directed graph (212), each of the nodes of said directed graph representing an event generated by an application, said step of generating a behaviour model comprising the further steps of: -receiving a set of at least one event (222) generated by said application, said set corresponding to a path in said directed graph and being associated with an attribute, -storing said set associated with said attribute in a database, -determining the frequency of occurrence in said database of said set associated with said attribute, -assigning a weight to said corresponding path in said directed graph, said weight being a function of the frequency so determined (850), and -determining a probability of occurrence of a node in said path as a function of said weight (880); -receiving a message generated by said application (1400); -identifying a behaviour model according to claim 1 (1430), wherein one event in the directed graph of said behaviour model corresponds to said received message generated by said application; -determining a likelihood of occurrence of said message, as a function of said behaviour model so identified (1440); and -deciding whether said computer system is in an anomalous state, as a function of a comparison between said likelihood of occurrence and a predefined threshold (1450).
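The model-building steps of claim 7 (record event sets as paths, count their frequency, turn frequencies into path weights, and derive node probabilities from those weights) can be sketched as a small weighted directed graph. This is a toy illustration under assumptions: the class name, the event strings, and the choice of normalising edge counts by total outgoing frequency are all mine, not the patent's.

```python
from collections import defaultdict

class BehaviourModel:
    """Toy weighted directed graph: nodes are event types, edge weights
    are observed transition frequencies, normalised into probabilities."""

    def __init__(self):
        self.edge_counts = defaultdict(int)   # (src, dst) -> frequency
        self.out_totals = defaultdict(int)    # src -> total outgoing frequency

    def add_path(self, events):
        """Record one observed set of events as a path in the graph,
        incrementing the weight of each traversed edge."""
        for src, dst in zip(events, events[1:]):
            self.edge_counts[(src, dst)] += 1
            self.out_totals[src] += 1

    def probability(self, src, dst):
        """Probability of occurrence of node `dst` after `src`, derived
        from the edge's weight relative to all edges leaving `src`."""
        total = self.out_totals.get(src, 0)
        if total == 0:
            return 0.0
        return self.edge_counts[(src, dst)] / total

model = BehaviourModel()
model.add_path(["login", "query", "logout"])
model.add_path(["login", "query", "query", "logout"])
model.add_path(["login", "error"])
print(round(model.probability("login", "query"), 2))  # 0.67
print(round(model.probability("login", "error"), 2))  # 0.33
```

The monitoring half of claim 7 then reuses these per-node probabilities as the likelihood that is compared against the predefined threshold.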
- 8. An apparatus comprising means adapted for carrying out each step of the method according to any one of the claims 1 to 7.
- 9. A computer program comprising instructions for carrying out the steps of the method according to any one of claims 1 to 7 when said computer program is executed on a computer.
- 10. A computer readable medium having encoded thereon a computer program according to claim 9.
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP08170668 | 2008-12-04 | | |
Publications (3)
| Publication Number | Publication Date |
|---|---|
| GB0913712D0 (en) | 2009-09-16 |
| GB2465860A true GB2465860A (en) | 2010-06-09 |
| GB2465860B GB2465860B (en) | 2011-01-12 |
Family
ID=41129709
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| GB0913712A Active GB2465860B (en) | 2008-12-04 | 2009-08-06 | Method and system for detecting and predicting anomalous situations in a computer system |
Country Status (1)
| Country | Link |
|---|---|
| GB (1) | GB2465860B (en) |
Cited By (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2012049014A1 (en) * | 2010-10-14 | 2012-04-19 | International Business Machines Corporation | Soft failure detection |
| AU2014240239B2 (en) * | 2013-10-11 | 2015-05-21 | Accenture Global Services Limited | Contextual graph matching based anomaly detection |
| CN107204868A (en) * | 2016-03-18 | 2017-09-26 | 中国移动通信集团山西有限公司 | A kind of task run monitoring information acquisition methods and device |
| EP3777067A1 (en) * | 2018-08-27 | 2021-02-17 | Huawei Technologies Co., Ltd. | Device and method for anomaly detection on an input stream of events |
Families Citing this family (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110633734B (en) * | 2019-08-22 | 2022-08-19 | 成都信息工程大学 | Method for anomaly detection based on graph theory correlation theory |
| CN113962614B (en) * | 2021-12-21 | 2022-04-15 | 深圳市迪博企业风险管理技术有限公司 | Intelligent examination method and device for business abnormity of listed company |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US5996090A (en) * | 1997-10-29 | 1999-11-30 | International Business Machines Corporation | Method and apparatus for quantitative diagnosis of performance problems using external representations |
| US6055492A (en) * | 1997-12-12 | 2000-04-25 | International Business Machines Corporation | System and method for providing trace information data reduction |
| US20030149604A1 (en) * | 2002-01-25 | 2003-08-07 | Fabio Casati | Exception analysis, prediction, and prevention method and system |
| US20050283680A1 (en) * | 2004-06-21 | 2005-12-22 | Fujitsu Limited | Apparatus, method, and computer product for pattern detection |
| US20060101308A1 (en) * | 2004-10-21 | 2006-05-11 | Agarwal Manoj K | System and method for problem determination using dependency graphs and run-time behavior models |
- 2009-08-06: GB application GB0913712A granted as patent GB2465860B (en), status Active
Patent Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US5996090A (en) * | 1997-10-29 | 1999-11-30 | International Business Machines Corporation | Method and apparatus for quantitative diagnosis of performance problems using external representations |
| US6055492A (en) * | 1997-12-12 | 2000-04-25 | International Business Machines Corporation | System and method for providing trace information data reduction |
| US20030149604A1 (en) * | 2002-01-25 | 2003-08-07 | Fabio Casati | Exception analysis, prediction, and prevention method and system |
| US20050283680A1 (en) * | 2004-06-21 | 2005-12-22 | Fujitsu Limited | Apparatus, method, and computer product for pattern detection |
| US20060101308A1 (en) * | 2004-10-21 | 2006-05-11 | Agarwal Manoj K | System and method for problem determination using dependency graphs and run-time behavior models |
Non-Patent Citations (2)
| Title |
|---|
| Computer Networks Vol. 51, No. 12, 19 June 2007, pages 3448 - 3470. * |
| Computers & Security Vol. 22, No. 7, 1 October 2003, pages 613 - 623. * |
Cited By (9)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2012049014A1 (en) * | 2010-10-14 | 2012-04-19 | International Business Machines Corporation | Soft failure detection |
| AU2014240239B2 (en) * | 2013-10-11 | 2015-05-21 | Accenture Global Services Limited | Contextual graph matching based anomaly detection |
| US9367809B2 (en) | 2013-10-11 | 2016-06-14 | Accenture Global Services Limited | Contextual graph matching based anomaly detection |
| US10592324B2 (en) | 2013-10-11 | 2020-03-17 | Accenture Global Services Limited | Contextual graph matching based anomaly detection |
| CN107204868A (en) * | 2016-03-18 | 2017-09-26 | 中国移动通信集团山西有限公司 | A kind of task run monitoring information acquisition methods and device |
| CN107204868B (en) * | 2016-03-18 | 2020-08-18 | 中国移动通信集团山西有限公司 | Task operation monitoring information acquisition method and device |
| EP3777067A1 (en) * | 2018-08-27 | 2021-02-17 | Huawei Technologies Co., Ltd. | Device and method for anomaly detection on an input stream of events |
| US12166779B2 (en) | 2018-08-27 | 2024-12-10 | Huawei Cloud Computing Technologies Co., Ltd. | Device and method for anomaly detection on an input stream of events |
| EP3777067B1 (en) * | 2018-08-27 | 2025-10-08 | Huawei Cloud Computing Technologies Co., Ltd. | Device and method for anomaly detection on an input stream of events |
Also Published As
| Publication number | Publication date |
|---|---|
| GB0913712D0 (en) | 2009-09-16 |
| GB2465860B (en) | 2011-01-12 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US8719190B2 (en) | Detecting anomalous process behavior | |
| Chen et al. | Semisupervised anomaly detection of multivariate time series based on a variational autoencoder | |
| CN103513983B (en) | method and system for predictive alert threshold determination tool | |
| US8676726B2 (en) | Automatic variable creation for adaptive analytical models | |
| US20170109657A1 (en) | Machine Learning-Based Model for Identifying Executions of a Business Process | |
| US11012289B2 (en) | Reinforced machine learning tool for anomaly detection | |
| EP3465509A1 (en) | Classification of log data | |
| US11055631B2 (en) | Automated meta parameter search for invariant based anomaly detectors in log analytics | |
| GB2465860A (en) | A directed graph behaviour model for monitoring a computer system in which each node of the graph represents an event generated by an application | |
| CN114978877B (en) | Abnormality processing method, abnormality processing device, electronic equipment and computer readable medium | |
| US20220414491A1 (en) | Automated resolution of over and under-specification in a knowledge graph | |
| EP4035084A1 (en) | Techniques for alerting metric baseline behavior change | |
| Chen et al. | Generative adversarial synthetic neighbors-based unsupervised anomaly detection | |
| US20170109640A1 (en) | Generation of Candidate Sequences Using Crowd-Based Seeds of Commonly-Performed Steps of a Business Process | |
| US20170109637A1 (en) | Crowd-Based Model for Identifying Nonconsecutive Executions of a Business Process | |
| CN114090377B (en) | Data monitoring method and device | |
| Lijun et al. | An intuitionistic calculus to complex abnormal event recognition on data streams | |
| Mello et al. | It incident solving domain experiment on business process failure prediction | |
| US20250358193A1 (en) | Anomaly detection based on metric monitoring criticality and metric independence | |
| Wang et al. | Supply Chain Anomaly Detection and Prediction Models Based on Large-Scale Time Series Data | |
| Li | Financial early warning system model combining hybrid semantic hierarchy with group method of data handling neural network for detection of banks’ risks | |
| CN118689945B (en) | Real-time monitoring method and system for data blood-edge relation transition | |
| Li et al. | Analytic model and assessment framework for data quality evaluation in state grid | |
| Zhao et al. | Efficient Anomaly Detection Algorithm for Operational Data Based on Fuzzy Cognitive Map | |
| 박민지 | Anomaly Detection in Sensor Data in an IoT Environment |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| 746 | Register noted 'licences of right' (sect. 46/1977) |
Effective date: 20130930 |