WO2022149149A1

WO2022149149A1 - Artificial intelligence with dynamic causal model for failure analysis in mobile communication network

Info

Publication number: WO2022149149A1
Application number: PCT/IN2021/050010
Authority: WO
Inventors: Monda SURAJIT; Sen AYAN; Jain DEVANSH; Bose SAYAN; Singh SHIKHA; Mani ARUNA
Original assignee: Telefonaktiebolaget LM Ericsson AB
Current assignee: Telefonaktiebolaget LM Ericsson AB
Priority date: 2021-01-05
Filing date: 2021-01-05
Publication date: 2022-07-14
Anticipated expiration: 2023-07-05

Abstract

Artificial intelligence (AI) in the area of failure analysis in a mobile communication network is realized by means of probabilistic causal inference performed over a Bayesian network formed with the help of a dynamic causal model comprising of both static features and influential dynamic features. Static features are identified from previously known network 5 failure signature and based on location of users within the network. Dynamic features are extracted as information element configurations from the first point of failure in signaling of abnormal session records and further reduced to a few of the most influential features with a score-based mechanism and presence-value pattern correlation.

Description

ARTIFICIAL INTELLIGENCE WITH DYNAMIC CAUSAL MODEL FOR FAILURE ANALYSIS

IN MOBILE COMMUNICATION NETWORK TECHNICAL FIELD

The present disclosure relates generally to failure analysis to determine the cause of network failures and, more particularly, to artificial intelligence with dynamic causal model for failure analysis in mobile communication network

BACKGROUND

End-user experience is one of the major factors that drives business growth for any mobile communication service provider in a given market. There are many factors that govern and influence the end-user experience. Quality of the products used in constructing the network, design of the network, product functionalities, available soft and hard infrastructures are few of the many factors. Despite considerable investment, no mobile communication network is 100% error free. End-user behavior, dependency on wireless coverage and associated issues, operational activities and challenges, issues with network configuration and unexpected behavior in product function can cause various type of failures in the network. Continuous optimization of the network is the standard operating procedure for handling such situation. Accurate failure analysis and problem identification leads to appropriate corrective action and is the key behind network optimization.

Multiple methods are in use for failure analysis in mobile communication network.

The most used methods are rule-based. These rules are acquired or generated based on theoretical understanding of the technology, technical product description and/or historical data. These rules are generally implemented or realized in form of if-then-else conditions with static thresholds or values on various performance metrics and configuration data.

Other popular techniques involve the usage of different types of classification models which are trained based on data labelled with known problems.

For example, an automatic method for root cause analysis of user disconnections has been presented by Ana Gomez-Andrades et al., in Automatic Root Cause Analysis Based On Traces For Lte Self-Organizing Networks which in particular details an approach based on traces to identify the cause of the release due to specific RF issues - coverage hole, cell edge performance issue, lack of dominant cells, interference and mobility issues. The result of the classification is assessed through a rule-based system. The described method uses various performance metrics and channel characteristics available from network traces. Traces are preferred choice of input data for failure analysis over node level performance data, due to the higher depth and granularity of available indicators.

KONG Qing-jun et al., in Network Fault Positioning Method, Device And Equipment And Medium describes a method where a signaling analysis is performed followed by classification of identified signaling reason further into a wireless reason using a predefined set of features.

Rana M. Khanafer et al., describes another novel method in Automated Diagnosis For Umts Networks Using Bayesian Network Approach for automated diagnosis in troubleshooting for UMTS networks using a Bayesian network (BN) approach. An automated diagnosis model is first described using the Naive Bayesian Classifier, which is later trained using performance counters and key performance indicators (KPIs) from real Universal Mobile Telecommunications Service (UMTS) networks.

In Wireless Signaling Analysis Method And Device, Computing Equipment And Storage Medium, LI Ji-guang describes a method for signaling analysis in wireless network using adirected acyclic graph (DAG) with a support vector machine (SVM) and addresses the challenges in a rule-based system. In this method, specific set of features are extracted from wireless signaling such as serving cell reference signal received power (RSRP), serving cell physical cell identity(PCI), an uplink signal-to-noise ratio(SINR), transmit power, power headroom report (PHR), Routing Information Protocol (RIP), RSRP of the strongest neighbor cell, the strongest neighbor cell’s PCI and whether the strongest neighboring cell and the serving cell are of the same frequency. These specific set of features are used by the SVM to classify a failed session in one of the category -uplink interference, downlink weak coverage, uplink coverage limitation, uplink instantaneous interference, neighbor cell deletion, PCI mode 3 interference, overlapping coverage, measurement and control parameter abnormity, a serving cell missing relation with the strongest neighbor cell and, a wireless network normal.

KIM Bo-Seop et al. , in Apparatus And Method For Analyzing Cause Of Network Failure, describes a method of learning the relationship between the cause of the failure and the alarms through a neural network and generates a probability vector for cause of failure of the equipment for failure event. A failure cause analysis unit is realized based on a second neural network that learns the relationship between the network tensor and the failure cause. The described method addresses the challenges in providing comprehensive judgment within short time considering not only the alarms generated by the network management system (NMS) but also the network topology according to the connection relationship of the network equipment and the number of nodes within the network.

In Optimizing Radio Cell Quality For Capacity And Quality Of Service Using Machine Learning Technique, Robert William Froehlich et al., describes a method for optimizing a radio access network through root cause analysis performed using an automated classification model based on the correlated plurality of network monitoring parameters. The classification model comprises of at least one Bayesian network model constructed using one or multiple decision trees which are again designed based on the rule set containing initial preset confidence factors (probabilities) according to domain experts.

Another proposed method by Robert William Froehlich in Knowledge Base Radio And Core Network Prescriptive Root Cause Analysis, involves network performance and configuration data associated with the identified network elements, is analyzed to identify one or more causes of the reported network failures. A root cause analysis of the reported network failures is performed using knowledge and statistical inference models for each of the identified causes to provide at least one recommendation for resolving the reported network failures like hardware or software faults, cell edge traffic density causing failures, etc. The method also describes the identification of control plane (CP) and user plane (UP) signaling issues based on pre-defined rules or signature. Yoshinori Watanabe et al. , in Communication Network Failure Cause Analysis System, Failure Cause Analysis Method, And Failure Cause Analysis Program describes a failure cause analysis system for estimating the cause of a failure in a communication network by extracting a statistical feature of the recorded contents at the time of occurrence of a failure followed by failure cause estimation based on the correspondence relation between a statistical feature of the recorded contents that is acquired at a time of occurrence of a past failure with a known failure cause and the statistical feature of the recorded contents that is acquired at the time of occurrence of the failure. The proposed method exploits time-based correlation of feature states or values with the failure event.

While these prior art techniques can be useful in identifying problems based on known causes, they are hindered by one or more of the following limitations.

1. Not all network failures demonstrate the behavior of time-based coherent variation in the form of statistical features, a pre-decided set like node performance, configuration data, etc.

2. The method attempts to identify some specific set of problems using various data sources like node performance data, configuration data, signaling data etc. However, in Optimizing Radio Cell Quality For Capacity And Quality Of Service Using Machine Learning Technique, some static predefined thresholds are used to design the decision trees and infer the possible cause of failure. These static pre-defined thresholds can be applied only for the well- known problems.

3. The method attempts to match the known symptoms observed through node performance data, signaling data etc. These methods attempt to use the data labelled with known problems only as reference for matching the symptoms or training a classification model. Also note that the selection of the features is purely based on the prior knowledge of the problem and the relation between features and the problems. 4. Alarms observed through the network management system (NMS) can detect only specific type of hardware or software fault. Various type of network outages can also be detected through alarm data. However, there are many network failures that remains undetected through network alarms. These alarms limit possible outcomes due to the adopted method of using network alarm data in conjunction with network topology data.

5. The method solely focuses on the radio link failures due to bad radio frequency (RF) conditions.

6. In methods dependent on a classification model, it is difficult to determine an explanation, which can be important for deciding the necessary corrective action.

Accordingly, there remains a need for failure analysis techniques that avoid one or more of these limitations.

SUMMARY

The present disclosure relates to a novel method for realization of artificial intelligence (Al) in the area of failure analysis in a mobile communication network by means of probabilistic causal inference performed over a Bayesian network formed with the help of a dynamic causal model comprising both static features and influential dynamic features. Static features are identified from previously known network failure signatures and based on the location of users within the network. Dynamic features are extracted as information element configurations from the first point of failure in signaling of abnormal session records and further reduced to a few of the most influential features with a score-based mechanism and presence-value pattern correlation. Thus, the heuristic nature of the proposed solution allows accurately tracing the cause of network problems under both known and “Unknown classes” and suggest right direction in corrective optimization actions.

A first aspect of the present disclosure comprises computer implemented methods for performing failure analysis to determine a cause of a network failure in a mobile communication network. The method comprises obtaining session records for a plurality of network nodes in the mobile communication network and creating, from the session records, a list of sessions including, for each session, a termination cause and an indication whether the termination cause is normal or abnormal. The method further comprises, for at least one abnormal termination cause, creating a dataset including subset of the session records for a plurality of abnormal sessions and a plurality of normal sessions, identifying and categorizing unique sequences of signaling messages in the dataset, performing signaling sequence analysis for each unique sequence of signal messages to identify a first point of failure and a last successfully transmitted or received signaling message and extracting a dynamic feature set from the last successfully transmitted or received signaling message. The method further comprises generating a dynamic causal model based on the dynamic feature set and determining a cause of failure for the abnormal termination cause using the dynamic causal model. In some embodiments, determining a cause of failure for the abnormal termination cause using the dynamic causal model comprises generating a Bayesian network based on the dynamic causal model and performing probabilistic causal inference over the Bayesian network..

A second aspect of the disclosure comprises a computing system configured to determine a cause of a network failure in a mobile communication network. The computing system is configured to obtain session records for a plurality of network nodes in the mobile communication network and create, from the session records, a list of sessions including, for each session, a termination cause and an indication whether the termination cause is normal or abnormal. The computing system is further configured to, for at least one abnormal termination cause, create a dataset including subset of the session records for a plurality of abnormal sessions and a plurality of normal sessions, identify and categorize unique sequences of signaling messages in the dataset, perform signaling sequence analysis for each unique sequence of signal messages to identify a first point of failure and a last successfully transmitted or received signaling message and extract a dynamic feature set from the last successfully transmitted or received signaling message. The computing system is further configured to generate a dynamic causal model based on the dynamic feature set and determine a cause of failure for the abnormal termination cause using the dynamic causal model. In some embodiments, the cause of failure is determined by generating a Bayesian network based on the dynamic causal model and performing probabilistic causal inference over the Bayesian network.

A third aspect of the disclosure comprises a computing system configured to determine a cause of a network failure in a mobile communication network. The computing system comprises communication circuitry for communicating over a communication network with other network nodes in the mobile communication network and processing circuitry operatively connected to the communication circuitry. The processing circuitry is configured to obtain session records for a plurality of network nodes in the mobile communication network and create, from the session records, a list of sessions including, for each session, a termination cause and an indication whether the termination cause is normal or abnormal. The processing circuitry is further configured to, for at least one abnormal termination cause, create a dataset including subset of the session records for a plurality of abnormal sessions and a plurality of normal sessions, identify and categorize unique sequences of signaling messages in the dataset, perform signaling sequence analysis for each unique sequence of signal messages to identify a first point of failure and a last successfully transmitted or received signaling message and extract a dynamic feature set from the last successfully transmitted or received signaling message. The processing circuitry is further configured to generate a dynamic causal model based on the dynamic feature set and determine a cause of failure using the dynamic causal model. In some embodiments, the cause of failure is determined by generating a Bayesian network based on the dynamic causal model and performing probabilistic causal inference over the Bayesian network.

A fourth aspect of the disclosure comprises a computer program comprising executable instructions that, when executed by a processing circuitry in a computing system, causes the computing system to perform the method according to the first aspect. A fifth aspect of the disclosure comprises a carrier containing a computer program according to the fourth aspect, wherein the carrier is one of an electronic signal, optical signal, radio signal, or computer readable storage medium.

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 illustrates an exemplary mobile communication network.

Figure 2 illustrates the main functional components in a core network of the mobile communication network.

Figure 3A illustrates an exemplary distribution of failure causes during bearer setup.

Figure 3B illustrates one example of performance degradation during context set up.

Figure 4 illustrates an exemplary computer-implemented procedure for failure detection in a mobile communication network.

Figure 5 illustrates a computer-implemented procedure for determining a predefined set of features for a dynamic causal model.

Figure 6A illustrates a computer-implemented procedure for determining location features for a dynamic causal model.

Figure 6B illustrates variables for determining a non line of sight probability.

Figure 7 illustrates a procedure for determining dynamic features for a dynamic causal model.

Figure 8 is an example of dynamic feature extraction.

Figure 9 is an example of an aggregated table of dynamic features.

Figure 10 is an example of a reference file for determining unique sequences of signaling messages

Figure 11 is an example showing how a first point of failure is determined.

Figure 12 is an example of a parsed external message detail with information elements structured in JavaScript Object Notation (JSON) format.

Figure 13A is an example illustrating the path of an information element in the parsed external message detail shown in Figure 11 . Figure 13B is an example illustrating the unique identifier for the path of an information element.

Figure 14 is an example of significance scores for a set of information elements and associated unique identifiers used to identify a set of dynamic features.

Figure 15 is an example of dimensionality reduction of the set of dynamic features.

Figure 16 illustrates the structure of a dynamic causal model used for failure analysis in a mobile communication network.

Figure 17 is an example of a dynamic causal model used in analyzing initial context set up failures in a mobile communication network.

Figure 18A illustrates the inference method for determining cause of an initial context setup failure.

Figures 18B and 18C illustrate information elements associated with dynamic features used in the casual model which are potential causes of the initial context setup failure.

Figure 19 illustrates a computer-implemented method of determining the cause of a network failure in a mobile communication network.

Figure 20 illustrates a computing system configured to determine the cause of a network failure in a mobile communication network.

Figure 21 illustrates a computing system configured to determine the cause of a network failure in a mobile communication network.

DETAILED DESCRIPTION

The present disclosure relates generally to failure analysis for determining a cause of a network failure in a mobile communication network. The failure analysis techniques are described in the context of a Fifth Generation (5G) communication network according to the Third Generation Partnership Project (3GPP). Those skilled in the art will appreciate that the techniques are also applicable to other communication standards, such a Long Term Evolution (LTE) and Wideband Code Division Multiple Access (WCDMA) to name a few. A network failure in the context of a mobile communication network comprises an event or observation where the assignment of a value or set of values to a variable or set of variables, which are attributes of one or multiple nodes within network or characteristics of the communication channel used by the end-user, results in an exception in protocol and/or unexpected behavior of the communication channel and this leads to poor end-user experience. For example, a wrong configuration of a network parameter can cause failures during data bearer setup resulting in poor end-user experience in terms of network accessibility. Similarly, poor inbuilding coverage causing network coverage issues can impact achievable downlink user throughput negatively and result in poor end-user experience in terms of network integrity.

A network trace comprises a trace collected from a network node (such as an Evolved NodeB (eNB)) comprises a) external signaling messages, defined by 3GPP, sent and received over various interfaces (e.g., X2, S1 , eUu) and b) internal messages specific to a vendor and product version. The network trace is sometime loosely referred to as signaling data.

A failure cause is the cause registered by a network node, responsible for capturing a network trace, through external and/or internal messages, during an abnormal release of a session or an abnormal outcome of a procedure resulting into a network failure. A normal category of cause is also registered in the network traces for all the normally released sessions or sessions with no failure. Based on the domain knowledge and product information, these causes can be categorized into two broad categories - normal and abnormal. For simplicity, this information which carries the cause (normal/abnormal) is referred to herein as the “Termination Cause.”

Session records are records of communication sessions extracted from the network traces. Session records can be considered as the rectangular data that contains various calculated or extracted fields with the help of external and/or internal messages and associated details available in network traces. Commonly available fields in typical session records generated from network traces include the session identifier, user identifier, various radio environment related metrics, termination cause, cause category of sessions (i.e. normal or abnormal), etc.

Information Elements (I Es) . sometime referred to as fields, comprises information that is exchanged as part of external signaling messages. Common lEs include, for example, a subset of network configurations. In the current application, IE and network configuration are used interchangeably in the description and diagrams.

A network node refers to a node in the mobile communication network responsible for collecting a network trace and the network failures captured through traces, which need to be analyzed to improve the performance of the network node. A base station or eNB is an example of a network node.

A user equipment (UE) is a wireless device used by the subscribers or end user of the mobile communication network.

The terms source network node and target network node refers to the network nodes involved in a handover of a UE from a first network node (the source network node) to a second network node (the target network node). Various external messages are exchanged between these two nodes according to protocol defined by 3GPP.

Node performance data comprises the performance counters available from network nodes and used to construct various performance indicators for monitoring and analysis purpose.

Configuration data comprises network configurations or parameters.

Alarm data comprises alarms defined by the vendor for a specific product version.

Physical site data comprises the physical attributes of the network nodes such latitude, longitude, height, tilt, antenna azimuth, etc. Physical site data can also be referred to as network topology data.

Figure 1 illustrates an exemplary mobile communication network 10. Generally, the mobile communication network comprises a plurality of network nodes 25 in a radio access network (RAN) 20 connected to a core network 30. In 5G networks, the network nodes 25 are referred to as 5G NodeBs (gNBs). In LTE network, the network nodes are referred to as Evolved NodeBs (eNBs). Network nodes 25 may also be referred to generically as base stations. The network nodes 25 communicate with user equipment (UEs) 15 and provides connection to the core network 30.

Figure 2 illustrates the main functional components in a 5G core network (5GC) 30.

In one exemplary embodiment, the 5GC 30 comprises a plurality of network functions (NFs), such as a User Plane Function (UPF) 35, an Access And Mobility Management Function (AMF) 40, a Session Management Function (SMF) 45, a Policy Control Function (PCF) 50, a Unified Data Management (UDM) function 55, a Authentication Server Function (AUSF) 60, a Network Exposure Function (NEF) 70, a Network Repository Function (NRF) 75, a Network Slice Selection Function (NSSF) 80, a Uniform Data Repository (UDR) 85, a Network Data Analytics Function (NWDAF) 90 for generating and distributing analytics reports and an Application Function (AF) 95 . These NFs comprise logical entities that reside in one or more core network nodes, which may be implemented by one or more processors, hardware, firmware, or a combination thereof. The functions may reside in a single core network node or may be distributed among two or more core network nodes.

The UEs 15 may comprise any type of communication device equipped with a transceiver for communicating with the network nodes 25. Exemplary UEs 15 comprise cellular telephones, smartphones, tablets, laptop computers, machine type communication (MTC) devices, device to device (D2D) communication devices, etc.

End user experience is one of the major factors that drives business growth for any mobile communication service provider in a given market. There are many factors that govern and influence the end user experience. Quality of the products used in constructing the network, design of the network, product functionalities, available soft and hard infrastructures are a few of the many factors. Despite considerable investment in all of the above and additional factors, no mobile communication network is 100% error free. End-user behavior, dependency on wireless coverage and associated issues, operational activities and challenges, issues with network configuration and, unexpected behavior in product function can cause various type of failures in the network. Continuous optimization of the network is the standard operating procedure for handling such situations. Accurate failure analysis and problem identification leads to appropriate corrective action and is the key behind network optimization.

Many methods are in use for failure analysis in mobile communication networks. The most used methods are rule-based. These rules are acquired or generated based on theoretical understanding of the technology, technical product description and/or historical data. These rules are generally implemented or realized in form of if-then-else conditions with static thresholds or values on various performance metrics and configuration data.

Existing methods of failure analysis work best when the relationship between the network failure and causes of the failure are known based on a priori information. However, there are many scenarios when the network failure cannot be explained or accounted for by the known causes of the failure. In these circumstances, the existing methods of failure analysis may fail to identify the cause and lead to misdirected corrective actions that fail to solve the problem or even worsen the problem.

Figure 3A illustrates a typical distribution of failure causes during bearer setup that can be detected from a trace-based analysis. Some of the causes (e.g., ‘csfb_license_missing’) are self-explanatory. These self-explanatory causes explain the problem and help to decide the necessary corrective actions. However, the majority of failure causes (e.g.,‘failure_in_radio_procedure’, ’ue_capability_enquiry_timeout’ etc.), only help to explain the point of failure according to 3GPP protocol definition and whether a session is abnormal/failed. These causes don’t lead to the specific problem that caused the failure and hence one may not be able to take necessary corrective actions.

Figure 3B illustrates one example of degradation in context setup success rate after implementation of a feature that enables inter-frequency automatic neighbor relation (ANR). Ideally, an engineer would identify failure sessions with respective abnormal Performance Management (PM) events based on 3GPP standard procedures and perform correlation with users’ radio channel conditions. If there is no correlation with coverage issue or specific users, the messages leading to abnormal events and the system configuration within could be reviewed and verified with baseline system configuration, previously made changes to the site or comparison with a successful session to identify the underlying root cause.

Currently, all these steps are entirely manual. There are no tools or well-established methods to organize the configuration details within multiple messages of cell traces, such that these could be compared against a network level configuration baseline or with the IE details of a successful session to identify the accurate potential failure reason. The failure analysis is highly dependent on engineer’s past relevant experience and deep system knowledge. From past experience, it has been noted that this task could take a lot more than few hours depending upon the number of failures in the network under trace analysis, the duration of cell trace and adequate radio access network (RAN) system knowledge of the optimization engineer. Thus, resolution of these problematic cells could be delayed with longer response time and could also result in no possible outcomes of failure reasons or corrective optimization actions.

In this highlighted example of degradation in context setup success rate, the observed failures were rectified after changing the value of the configuration ‘allowedMeasBandwidth’. However, this configuration was not changed during the feature implementation, i.e., the value of the configuration was same before and after implementation of ANR. There was no observed correlation of the failures with poor radio conditions and specific users. Also, no alarms were generated in the concerned network nodes during the observed period of network failure.

The example shown in Figure 3B highlights some of the problems with rule-based methods of failure analysis. The configuration identified as the cause (‘allowedMeasBandwidth’) has no explicit association with the implemented feature (ANR) so would not likely be considered as point of investigation by domain experts. For similar reasons, this configuration is also not used as a feature in the methods described in the prior-art and cause of such failures will remain unanswered. Classification models with static set of features as input can only classify the problems with symptoms for which the relation between the symptoms and the problems are already well known and well established.

Traditional configuration audit and inconsistency checks, to identify any deviation with respect to the recommended set of values, is restricted to a limited set of major configurations, and not for entire set of network configurations. Any time-based correlation of the performance degradation with any change of configuration will be unable to detect such cause of failures as the configuration under discussion was never changed during the period of observed network failure.

Failure analysis techniques need to consider the possibility of other configurations in the network which demonstrate a similar pattern as the example shown in Figure 3B but may not have any influence on the implemented feature or the procedure that fails. It is important to isolate the “most important” configuration(s) from the plurality of configurations for efficient evaluation of potential influence of the same on a failure event under investigation.

In short, the current methods of failure analysis are limited in accuracy, do not always lead to appropriate corrective action and result in considerable “unknown causes” that require further investigation.

One aspect of the disclosure addresses these challenges by providing a method which first isolates the “most important” configurations, in terms of influence on a network failure under investigation, from a plurality of configurations exchanged through signaling of both normal and abnormal sessions. The plurality of configurations dynamically identified are reduced with a scoring mechanism followed by a presence or configuration value pattern correlation-based dimensionality reduction to identify most influential configurations. The isolated configurations combined with the predefined set of features are used to generate a dynamic probabilistic causal model which is later used to construct a Bayesian Network model. The Bayesian network with dynamic structure, for one or multiple failure causes, thus generated is used to identify the most appropriate reason of the network failure by means of causal inference. Figure 4 illustrates an overall procedure 100 for failure analysis according to an embodiment. The procedure may be carried out by a fault detection system, which may be embodied in a computing system or network node. The processing as described herein may be centralized in a single computing device or may be distributed among multiple computing devices. For purposes of illustration, the following description uses the failure example shown in Figure3B.

It is assumed that network traces have already been captured by a group of network nodes for which failure analysis is performed. The fault detection system initially generates session records from the network traces collected by the group of network nodes (105).

The fault detection system prepares a list, denoted L_reason, unique, from the session records. (110). The list comprises the unique set of the termination causes represented in the session records. The termination cause can be either normal or one of many possible abnormal reasons. The session records further include a field called session category that simply indicates whether a session is abnormal.

For each abnormal termination cause, a group of k_abnomau number of abnormal sessions and a group of k_n0rmai,i number of normal sessions are selected randomly from the session records (115). Each normal and abnormal groups of session records thus created consist of N unique session records where:

where n = count of unique abnormal termination causes, and i=0,1 ,2, ..., (n-1). The selected groups of normal and abnormal sessions associated with the termination cause form a dataset from which the dynamic features will be extracted as herein after described.

The N session records randomly selected are passed through three parallel processes, denoted A, B and C respectively that generate three groups of features used in generating a dynamic causal model for failure analysis. As described in more detail below, the dynamic causal model is a causal model for a termination clause that includes one or more dynamic features as described below. The dynamic causal model may additionally include static features based, for example, on channel conditions or user location, with known relationships to the termination causes. Process A identifies or extracts a static set of features reflecting prior knowledge of possible causes of failures (120). Process B identifies or extracts a static set off location features based on location information (125). Process C identifies or extracts a dynamic set of features based on observed unique sequences of signaling messages per session (130). During this dynamic feature extraction process, a point of failure is identified and the most influential configurations are identified and extracted based on the point of failure. Each of these processes will be described in more detail in the following description.

A dynamic causal model with dynamic features is prepared based on the three sets of features (135). A Bayesian network is formed using this casual model (140). Then a probabilistic causal inference analysis is performed over the Bayesian network to identify the possible causes of the network failure (145).

Figure 5 illustrates an exemplary procedure 150 that can be employed in process A to identify static features based on prior knowledge of possible causes of failures. A pre defined set of features are either extracted from the session records or generated by a custom function for the N number of session records (155). These features can include characteristics of the communication channel used for the communication. The user identity is included in this feature set. This set of features mainly represents the prior knowledge of possible issues and is based on the known dependency of any one or multiple termination causes on one or multiple features. Table 1 below describes a possible set of static features, not limited to but containing the Serving Temporary Mobile Subscriber Identity (S_TMSI), the average Chanel Quality Index (CQI), the last reported RSRP, the Timing Advance (TA), the SINR of the Physical Uplink Shared Channel (PUSCH) and Dominance . The extracted set of features is discretized into suitable ranges based on domain knowledge (160).

Table 1 : Example of pre-defined set of features extracted in process A, representing knowledge on the possible known issues

Figure 6A illustrates an exemplary procedure 175 that can be employed in process B to identify static location features based on location information of users associated with each session in the dataset comprising N number of session records. The location (i.e., latitude and longitude) of the sessions or the users can be determined by Minimization of Drive Tests (MDT) as specified by 3GPP, or by means of well-known triangulation methodology. In the exemplary embodiment herein described, there are two main location- based features: location type and non line of sight (NLOS) probability. These features are shown in Table 2 below. The location type feature indicates whether the location is in building, on a residential route, on a major route etc. and helps to characterize the failure sessions in terms of location category. The NLOS is a possibly important factor that represents the possibility of a sudden signal loss or signal fluctuation resulting in abnormal behavior of a communication and followed by a failure or abnormal termination.

Table 2: Example of pre-defined set of features extracted in process A, representing knowledge on the possible known issues

Referring to Figure 6A, the failure analysis system determines a location of all the sessions which are part of the dataset (180). The static location features are then extracted from each session based on the locations of the sessions (185). This set of location-based features also represent the prior knowledge of possible issues and is based on the known dependency of any one or multiple of termination causes on one or multiple features. Finally, the extracted set of location features are discretized or categorized (190). Figure 6B illustrates one realization of the NLOS probability. The NLOS probability is is given by:

NLOS_PROBABILITY = 0.4^*(h_b,avg/h_bs) + 0.4^*(1-(d_n / d_bs)) + 0.2^*B_nOrm Eq. (2) where h _,avg is the average height of building in the proximity of the user location, C_b is the number or count of buildings in the proximity of the UE, h _s is the base station height, d_nb is the distance to the nearest building, H_nb is the height of the nearest building, dbs is the distance from the base station and B_n0rm = If C_b >5 then 1/d_bs else C_b/(5^*d_bs). The variables used in the computation of the NLOS probability are illustrated in Figure 6B.

Figure 7 illustrates an exemplary procedure 200 for extracting and identifying the dynamic features used in the dynamic causal model. The procedure begins with analyzing and categorizing the signaling message sequences in each of the sessions represented in the dataset of N sessions. All unique sequences of signal messages observed in the data set are identified (205). A category is defined for each unique sequence of signaling messages (210). The categories of the unique sequences of signaling messages are attached to or associated with the session records as a categorical feature (215). These unique sequences of signaling messages are used as a feature representing an observed signaling sequence pertaining to a termination cause.

Fig. 8 pictorially depicts the categorization of unique sequences of signaling messages. The category of the sequence is based on the unique observed sequence of the signaling messages. In Figure 8, the columns with name “Sig. Msg. k”, where k=1 ,2,...,j and j represents the kth message within the reference sequence of messages which is created based on a unique list of all the observed signaling messages for all the sessions records within the N number of session records. The sequence of these messages are determined by the timing of the messages in a given session. Value 1 in the sequence denotes that the signaling message was observed and value 0 denotes that the signaling message was not observed for a given session. The sequence of 1 and 0 under the columns with name “Sig. Msg. k”, where k=1 ,2. j in the Fig. 8, are for illustration purposes only.

Fig. 9 shows an example of the actual outcome of the sequence categorization process. It is an aggregated table over the termination cause and the termination category. This attribute of the sessions, based on the unique sequence of the signaling messages, is a feature denoted herein by the name “EVENT_SEQUENCE.”

After the identification of the unique signaling message sequences and categorization of the sessions based on the same, a dynamic feature extraction process is performed for each termination cause (220-255). A first point of failure is identified for each of the abnormal category of the sessions per termination cause (225). The first point of failure is defined as any one of the following, whichever is observed last, based on the timing in the signaling messages in the sequence: i. The last successfully transmitted signaling message, by the first network node, with missing response, according to protocol, from the UE or the target second network node towards which the last signaling message was transmitted. ii. The last successfully received signaling message, by the first network node, with missing response, according to protocol, sent towards the UE or the target second network node from which the last signaling message was received. iii. The last received failure (negative) response from the UE or the other target second network node, according to protocol, towards which the last signaling message was transmitted successfully by the first network node. iv. The last sent failure (negative) response towards the UE or the target second network node, according to protocol, from which the last signaling message was received successfully by the first network node.

The identification of the first point of failure is achieved by comparison of the unique sequence of the failed sessions with a configured reference file containing the expected sequence of signaling messages. These groups are determined by the knowledge of paired request/response signaling messages to/from a first network node with the expected response of the same, from/to a UE or a second network node, as defined by the protocol. Fig. 10 illustrates one representative example of the structure of the reference file.

Referring back to Figure 7, based on the first point of failure, the last signaling message successfully sent or received by the network node is identified (230). Figure 11 illustrates one example showing how the first point of failure and the last successful signaling message are identified. In this example the termination cause is “failure in radio interface procedure” captured during the initial context setup. Figure 11 illustrates 1 abnormal session on the left and three normal sessions. In the abnormal session, the RRC- Connection_Reconfiguration_Complete message was not received by the network node.

This is the first point of failure. The last message successfully transmitted message is the RRC_Connection_Reconfiguration message. The last successful signaling message, identified in the last step, helps to identify any anomaly in the information elements (lEs) based on the observability (i.e., presence and absence) of the IE or the value of the IE, when compared between the group of abnormal (i.e., kabnormau number of sessions) and the corresponding normal sessions (i.e., k_n0rmai,i number of sessions) for a given abnormal termination cause. First, all the lEs, associated containers or path and values are extracted from the signaling message identified in the last step, for all the sessions (i.e., N number of sessions) (235). Signaling messages exchanged between the network nodes or between the UE and the network nodes comprise various lEs and when parsed can be visualized, stored or processed in form of a Extended Markup Language (XML) or JavaScript Object Notation (JSON) like structure where each IE has a specific value (or set of values), attribute (or set of attributes) and parent.

Figure 12 shows one example of a parsed external message detail for the Radio Resource Control (RRC) protocol. Due to the structure of the message, it is important to identify an IE based on its parent and full path because an IE with same name can be part of a message detail with a different path and used by the network for a different purpose. A path can be defined as full path, starting from the root, of an element (i.e., an IE) in an XML or JSON like file structure. Figure 13A below is an illustration of the path of the IE “Ite- rrc.carrierFreq.”

A unique identifier is generated for each of the IE based on the IE and the path of the IE (240). In one embodiment, the unique identifier comprises three indices, x1 , x2 and x3, where x1 is a cause index of the failure or abnormal termination cause, x2 is a path index of the path of the IE in a list that contains all the unique paths, and x3 is an IE index of the IE in a list that contains all unique lEs. Figure 13B is an example of the naming convention for the lEs in each unique signaling message. In this example, the Ite-rrc.carrierFreq IE is identified a E0P60_C13, where E0 is the index of the termination cause, P60 is the path index and C13 is the IE index. As noted above, the path index and IE index are necessary to uniquely identify an IE in a signaling message. A significance score is generated for each unique identifier based on the observability and value of the IE in both the abnormal and normal sessions in the dataset (245). The significance score indicates the significance of the identifier. The significance score is given by:

If the signaling message is observed for both groups, S_{0 .} is given by:

Else, S_{0 i} = 0. The term c_{ab o} is the count of abnormal sessions in K_{abnormal i} observed with the IE present and c_{n o}is the count of normal sessions in K_{normal i} .

If the signaling message is observed for both groups, S_v is given by:

S_{v i} = abs(^c _lv,c_{2 v}) where,

Else, S_{v .} = 0. The term

is the count of abnormal sessions in K ,

observed with the IE present and c_{n v} is the count of normal sessions in K_{normal i} . The significance score, s_sig , indicates either a) the significance of presence/absence of the IE in the signaling message or b) the significance of presence of the IE in the signaling message with a given value. Higher the value of the score, higher the significance of the IE. The significance score is compared to a significance threshold, denoted S_thresh, to identify the most significant lEs and associated identifiers (250). The significance threshold defines a lower limit for significance to remove less significant lEs. The set of identifiers meeting the threshold requirement comprises the dynamic set of features from which a part of the dynamic causal model is generated as described below.

Figure 14 is one example of significance scores for a set of lEs and filtering based on a significance threshold. In this example, Sthresh is the 80th percentile of S_Sig, «abnormal, i = 100 and K_normau = 200. In th is example, 8 unique identifiers meet the threshold requirement.

The dynamic feature extraction process (220- 250) as herein described is repeated for each termination cause. Processing exits the dynamic feature extraction process once the last termination cause is processed (255).

The group of dynamic features extracted from the session records after iterating over all the termination causes are then reduced to a smaller group based on the unique pattern of 0s and 1s of the dynamic features (260). While the dynamic feature set is reduced, a dynamic feature mapping is generated showing the relationship of the dynamic features in the reduced set with all other dynamic features. This dynamic feature mapping is used during the inference process for determining possible causes of a network failure.

Fig.15 shows an example of such dimensionality reduction based on the high correlation of the dynamic features and mapping of the unique dynamic features in the reduced set with all other dynamic features. In this example three identifiers (associated with the lEs) with unique patterns are used as the dynamic features:

1 ) E0P18_C77 (Ite-rrc.gapOffset)

2) E0P60_C13 (Ite-rrc.carrierFreq )

3) E0P28_C38 (Ite-rrc.triggerQuantity) These dynamic features are selected based on the unique patterns of 1s and Os in the left side of Figure 15. The right side of Figure 15 illustrates the relationship mapping of the dynamic features based on the unique patterns of 1s and Os.

A dynamic causal model is prepared using the three sets of features generated as described above (Fig. 4., 135). Those feature sets include: i. Static set of features reflecting prior knowledge of possible causes of failures ii. Static set of features based on location information iii. Dynamic set of features - based on observed unique sequence of signaling messages per session and the most influential configuration(s) extracted from first point of failure.

A Bayesian network is formed using the dynamic causal model prepared in the previous step (Fig. 4, 140). The cause of failure is then determined by causal inference over the Bayesian network (Fig. 4, 145)

Figure 16 shows a generic structure of the dynamic causal model according to an embodiment.

Figure 17 is one example of a dynamic causal model used in analyzing the initial context setup failure with cause “failure in radio interface procedure”.

The dynamic causal model as shown in Figures 16 and 17 is used to determine one or more possible causes of a network failure. To infer the appropriate cause of the failure, interventional and marginal queries are used on the model and a score is calculated for each of the features which are part of the dynamic causal model. This score is calculated based on the following formula:

where

X represents a feature Yfaii represents a failure with a given cause

Xiow represents the discritized value or range of X which is considered as unaccept and abnormal according to the domain experts

Xhigh represents the discritized value or range of X which is considered as accepta and normal according to the domain experts

In the above formula, the first factor P(Xi₀ ) represent the probability of observing feature X within the unacceptable or abnormal range. The second factor provides an estimate of the achievable reduction in Yf_aii, i.e. the reduction in the failure with a given cause with respect to following two scenarios:

I. If the feature X assume values of Xlow, keeping values of all other features unchanged

II. If the feature X assume values of Xhigh, keeping values of all other features unchanged

Using the derived inference score as reference, a priority is set for the features.

Higher the priority of the score, higher the probability that the feature is the cause of the failure. As an example, Figure 18A illustrates the inference method for initial context setup failure with cause “failure in radio interface procedure”. In this example, the top 3 possible causes of the failure, which are further explained according to the domain knowledge, are:

1 . A specific event sequence, i.e., the sequence of events with missing RRC_CONNECTION_RECONFIGURATION_COMPLETE causes the failure.

2. The lEs represented by and mapped to E0P18_C77 and E0P60_C13 in Fig.

18B and Fig. 18C respectively.

The lEs represented by and mapped to E0P18_C77 are system constants. Based on domain knowledge, these are not likely the cause of the network failure.

The lEs represented by and mapped to E0P60_C13 comprise configurable parameters. These parameters can be modified, according to the feasibility and domain knowledge, as part of corrective action. Note that in this example, the feature with the unique identifier E0P60_C15 is included in the list of possible causes. Adjustment of this parameter resulted in the improvement of the context setup success rate as shown in Figure

3B. Although one approach to causal inference has been described for purposes of illustration, those skilled in the art will appreciate that many different approaches for performing causal inference are available to determine the cause and effect from the dynamic causal model.. Based on application and viability different methods can be applied. Following are the broader approaches to solve these problems of finding out causality:

1 . Probabilistic Causality - These methods involve the causal graphs and the probability distributions which come together to be called causal Bayes Net or Bayesian Network. These methods tend to draw a probabilistic relationship between two defined events.

2. Counterfactual Based Causality - This method mainly deals with the ‘what-if factors and infers the causality from intervening the control tests. It is impossible by definition and practicality to observe the effect of two contradicting factors in a control test, interventions by the means of Counterfactual highlights the level of causality with the effect.

3. Structural Equation Models - Structural Equation Models or SEMs are form of statistical analysis technique. This mechanism is a combination of factor analysis and multiple regression analysis which are used to find relations between measured variables and the latent constructs.

4. Deep Learning Models - Using deep neural net architectures, and advancements in these algorithms have led to the use of such methods in various fields including causal inference as well. The output of each intermediate layers or the hidden layers can be used in various forms to extract representation of the input measures and analyzing its similarity to measure causalities. Also, networks like Attention-based Deep Convolutional Neural Net is applied to find out the causal relationships in time series data.

In many cases, the probabilistic causality approach is preferred because there may not be a direct cause and effect relationship that can be developed. Using the probabilistic causality approach, there may still be high degree of relationship among them. This is because of the latent events in between the cause and effect which may not be previously known or found by other models. Counterfactual based causality and structural equation models both deal in some form of calculating the intervention-based causality. Neural networks also demand more granular time series data with usual problems of parametric optimization.

The dynamic causal model as described herein enables failure analysis to be performed for both failures with well-known causes and failures with unknown causes not previously identified.

The dynamic set of features extracted from signaling data based on failure cause codes observed in session records, on top of a predefined set of features is used to construct the dynamic causal model which is dynamic in structure. This dynamic causal model provides explanation to a failure signature from a network optimization perspective and extends the possible outcomes to the unknown problems automatically by the system.

Isolation of the most influential configurations exchanged through signaling allows creation of a dynamic set of features with definite causal relation. These dynamic features helps to explore the reasons for failures previously placed under “Unknown classes”. Drastic reduction in the overall number of features leads to reduction in computational complexity associated with creating a model with large number of dynamic features possible with signaling data. Proposed solution endorses practical realization of a cost effective and efficient system for network failure analysis of unknown problems in a mobile communication network.

The dynamic causal model used in the proposed method is combination of two structures. The first structure is based on static set of features and the second part is completely dynamic, based on most influential features. The static structure allows to include the problems which are already known and established. The dynamic part of the model allows to identify newer problems. Thus, the proposed method helps to realize a unified effective system capable of performing comprehensive failure analysis of both known and newer-unknown problems in a mobile communication network on its own.

In comparison to other models mentioned like Random forest, decision trees, SVM etc., the proposed method is independent of the data labelled with known problems. Both the normal/successful and abnormal/failed sessions are identified dynamically from the signaling data based on the cause available from external messages and according to the protocol defined by 3GPP are used. Further, the application of causal inference on a Bayesian Network model, that is created using static features and dynamic set of most influential features along with their causal relation, yields the most appropriate cause of the failure. The resultant accurate failure analysis and problem identification would lead to appropriate corrective action.

Figure 19 illustrates a computer-implemented exemplary method 300 of determining a cause of a network failure in a mobile communication network in a mobile communication network. The method comprises obtaining session records for a plurality of network nodes in the mobile communication network (310) and creating, from the session records, a list of sessions including, for each session, a termination cause and an indication whether the termination cause is normal or abnormal (320). The method 300 further comprises, for at least one abnormal termination cause, creating a dataset including a subset of the session records for a plurality of abnormal sessions and a plurality of normal sessions (330), identifying and categorizing unique sequences of signaling messages in the dataset (340), performing signaling sequence analysis for each unique sequence of signal messages to identify a first point of failure and a last successfully transmitted or received signaling message (350) and extracting a dynamic feature set from the last successfully transmitted or received signaling message (360). The method 300 further comprises generating a dynamic causal model based on the dynamic feature set (370) and determining a cause of failure for the abnormal termination cause using the dynamic causal model (380).

In some embodiments, the cause of failure is determined by generating a Bayesian network based on the dynamic causal model and determining a cause of failure by probabilistic causal inference over the Bayesian network.

In some embodiments of the method 300, creating a dataset including subset of the session records for a plurality of abnormal sessions and a plurality of normal sessions comprises randomly selecting k abnormal sessions associated with the abnormal termination cause and randomly selecting k normal sessions.

In some embodiments of the method 300, identifying and categorizing unique sequences of signaling messages in the dataset comprises, for each unique signaling sequence, determining a sequence of signaling messages based on a reference sequence and generating a signature comprising a sequence of bit values, wherein each bit value indicates whether a corresponding one of the signaling messages in the sequence of signaling messages is present.

In some embodiments of the method 300, identifying a first point of failure for one of the unique sequences of signal messages comprises detecting one of a missing response message from a wireless device to a network node, a missing response message from the network node to the wireless device, a response message from the network node to the wireless device with a failure indication or a response message from the wireless device to the network node with a failure indication.

In some embodiments of the method 300, identifying the last successfully transmitted or received signaling message comprises identifying one of the last signaling message successfully transmitted by the network node before the missing response message from the wireless device, the last signaling message successfully transmitted by the wireless device before the missing response message from the network node, the last signaling message successfully transmitted by the network node before the response message from the network node with a failure indication or the last signaling message successfully transmitted by the wireless device before the response message from the wireless device with a failure indication.

In some embodiments of the method 300, extracting the dynamic feature set from the last successfully transmitted or received signaling message comprises identifying a set of information elements associated with the last successfully transmitted signaling message, for each of the information elements, generating a unique identifier for the information element and computing a significance score, ranking the information elements based on the significance scores and selecting the dynamic feature set based on the ranking.

In some embodiments of the method 300, computing a significance score of one of the information elements comprises iterating over the session records to determine, for each session record, whether the information element is present in the session record and, if present, a value of the information element, generating an aggregate observability score based on the presence or absence of the information element in the session records, generating an aggregate value score based on the values of the information element in the session records and computing a significance score based on the aggregate observability score and aggregate value score.

Some embodiments of the method 300 further comprise selecting a reduced dynamic feature set representing less than all the information element sin the dynamic feature set and generating a mapping of the dynamic features based on correlation between the dynamic features in the reduced dynamic feature set and the other dynamic features in the dynamic feature set.

In some embodiments of the method 300, determining a cause of failure based on probabilistic causal inference performed over the Bayesian network comprises computing the inference scored based on a probability of observing a particular feature within an abnormal range and an estimate of an achievable reduction in a particular termination cause.

In some embodiments of the method 300, the estimate of an achievable reduction in a particular termination cause is obtained based on a difference between a conditional probability of the termination cause assuming that the particular feature is in an abnormal range, and a conditional probability of the termination cause assuming that the particular feature is in a normal range.

Some embodiments of the method 300 further comprise extracting, from the session records, a set of one or more pre-defined static features with a known relationship to a termination cause and including the predefined set of static features in the dynamic causal model for the related termination cause.

In some embodiments of the method 300, the pre-defined static features include one or more channel characterization features indicative of channel conditions.

Some embodiments of the method 300 further comprise extracting, from the session records, a set of one or more pre-defined location features based on user location and with a known relationship to a termination cause and including the predefined set of location features in the dynamic causal model for the related termination cause.

In some embodiments of the method 300, the set of pre-defined location features includes at least one of a location type and a non line-of-sight (NLOS) probability.

Figure 20 illustrates a computing system 400 for performing failure analysis in a mobile communication network. The computing system 400comprises an obtaining unit 410, a list creation unit 420, a dataset creation unit 430, an identifying unit 440, a sequence analysis unit 450, an extracting unit 460, a modeling unit 470, an optional network creation unit 480 and a determining unit 490. The various units 410 -490 can be implemented by hardware and/or by software code that is executed by one or more processors or processing circuits. The obtaining unit 410 is configured to obtain session records for a plurality of network nodes in the mobile communication network. The list creation unit 420 is configured to create, from the session records, a list of sessions including, for each session, a termination cause and an indication whether the termination cause is normal or abnormal. The dataset creation unit 430 is configured to create a dataset including subset of the session records for a plurality of abnormal sessions and a plurality of normal sessions. The identifying unit 440 is configured to identify and categorize unique sequences of signaling messages in the dataset. The sequence analysis unit 450 is configured to perform signaling sequence analysis for each unique sequence of signal messages to identify a first point of failure and a last successfully transmitted or received signaling message. The extracting unit 460 is configured to extract a dynamic feature set from the last successfully transmitted or received signaling message. The modeling unit 470 is configured to generate a dynamic causal model based on the dynamic feature set. The network creation unit 480, if present, is configured to generate a Bayesian network based on the dynamic causal model. The determining unit 490 is configured to determine a cause of failure using the dynamic causal model. In one embodiment, the determining unit 490 is configured to determine a cause of failure for the abnormal termination cause using the dynamic causal model by generating a Bayesian network based on the dynamic causal model and performing probabilistic causal inference over the Bayesian network.

Figure 21 illustrates a computing system 500 configured to perform failure analysis in a mobile communication network as herein described. The computing system 500 may, for example, be embodied in a network node in a core network of the mobile communication network. In some embodiments, the functionality of the computing system 500 may be distributed over multiple network nodes or computing devices. The computing system 500 comprises communication circuitry 520, processing circuitry 530, and memory 590.

The communication circuitry 520 comprises a network interface for communicating with network nodes in the mobile communication network.

The processing circuitry 530 controls the overall operation of the computing system 500 and performs the methods as herein described. The processing circuit 530 may comprise one or more microprocessors, hardware, firmware, or a combination thereof.

Memory 540 comprises both volatile and non-volatile memory for storing computer program code and data needed by the processing circuitry 530 for operation. Memory 540 may comprise any tangible, non-transitory computer-readable storage medium for storing data including electronic, magnetic, optical, electromagnetic, or semiconductor data storage. Memory 540 stores a computer program 550 comprising executable instructions that configure the processing circuitry 530 to implement the failure detection methods as described herein. A computer program 550 in this regard may comprise one or more code modules corresponding to the means or units described above. In general, computer program instructions and configuration information are stored in a non-volatile memory, such as a ROM, erasable programmable read only memory (EPROM) or flash memory. Temporary data generated during operation may be stored in a volatile memory, such as a random access memory (RAM). In some embodiments, computer program 550 for configuring the processing circuitry 530 as herein described may be stored in a removable memory, such as a portable compact disc, portable digital video disc, or other removable media. The computer program 550 may also be embodied in a carrier such as an electronic signal, optical signal, radio signal, or computer readable storage medium.

Those skilled in the art will also appreciate that embodiments herein further include corresponding computer programs. A computer program comprises instructions which, when executed on at least one processor of an apparatus, cause the apparatus to carry out any of the respective processing described above. A computer program in this regard may comprise one or more code modules corresponding to the means or units described above.

Embodiments further include a carrier containing such a computer program. This carrier may comprise one of an electronic signal, optical signal, radio signal, or computer readable storage medium.

In this regard, embodiments herein also include a computer program product stored on a non-transitory computer readable (storage or recording) medium and comprising instructions that, when executed by a processor of an apparatus, cause the apparatus to perform as described above.

Embodiments further include a computer program product comprising program code portions for performing the steps of any of the embodiments herein when the computer program product is executed by a computing device. This computer program product may be stored on a computer readable recording medium.

Additional embodiments will now be described. At least some of these embodiments may be described as applicable in certain contexts and/or wireless network types for illustrative purposes, but the embodiments are similarly applicable in other contexts and/or wireless network types not explicitly described.

Claims

CLAIMS What is claimed is:

1 . A computer implemented method (300) of determining a cause of a network failure in a mobile communication network, the method (300) comprising: obtaining (310) session records for a plurality of network nodes in the mobile communication network; creating (320), from the session records, a list of sessions including, for each session, a termination cause and an indication whether the termination cause is normal or abnormal; for at least one abnormal termination cause: creating (330) a dataset including a subset of the session records for a plurality of abnormal sessions and a plurality of normal sessions; identifying and categorizing (340) unique sequences of signaling messages in the dataset; performing (350) signaling sequence analysis for each unique sequence of signal messages to identify a first point of failure and a last successfully transmitted or received signaling message; extracting (360) a dynamic feature set from the last successfully transmitted or received signaling message; generating (370) a dynamic causal model based on the dynamic feature set; determining (380) a cause of failure for the abnormal termination cause using the dynamic causal model.

2. The method (300) of claim 1 , wherein creating a dataset including subset of the session records for a plurality of abnormal sessions and a plurality of normal sessions comprises: randomly selecting k abnormal sessions associated with the abnormal termination cause; randomly selecting k normal sessions.

3. The method (300) of claim 1 or 2, wherein identifying and categorizing unique sequences of signaling messages in the dataset comprises, for each unique signaling sequence: determining a sequence of signaling messages based on a reference sequence; and generating a signature comprising a sequence of bit values, wherein each bit value indicates whether a corresponding one of the signaling messages in the sequence of signaling messages is present.

4. The method (300) of claim 3, wherein identifying a first point of failure for one of the unique sequences of signal messages comprises detecting one of: a missing response message from a wireless device to a network node; a missing response message from the network node to the wireless device; a response message from the network node to the wireless device with a failure indication; or a response message from the wireless device to the network node with a failure indication.

5. The method (300) of claim 4, wherein identifying the last successfully transmitted or received signaling message comprises identifying one of: the last signaling message successfully transmitted by the network node before the missing response message from the wireless device; the last signaling message successfully transmitted by the wireless device before the missing response message from the network node; the last signaling message successfully transmitted by the network node before the response message from the network node with a failure indication; the last signaling message successfully transmitted by the wireless device before the response message from the wireless device with a failure indication;

6. The method (300) of claim 5, extracting the dynamic feature set from the last successfully transmitted or received signaling message comprises: identifying a set of information elements associated with the last successfully transmitted signaling message; for each of the information elements, generating a unique identifier for the information element and computing a significance score; ranking the information elements based on the significance scores; and selecting the dynamic feature set based on the ranking.

7. The method (300) of claim 6 wherein computing a significance score of one of the information elements comprises: iterating over the session records to determine, for each session record, whether the information element is present in the session record and, if present, a value of the information element; generating an aggregate observability score based on the presence or absence of the information element in the session records; generating an aggregate value score based on the values of the information element in the session records; compute a significance score based on the aggregate observability score and aggregate value score.

8. The method (300) of claim 7, further comprising: selecting a reduced dynamic feature set representing less than all the information element sin the dynamic feature set; and generating a mapping of the dynamic features based on correlation between the dynamic features in the reduced dynamic feature set and the other dynamic features in the dynamic feature set.

9. The method (300) of any one of claim 1 - 8, wherein determining a cause of failure based on probabilistic causal inference performed over the Bayesian network comprises computing the inference scored based on: probability of observing a particular feature within an abnormal range; and an estimate of an achievable reduction in a particular termination cause.

10. The method (300) of claim 9 wherein the estimate of an achievable reduction in a particular termination cause is obtained based on a difference between: a conditional probability of the termination cause assuming that the particular feature is in an abnormal range; and a conditional probability of the termination cause assuming that the particular feature is in a normal range.

11 . The method (300) according to any one of claims 1 - 10, further comprising: extracting, from the session records, a set of one or more pre-defined static features with a known relationship to a termination cause; and generating the dynamic causal model further based on the predefined set of static features.

12. The method (300) of claim 11 , wherein the pre-defined static features include one or more channel characterization features indicative of channel conditions.

13. The method (300) according to any one of claims 1 - 12, further comprising: extracting, from the session records, a set of one or more pre-defined location features based on user location and with a known relationship to a termination cause; and generating the dynamic causal model further based on the predefined set of location features.

14. The method (300) according to claim 13, wherein the set of pre-defined location features includes at least one of a location type and a non line-of-sight (NLOS) probability.

15. The method (300) of any one of claims 1 - 14, determining a cause of failure for the abnormal termination cause using the dynamic causal model comprises generating a Bayesian network based on the dynamic causal model and performing probabilistic causal inference over the Bayesian network.

16. A computing system (400, 500) configured to determine a cause of a network failure in a mobile communication network, the computing system comprising: communication circuitry for communicating over a communication network with other network nodes in the mobile communication network; and processing circuitry operatively connected to the communication circuitry and configured to: obtain session records for a plurality of network nodes in the mobile communication network; create, from the session records, a list of sessions including, for each session, a termination cause and an indication whether the termination cause is normal or abnormal; for at least one abnormal termination cause: create a dataset including a subset of the session records for a plurality of abnormal sessions and a plurality of normal sessions; identify and categorize unique sequences of signaling messages in the dataset; perform signaling sequence analysis for each unique sequence of signal messages to identify a first point of failure and a last successfully transmitted or received signaling message; extract a dynamic feature set from the last successfully transmitted or received signaling message; generate a dynamic causal model based on the dynamic feature set; generate a Bayesian network based on the dynamic causal model; determine a cause of failure based on probabilistic causal inference performed over the Bayesian network.

17. The computing system (400, 500) according to claim 16, wherein the processing circuitry is further configured to perform the method according to any one of claims 2 - 15.

18. A computing system (400, 500) configured to determine a cause of a failure in a mobile communication network, the computing system being configured to: obtain session records for a plurality of network nodes in the mobile communication network; create, from the session records, a list of sessions including, for each session, a termination cause and an indication whether the termination cause is normal or abnormal; for at least one abnormal termination cause: create a dataset including a subset of the session records for a plurality of abnormal sessions and a plurality of normal sessions; identify and categorize unique sequences of signaling messages in the dataset; perform signaling sequence analysis for each unique sequence of signal messages to identify a first point of failure and a last successfully transmitted or received signaling message; extract a dynamic feature set from the last successfully transmitted or received signaling message; generate a dynamic causal model based on the dynamic feature set; generate a Bayesian network based on the dynamic causal model; determine a cause of failure based on probabilistic causal inference performed over the Bayesian network.

19. The computing system (400, 500) according to claim 18, wherein the processing circuitry is further configured to perform the method according to any one of claims 2 - 15.

20. The computing system (400, 500) of any one of claims 16 - 19 wherein the computing system comprises a network node in the mobile communication network.

21. A computer program (550) comprising executable instructions that, when executed by a processing circuitry (530) in a computing system (500), causes the computing system (500) to perform any one of the methods of embodiments 1 - 15.

22. A carrier containing a computer program (550) of claim 21 , wherein the carrier is one of an electronic signal, optical signal, radio signal, or computer readable storage medium.