WO2023154051A1

WO2023154051A1 - Determining root causes of anomalies in services

Info

Publication number: WO2023154051A1
Application number: PCT/US2022/016062
Authority: WO
Inventors: Anirban Basu; Smit ZAVERI
Original assignee: Hitachi Vantara LLC
Current assignee: Hitachi Vantara LLC
Priority date: 2022-02-11
Filing date: 2022-02-11
Publication date: 2023-08-17
Anticipated expiration: 2024-08-11

Abstract

In some examples, a computing device may receive sampled metric data from a services system that includes a plurality of service computing devices configured to provide a plurality of services. The computing device may generate time series data based on the sampled metric data to detect an anomaly based on the sampled metric data. The computing device may employ clustering based at least on detecting the anomaly based on the sampled metric data to determine a subset of services affected by the anomaly. In addition, the computing device may apply at least one ranking algorithm to the subset of services affected by the anomaly for determining one or more services of the subset of services as a root cause of the anomaly.

Description

DETERMINING ROOT CAUSES OF ANOMALIES IN SERVICES

TECHNICAL FIELD

[0001] This disclosure relates to the technical field of detecting anomalies in services.

BACKGROUND

[0002] Computer services platforms may provide the ability to scale services up and down according to varying requirements of service customers. In some cases, these services platforms may employ a plurality of microservice applications that may provide services to other services on the platform and/or to the service customers. The emergence of containerized applications has enabled easier abstraction of microservice applications, modularity for implementation, reusability, and independent scaling of application development. However, the high complexity of interrelationships and interactions between containerized microservice applications within a microservice architecture can make anomaly detection and diagnosis challenging. For instance, operators of microservice architectures may periodically collect information to monitor and troubleshoot service performance and reliability problems. Nevertheless, localizing the source (i.e., root cause) of an anomaly in a large-scale microservice architecture including a plurality of microservice applications can be difficult.

SUMMARY

[0003] In some implementations, a computing device may receive sampled metric data from a services system that includes a plurality of service computing devices configured to provide a plurality of services. The computing device may generate time series data based on the sampled metric data to detect an anomaly based on the sampled metric data. The computing device may employ clustering based at least on detecting the anomaly based on the sampled metric data to determine a subset of services affected by the anomaly. In addition, the computing device may apply at least one ranking algorithm to the subset of services affected by the anomaly for determining one or more services of the subset of services as a root cause of the anomaly. BRIEF DESCRIPTION OF THE DRAWINGS

[0004] The detailed description is set forth with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The use of the same reference numbers in different figures indicates similar or identical items or features.

[0005] FIG. 1 illustrates an example architecture of a system configured to determine a root cause of an anomaly detected in a service system according to some implementations.

[0006] FIG. 2 is a flow diagram illustrating an example process that may be executed for determining a root cause of an anomaly according to some implementations.

[0007] FIG. 3 illustrates an example data structure including example time series data determined from metric data according to some implementations.

[0008] FIG. 4 illustrates an example process showing a relationship between the clustering and ranking operations according to some implementations.

[0009] FIG. 5 illustrates an example of a process for performing clustering of services for determined anomalies according to some implementations.

[0010] FIG. 6 illustrates an example of interdependencies between services in a cluster according to some implementations.

[0011] FIG. 7 illustrates an example causal graph according to some implementations.

[0012] FIG. 8 illustrates select example components of one or more analysis computing devices that may be used to implement some of the functionality of the systems described herein. [0013] FIG. 9 illustrates select example components of an administrative device according to some implementations.

DESCRIPTION OF THE EMBODIMENTS

[0014] Some implementations herein are directed to techniques and arrangements for determining root causes of anomalies in services systems that include a plurality of services. For instance, some examples herein may be applied in a services system that includes a plurality of containers, with each container including a respective containerized application and one or more libraries for executing the containerized application. Some examples herein provide a system that analyzes availability, latency, workload, memory, processing usage, and the like, of services for localizing one or more root causes of one or more anomalies in the services. The system may group potentially anomalous services into clusters and may generate respective graphs based on individual clusters. The system may employ a graph-based centrality measure-based approach to efficiently search for the root causes within the respective clusters and may apply one or more ranking algorithms for selecting a likely root cause of an anomaly. Accordingly, examples herein can take into account the relationships between services when identifying a particular service as the likely root cause of an anomaly.

[0015] The examples herein solve the problem of accurately determining root causes of anomalies in microservice-based platforms where application monitoring and telemetry is available. The approach herein localizes multiple root causes (if applicable) in a microservices system, and provides accurate information regarding the root causes to help a user immediately identify one or more services to investigate further. Additionally, the examples herein may employ signals typically available in microservice systems such as latency, traffic (e.g., workload), error information (e.g., container will not start error message, resource cannot be found error message, etc.), and saturation (e.g., CPU usage, memory usage, etc.), without requiring large amounts of historical data. The solution set forth herein is generic and robust, may easily be extended to new platforms, and is able to detect one or more root causes of one or more anomalies with high accuracy.

[0016] For discussion purposes, some example implementations are described in the environment of an analysis computing device that monitors a services system for detecting anomalies and determining root causes of the detected anomalies. However, implementations herein are not limited to the particular examples provided, and may be extended to other types of computing system architectures, other types of service systems configurations, other types of services, other types of clustering algorithms, other types of ranking algorithms, and so forth, as will be apparent to those of skill in the art in light of the disclosure herein.

[0017] FIG. 1 illustrates an example architecture of a system 100 configured to determine a root cause of an anomaly detected in a service system according to some implementations. The system 100 includes one or more analysis computing devices 102 that are able to communicate with, or otherwise coupled to a plurality of service computing devices 104, such as through one or more networks 106. In some examples, the service computing devices 104 form a services system 105 that is also able to communicate over the network(s) 106 with one or more client devices 108. For instance, the service computing devices 104 may provide the client devices 108 with one or more services 109, as discussed additionally below.

[0018] The service computing devices 104 and analysis computing device(s) 102 may also be able to communicate over the network(s) 106 with at least one administrative device 110. For example, the administrative device 110 may be used for configuring the service computing devices 104 and/or the analysis computing device(s) 102, such as for receiving root cause information and taking remedial action for correcting detected anomalies, as discussed additionally below. The client device(s) 108 and the administrative device(s) 110 may be any of various types of computing devices, as discussed additionally below. A client user 112 may be associated with a respective client device 108 such as through a client application 114 or the like. Similarly, an administrative user 120 may be associated with a respective administrative device 110 such as through an administrative application 122.

[0019] In the example of FIG. 1, the service computing devices 104 include a management computing device 124 and a plurality of worker computing devices 126(1)- 126(N) that together make up the services system 105, which may also be referred to as a microservices system, for providing the services 109 to the client devices 108. In this example, the management computing device 124 includes a control program 128, an application programming interface (API) server 130, a scheduling program 132, and configuration data 134.

[0020] In addition, each worker computing device 126(1)- 126(N) may also be referred to as a “worker node”, and includes a respective node management application 136(1)-136(N) and a respective routing application 138(1)-138(N). Further, each worker computing device 102 may include at least one container group 140 (abbreviated as “CG” in FIG. 1). Each container group 140 may include one or more containers 142. In the illustrated example, the first worker computing device 102(1) includes at least a first container group 140(1 a) and a second container group 140(lb), each of which includes at least one respective container 142. Similarly, the Nth worker computing device 102(N) includes at least a first container group 140(Na) and a second container group 140(Nb), each of which includes at least one respective container 142. As mentioned above, each respective container 142 may include a respective containerized application (e.g., executable code) configured to provide a respective service. One example of a services system for providing and managing a large number of containerized applications as respective services is a KUBERNETES® system, which is available from the Cloud Native Computing Foundation of San Francisco, California, USA; however, implementations herein are not limited to use with KUBERNETES® systems.

[0021] The management computing device 124 serves as the primary controlling unit of the service computing devices 104, and may manage the workload and direct communications across the services system 105. In some examples, the configuration data 134 may include the configuration data for the service computing devices 104, which may represent the overall state of the services system 105 at any given point in time. Furthermore, the API server 130 processes and validates REST API requests and updates the state of API objects in the configuration data 134 to enable configuration of workloads and containers 142 across the worker nodes 126. Additionally, the scheduling program 132 may track the resource usage on each respective worker node 126 to ensure that the scheduled respective workload does not exceed the available resources on the respective worker node 126. Accordingly, the scheduling program 132 may be configured to match the workload demand with the available resources of the worker nodes 126. In addition, the control program 128 may be executed to motivate the services system 105 toward a desired state, such as by communicating with the API server 130 to create, update and/or delete resources of the worker nodes.

[0022] The worker computing devices 126 may each include a respective instance of the node management application 134 that manages starting, stopping, and maintaining the application containers 142 organized into the respective container groups 140, and as directed by the management computing device 124. In some examples, each worker computing device 126 may relay its own status periodically (e.g., every second, every couple of seconds, etc.) to the management computing device 124, such as via a heartbeat message. If the management computing device 124 detects failure of a worker computing device 126, such as due to a missing heartbeat, the management computing device 124 may relaunch the container group(s) 140 from the failed worker computing device 126 on one or more different worker computing devices 126. [0023] The container groups 140 may each contain one or more containers 142 that are colocated on the same worker computing device 126. As one non-limiting example, each container group 140 be assigned a unique IP address within the services system 105, which may allow applications to use ports without the risk of conflicts. Within a respective container group 140, all the containers 142 within the same container group 140 may be able to reference each other. In this example, for a first container 142 within a first container group 140 to access a second container 142 within a second container group 140, the first container may use the IP address of the second container group 140.

[0024] An individual container 142 may contain all of the packages for running a respective service. Accordingly, a container 142 may be the lowest level of a service, and may contain the executable application, libraries used by the executable application, and their dependencies. The containers 142 may be lightweight, standalone, executable software packages that include everything needed to run an application, i.e., executable code, runtime, system tools, system libraries, and settings. Multiple containers 142 can run on the same worker computing device 126 and share an OS kernel with other containers 142, each running as isolated processes in user space. Selected ones of the containers 142 may be executed, and respective containers 142 may call other respective containers 142, as needed, to provide the services 109 to the client devices 108.

[0025] In the illustrated example, the analysis computing device(s) 102 may execute a root cause analysis program 150, a scheduling program 152, and a data analysis program 154. In addition, the analysis computing device(s) 102 may maintain or may access a metrics data database 156 that stores metrics data 158. For example, as discussed additionally below, the metrics data database 156 may receive the metrics data 158 from the service computing devices 104. In addition, the analysis computing device(s) 102 may maintain or may access a root cause data structure 160 that may include root cause data 162, which may include the results provided by the root cause analysis program when determining a root cause of a respective anomaly in the services of the services system 105. Details of determining one or more root causes by the root cause analysis program based at least on the metrics data 158 are discussed additionally below.

[0026] The one or more networks 106 may include any suitable network, including a wide area network, such as the Internet; a local area network (LAN), such as an intranet; a wireless network, such as a cellular network, a local wireless network, such as Wi-Fi, and/or short-range wireless communications, such as BLUETOOTH®; a wired network including Fibre Channel, fiber optics, Ethernet, or any other such network, a direct wired connection, or any combination thereof. Accordingly, the one or more networks 106 may include both wired and/or wireless communication technologies. Components used for such communications can depend at least in part upon the type of network, the environment selected, or both. Protocols for communicating over such networks are well known and will not be discussed herein in detail. Implementations herein are not limited to any particular type of network as the networks 106.

[0027] Each client device 108 may be any suitable type of computing device such as a desktop, laptop, tablet computing device, mobile device, smart phone, wearable device, terminal, and/or any other type of computing device able to send data over a network. Client users 112 may be associated with client device(s) 108 such as through a respective user account, user login credentials, or the like. Furthermore, the client device(s) 108 may be configured to communicate with the analysis computing device(s) 102 through the one or more networks 106, through separate networks, or through any other suitable type of communication connection. Numerous other variations will be apparent to those of skill in the art having the benefit of the disclosure herein.

[0028] In some implementations, each client device 108 may include a respective instance of the client application 114 that may execute on the client device 108, such as for communicating with the service computing devices 104, such as for receiving services 109 from the service computing devices 104, or the like. In some cases, the application 114 may include a browser or may operate through a browser, while in other cases, the application 114 may include any other type of application having communication functionality enabling communication with the service computing devices 104 over the one or more networks 106. [0029] In addition, the administrative device 110 may be any suitable type of computing device such as a desktop, laptop, tablet computing device, mobile device, smart phone, wearable device, terminal, and/or any other type of computing device able to send data over a network. The administrative user 120 may be associated with the administrative device 110, such as through a respective administrator account, administrator login credentials, or the like. Furthermore, the administrative device 110 may be able to communicate with the service computing devices 104 and the analysis computing device(s) 102 through the one or more networks 106, through separate networks, or through any other suitable type of communication connection.

[0030] Further, the administrative device 110 may include a respective instance of the administrative application 122 that may execute on the administrative device 110, such as for communicating with the service computing devices 104 and/or the analysis computing device(s) 102. For example, the administrative device 110 may send instructions for configuring and managing the service computing devices 104. In addition, the administrative device 110 may send instructions for configuring and managing the analysis computing device(s) 102. In some cases, the management computing device 124 and/or the analysis computing device(s) 102 may include a management web application (not shown in FIG. 1) or other application to enable the administrative device 110 to configure operations performed by the management computing device 124 and/or the analysis computing device(s) 102. In some cases, the administrator application 122 may include a browser or may operate through a browser, while in other cases, the administrator application 122 may include any other type of application having communication functionality enabling communication with the root cause analysis program 150, the scheduling program 152, the data analysis program 154, and/or other applications and data on the analysis computing device(s) 102. The administrator application 122 may similarly communicate with the programs and applications on the service computing devices 104. Additionally, in some cases, a first administrative device 110 and a first administrative user 120 may communicate with the analysis computing device(s) 102, and a different administrative device 110 and different administrative user 120 may communicate with the service computing devices 104. Numerous other variations will be apparent to those of skill in the art having the benefit of the disclosure herein.

[0031] FIGS. 2, 4, and 5 include flow diagrams illustrating example processes. The processes are illustrated as collections of blocks in logical flow diagrams, which represent a sequence of operations, some or all of which may be implemented in hardware, software or a combination thereof. In the context of software, the blocks may represent computer-executable instructions stored on one or more computer-readable media that, when executed by one or more processors, program the processors to perform the recited operations. Generally, computer-executable instructions include routines, programs, objects, components, data structures and the like that perform particular functions or implement particular data types. The order in which the blocks are described should not be construed as a limitation. Any number of the described blocks can be combined in any order and/or in parallel to implement the process, or alternative processes, and not all of the blocks need be executed. For discussion purposes, the processes are described with reference to the environments, frameworks, and systems described in the examples herein, although the processes may be implemented in a wide variety of other environments, frameworks, and systems.

[0032] FIG. 2 is a flow diagram illustrating an example process 200 that may be executed for determining a root cause of an anomaly according to some implementations. In some cases, the process 200 may be executed at least in part by the analysis computing device 102, such as by executing the root cause analysis program 150.

[0033] In this example, the service computing devices 104 are illustrated as including a plurality of services 202(1 )-202(9). For example, each service 202 may correspond to one of the containers 142 discussed above with respect to FIG. 1. Furthermore, as discussed above, each service 202 may be able to communicate with the other services 202 such as for making a call to request that a service be performed and/or for executing a service in response to receiving a call from another service. The service computing devices 104 may be configured to provide metrics data 158 to the metrics data database 156. For example, the metrics data 158 may include multiple types of metrics such as an amount of network traffic, storage input/output information, latency, memory and processing capacity usage, error metric, and the like, for the services 202.

[0034] The scheduling program 152 may cause execution of the root cause analysis program 150 to be performed periodically such as based on a schedule by which the scheduling program 152 is configured to operate. The root cause analysis program 150 may be executed according to any desired schedule, such as every 5 minutes, every 15 minutes, every hour, twice a day, once a day, and so forth. Blocks 204-216 set forth an example algorithm that may be executed by the analysis computing device 102 by executing the root cause analysis program 150.

[0035] At 204, the computing device may receive metrics data 158 from the metrics data database 156. For example, as mentioned above, the scheduling program 152 may cause the root cause analysis program 150 to execute periodically. Upon initiation, the root cause analysis program 150 may cause the analysis computing device 102 to access the metrics data database 156 for obtaining recently received metrics data 158. [0036] At 206, the computing device may generate time series data from the received metrics data, and may also generate predicted data. For instance, time series data represents how a metric changes over time. As one example, individual metrics of the received metrics data 158 may include corresponding timestamps that indicate a time associated with each value for each individual metric. Accordingly, a time series may be generated as a set of values for a selected metric that are associated with respective points in time, and may indicate how the selected metric has changed over a specified period of time. In addition, the computing device may determine forecasted, expected, or otherwise predicted values for the same metrics. For example, one or more forecasting algorithms may be employed to determine a predicted value for a particular metric, such as based in part on past performance of the metric, and thereby provide a value for comparison with a measured value in the time series data for use in detecting anomalies in the time series data.

[0037] At 208, the computing device may detect one or more anomalies from the time series data based on the actual values and the predicted values. For example, the computing device may compare an actual value determined for a metric from the time series data with a predicted value predicted for the metric for the specified time period using the one or more prediction algorithms. For example, if the actual value differs from the predicted value by a threshold deviation, or other threshold amount, the metric may be determined to be indicative of an anomaly. Details of determining the deviation are discussed additionally below with respect to FIG. 5.

[0038] At 210, the computing device may perform clustering to determine interdependent services 202 corresponding to a suspected anomaly. An example of the clustering process is discussed additionally below with respect to FIG. 5. Clustering of services may be performed for each suspected anomaly identified at 208.

[0039] At 212, the computing device may execute one or more ranking algorithms on the clusters identified at 210. The ranking may use a variation of page ranking as one example, and/or may employ a degree-centrality ranking method as another example. In some cases the outputs of the two ranking algorithms may be aggregated to determine a normalized rank which is a weighted average of the ranks of each of the two ranking algorithms. Additional details of the ranking are discussed below, e.g., with respect to FIGS. 4, 6, and 7.

[0040] At 214, the computing device may select one or more services as the likely root cause of one or more detected anomalies based on the results of the ranking. For example, the computing device may select the highest ranked service within a cluster as the root cause of the corresponding anomaly. [0041] At 216, the computing device may send the root cause result to the root cause data structure 160 for storage as root cause of data 162. Additionally, or alternatively, the computing device may send root cause information 218 to the administrative device 110 such as to inform the administrative user 120 of the root cause of an anomaly in one or more services 202. In some examples, the administrative user 120 may send one or more instructions to the service computing devices 104 based on the received root cause information 218 such as for correcting the one or more anomalies in the one or more services 202 in which anomalies were identified, redeploying an anomalous service to a different worker computing device, or the like.

[0042] In addition, in some examples, a data analysis program 154 may access the root cause data 162 and the root cause data structure 160 for performing a data analysis on the root causes identified in anomalies in the services such as for determining trends in root causes of service anomalies and the like.

[0043] Additionally, or alternatively, the computing device 102 may send root cause information to an artificial intelligence operations (AIOPS) platform 220 that may include a root cause page 222, an incident page 224, and/or an alerts page 226 for providing information about root causes of anomalies in the services 202. As one example, the root cause analysis program 150 may be executed on or by the AIOPS platform 220 as an independent application that monitors the services system 105. The output of the root cause analysis program 150 may be displayed on a web page associated with the AIOPS platform 220 and may also be sent as a message to the administrative computing device 110.

[0044] FIG. 3 illustrates an example data structure 300 including example time series data determined from metric data according to some implementations. In this example, the data structure 300 includes a destination service 302, a source service 304, a namespace 306, an actual value 308, a predicted value 310, and a deviation score 312. For example, the destination service A is the service that was a target of a call from service B and is a member of namespace A. The actual value of the metric was determined to be “7”, while the predicted value for this metric was expected to be “20”. The analysis computing device(s) 102 calculates the deviation score 312 to determine a deviation between the actual value 308 and the predicted value 310, which in this case is determined to be 0.963. Details of calculating the deviation score 312 are discussed additionally below with respect to FIG. 5.

[0045] FIG. 4 illustrates an example process 400 showing a relationship between the clustering and ranking operations according to some implementations. The process 400 may be performed by the analysis computing device(s) 102 through execution of the root cause analysis program 150. [0046] At 402, the computing device may determine anomalous services based on deviation scores. For instance, the deviation scores may be determined as discussed above with respect to FIG. 2, and as discussed additionally below with respect to FIG. 5.

[0047] At 404, the computing device may perform clustering to identify the anomalous services and related services affected by the anomaly. Details of the clustering algorithm are discussed additionally below with respect to FIG. 5.

[0048] At 406, the computing device may isolate the affected service groups 410 (i.e., those services that are affected by an indication of an anomaly) from normal service groups 408.

[0049] At 412, the computing device performs clustering to determine root cause clusters such as a first cluster 413(1), a second cluster 413(2), and so forth. For example, once an affected service group is identified the clustering may be performed based on the deviation score and further based on the principles of the Generalized Ripple Effect (GRE). The GRE scores may be used to identify clusters of affected services. Identifying clusters of affected services enables identification of multiple root causes that may affect the service system. For instance, the number of anomalies may correspond to the number of clusters.

[0050] At 414, following the clustering, the computing device may perform ranking to determine a root cause of the anomaly by ranking the plurality of services within each cluster. Accordingly, the examples herein are able to identify multiple anomalies in the same process and determine a respective root cause for each anomaly identified.

[0051] At 416, when performing the ranking, the computing device may create a causal graph of the services within each cluster, and use one or more ranking algorithms on the services within the respective clusters determined at 412 to determine which of the services is the likely root cause of the anomaly for that cluster. Details of the ranking process are discussed additionally below with respect to FIGS. 6 and 7.

[0052] At 418, the computing device may determine a highest ranked service for each cluster. In some examples, this may include determining a first root cause 420(1) and a second root cause 420(2), such as in the case that there are two anomalies corresponding to two different clusters.

[0053] FIG. 5 illustrates an example of a process 500 for performing clustering of services for determined anomalies according to some implementations. The process 500 may be performed by the analysis computing device(s) 102 through execution of the root cause analysis program 150.

[0054] At 502, the computing device may generate time series data to determine predicted and actual values for received metrics. For example, as discussed above, with respect to FIG. 2, individual metrics of the received metrics data 158 may include corresponding timestamps that indicate a time associated with each value for each individual metric. Accordingly, a time series may be generated for a selected metric as a set of values that are associated with respective points in time, and which may indicate how the selected metric has changed over a specified period of time. In addition, the computing device may determine forecasted, expected, or otherwise predicted values for the same metrics. For example, one or more forecasting algorithms may be employed to determine a predicted value for a particular metric, such as based in part on past performance of the metric, and may thereby provide a value for comparison with a measured value in the time series data for use in detecting anomalies in the time series data.

[0055] At 504, the computing device may capture and create a dependency graph to determine interdependencies between multiple services for the specified time period. For instance, the dependency graph may determine interdependent connections between multiple services for a given time. As one example, the graph may have nodes representing services and directed edges representing dependencies between the respective services.

[0056] At 506, the computing device may preprocess data to input to the clustering algorithm. For instance, the data may be arranged as key value pairs, e.g., sample data: (key, value) : attributes : (actual, predicted), where a sample attribute may include: (destination service, source service , namespace).

[0057] At 508, the computing device may determine anomalous attributes based on calculating a deviation score for respective metrics using the actual and predicted values for the respective metrics. As one example, the following equation may be employed for determining the deviation score for a selected attribute:

where e represents (service, host) combination, /is the predicted value, and v is the actual value in the data structure 300 discussed above with respect to FIG. 3. Based on the deviation score, the computing device may filter normal and anomalous sets of attribute combinations for determining attribute combinations which are anomalous. For example, the filtering may be based on a threshold level for the deviation score which when exceeded indicates that the attribute and, therefore the corresponding services, are anomalous.

[0058] At 510, the computing device may apply a density clustering algorithm on the attribute combinations which are determined to be anomalous (see blocks 512-520 below). For instance, the clustering technique employed may be a bottom-up searching technique based on the Generalized Ripple Effect that receives as input filtered abnormal attribute combinations and which outputs a cluster of a plurality of affected services including a service that is the root cause of the anomaly. Furthermore, while one example of a clustering algorithm is provided herein, implementations herein are not limited to any specific algorithm for performing the clustering.

[0059] At 512, as part of the clustering algorithm, the computing device may calculate histograms of the deviation scores for each attribute.

[0060] At 514, the computing device may calculate centers (maximums of each of the histograms) and boundaries (minimums of each of the histograms).

[0061] At 516, the computing device may calculate a last attribute combination L such that: bins[L] < bins [center].

[0062] At 518, the computing device may calculate a first attribute combination F such that: bins[F] > bins [center].

[0063] At 520, the computing device may determine one or more anomalous clusters based on {attribute combinations belonging to anomalous list where: L< attribute combination < F}. As an example, a cluster may be of the following format: {(cluster_range: (cluster with attribute mapping (destination_service_id, source_ service_id, namespace)))}. For instance, Cluster_l: {(- 1.2, -1.1): [(6, 1, 0)] } and Cluster_2: {(-0.89, -0.7): [(5, 1, 0), (5, 2, 0), (5, 3, 0)] }.

[0064] FIG. 6 illustrates an example 600 of interdependencies between services in a cluster according to some implementations. As one example, the root cause analysis program 150 may receive services call data that is included in the metrics data 158, and that indicates calls made by the respective services to other respective services in the services system 105 discussed above with respect to FIGS. 1 and 2. In this example, suppose that there are four services that have been identified in a cluster that are related based on interdependencies determined from the services call data. For instance, suppose that a first service 602(1) and a second service 602(2) are executed on a first worker computing device 126(1), and that a third service 602(3) and a fourth service 602(4) are executed on a second worker computing device 126(2) further, suppose that the first service 602(1) places calls 604 to the second service 602(2), and the places calls 606 to the third service 602(3). In addition, suppose that the second service 602(2) places calls 608 to the third service 602(2), and places calls 610 to the fourth service 602(4). Furthermore, suppose that communication between the third service and the fourth service is affected by an anomaly as indicated at 612. In addition, suppose that the first service 602-1 is the root cause of the anomaly but this information has not yet been determined by the system. Accordingly, following the clustering to determine the four interdependent services 602-1-602-4, the analysis computing device 102 may create a causal graph based on the determined indirect interdependencies as discussed additionally below with respect to FIG. 7. [0065] FIG. 7 illustrates an example causal graph 700 according to some implementations. The causal graph 700 is a data structure that is constructed based on the services deployment structure (e.g., device-to- service associations) and detected service-to- service calls (synchronous and asynchronous), and provides an indication of the impact that the respective services have on each other. For instance, in the example of FIG. 6, the third service 602(3) and the fourth service 602(4) are executed on the second worker computing device 126(2). If the second worker computing device 126(2) is affected by an anomaly, then there is a chance that the third service 602(3) and the fourth service 602(4) would be affected. Accordingly, there is a causal relationship between the worker computing devices and the services that each worker computing device hosts. Additionally, based on a history of the metric type of each service, some examples herein may also include linking asynchronous calls that are not included in service call data received with the metrics data 158.

[0066] The causal graph data structure may be derived at least based on using host-level and service-level metrics that may be determined using causal algorithms, such as Spirtes and Glymour’s PC algorithm, as well as using the deployment structure and service-to- service call data. Similarly, in the example of FIG. 6, the first service 602(1) calls the second service 602(2), and the second service 602(2) calls the third service 602(3), and so forth. This information also helps in building the causal dependency of services through the call data included in the metrics data 158. The causal graph 700 may be employed during application of the page ranking algorithm discussed below to select the service that is identified as the root cause of the anomaly from among the affected services of the example of FIG. 6.

[0067] In the example of FIG. 6, there are four services 602(1 )-602(4). Accordingly, when creating a corresponding causal graph 700, the computing device may generate four nodes 702(1)- 702(4) corresponding to the respective four services 602(l)-602(4) discussed above with respect to FIG. 6. In addition, the computing device may add a directed edge corresponding to the calls made by each of the respective services 602(1 )-602(4) to others of the respective services 602(1)- 602(4). Accordingly, based on the call data, a first edge 704 is established from the first node 702(1) to the second node 702(2), and a second edge 706 is established from the first node 702(1) to the third node 702(3). Similarly, based on the call data, a third edge 708 is established from the second node 702(2) to the third node 702(3), and a fourth edge is established from the second node 702(2) to the fourth node 702(4). Additionally, a fifth edge 712 is established from the third node 702(3) to the fourth node 702(4).

[0068] For each cluster, a top-down approach may be applied to segregate and identify a service that is a root cause of an anomaly. For instance, if an attribute combination indicates a root cause of an anomaly is within the corresponding cluster, this implies that all of the descending attribute combinations will also be part of the same cluster and may be abnormal based on the generalized ripple effect. To capture this implication, a generalized potential score (gps) may be calculated using the following equation:

where SI represents a set of combinations of attributes of abnormal service and S2 represents a set of combinations of attributes of normal service. Additionally, f is the vector of the predicted value and v is the vector of the actual value in the data structure 300 discussed above with respect to FIG. 3. Further, a is a vector of an expected value, which can be represented by converting the GRE definition to mathematical form to obtain the following equation:

where e is the (service, host) combination and S is the root cause.

[0069] A sample output of the gps equation (EQ2) may be as follows: root_causes — ((6, 1, 0), { ‘score’: 1.0, ’attribute’: (6, 1, 0)}) root_causes — ((5, 1, 0), { ‘score’: 0.5339747498576821, ‘attribute’: (5, 1, 0)}).

[0070] The computing device may further perform data preprocessing such as by checking for asynchronous calls among the micro services which may not be explicit in the service-to-service call data. Accordingly, before applying the graph-based approach for detecting anomalous service interactions, some examples herein may first attempt to enhance the causal graph 700 based on partial correlation among different service interactions.

[0071] After the causal graph 700 has been generated, the computing device may apply one or more ranking algorithms to identify the root cause in the cluster of the anomaly. As one example, the computing device may employ a page ranking algorithm based on the PageRank algorithm. In this example, the computing device may traverse the causal graph 700 and may assign a numerical weight to each service (node 702(l)-702(4)) in the graph 700 based on the number of connections associated with each node 702. For instance, the resulting weight may represent the relative importance of a service within the network. As one example, the computing device may employ a random walk among the connections (edges) between the nodes 702 and, based on the number of connections between the respective services, assigns a respective weight to each node. The node (service) that has the highest weight is the highest ranked service within the graph 700, and is therefore selected as the root cause of the anomaly. Additionally, in some examples, based on empirical data, a personalization value may be employed for providing higher accuracy in the page ranking result. As one example, an output of the page ranking algorithm may be as follows: { ‘service_l’: 0.13, ‘service_2’: 0.23, ‘service_3’: 0.22, ‘service_4’: 0.42}.

[0072] Additionally, or alternatively, the computing device may execute a degree centrality ranking algorithm. Centrality in the causal graph 700 gives an idea of how important a node is in the graph 700 based on its connection with its neighbors. Use of degree-based centrality scores and other centrality algorithms may be used for detecting the root cause service as well. For example, the degree centrality ranking algorithm may apply a principle that the higher the indegree of a service in the graph 700, the greater the importance of that respective service. Consequently, in the case of an anomalous attribute network, the greater importance of a particular service indicates that the particular service is the root cause of the anomaly. The degree centrality algorithm may be based on the following equation:

where m(z, j)=l if there is a connection between service z and service j. An example output of the degree centrality algorithm area may be as follows: { ‘service_l’: 0, ‘service_2’: 1, ‘service_3’: 1, ‘service_4’: 2}.

[0073] As yet another example, the computing device may aggregate the outputs of the page ranking algorithm and the degree centrality algorithm by calculating a normalized rank, which may be a weighted average of the ranks from each of the two ranking algorithms (page ranking and degree centrality). In this case, a final ranking of the services represented by the causal graph 700 may be as follows: { ‘service_l’: 4, ‘service_2’: 2, ‘service_3’: 3, ‘service_4’: 1 }. Furthermore, while several example ranking algorithms are described in the implementations herein, numerous other variations will be apparent to those of skill in the art having the benefit of the disclosure herein.

[0074] FIG. 8 illustrates select example components of one or more analysis computing devices 102 that may be used to implement some of the functionality of the systems described herein. The analysis computing device(s) 102 may include one or more servers or other types of computing devices that may be embodied in any number of ways. For instance, in the case of a server, the programs, other functional components, and data may be implemented on a single server, a cluster of servers, a server farm or data center, a cloud-hosted computing service, and so forth, although other computer architectures may additionally or alternatively be used. Multiple analysis computing device(s) 102 may be located together or separately, and organized, for example, as servers, virtual servers, server banks, and/or server farms. The described functionality may be provided by the servers of a single entity or enterprise, or may be provided by the servers and/or services of multiple different entities or enterprises.

[0075] In the illustrated example, the analysis computing device(s) 102 includes, or may have associated therewith, one or more processors 802, one or more computer-readable media 804, and one or more communication interfaces 806. Each processor 802 may be a single processing unit or a number of processing units, and may include single or multiple computing units, or multiple processing cores. The processor(s) 802 can be implemented as one or more central processing units, microprocessors, microcomputers, microcontrollers, digital signal processors, graphics processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. As one example, the processor(s) 802 may include one or more hardware processors and/or logic circuits of any suitable type specifically programmed or configured to execute the algorithms and processes described herein. The processor(s) 802 may be configured to fetch and execute computer-readable instructions stored in the computer-readable media 804, which may program the processor(s) 802 to perform the functions described herein.

[0076] The computer-readable media 804 may include volatile and nonvolatile memory and/or removable and non-removable media implemented in any type of technology for storage of information, such as computer-readable instructions, data structures, program modules, or other data. For example, the computer-readable media 804 may include, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, optical storage, solid state storage, magnetic tape, and magnetic disk storage, network or cloud storage, array storage, network attached storage, a storage area network, or any other medium that can be used to store the desired information and that can be accessed by a computing device. Depending on the configuration of the analysis computing device(s) 102, the computer-readable media 804 may be a tangible non- transitory medium to the extent that, when mentioned, non-transitory computer-readable media exclude media such as energy, carrier signals, electromagnetic waves, and/or signals per se. In some cases, the computer-readable media 804 may be at the same location as the analysis computing device(s) 102, while in other examples, the computer-readable media 804 may be partially remote from the analysis computing device(s) 102.

[0077] The computer-readable media 804 may be used to store any number of functional components that are executable by the processor(s) 802. In many implementations, these functional components comprise instructions or programs that are executable by the processor(s) 802 and that, when executed, specifically program the processor(s) 802 to perform the actions attributed herein to the analysis computing device(s) 102. Functional components stored in the computer-readable media 804 may include the root cause analysis program 150, the scheduling program 152, and the data analysis program 154, each of which may include one or more computer programs, applications, executable code, or portions thereof. Further, while these programs are illustrated together in this example, during use, some or all of these programs may be executed on separate analysis computing device(s) 102.

[0078] In addition, the computer-readable media 804 may store data, data structures, and other information used for performing the functions and services described herein. For example, the computer-readable media 804 may store the metrics data database 156 including the metrics data 158, and the root cause data structure 160 including the root cause data 162. The analysis computing device 102 may also include or maintain other functional components and data, which may include programs, drivers, etc., and the data used or generated by the functional components. Further, the analysis computing device 102 may include many other logical, programmatic, and physical components, of which those described above are merely examples that are related to the discussion herein.

[0079] The one or more communication interfaces 806 may include one or more software and hardware components for enabling communication with various other devices, such as over the one or more network(s) 106. For example, the communication interface(s) 806 may enable communication through one or more of a LAN, the Internet, cable networks, cellular networks, wireless networks (e.g., Wi-Fi) and wired networks (e.g., Fibre Channel, fiber optic, Ethernet), direct connections, as well as close-range communications such as BLUETOOTH®, and the like, as additionally enumerated elsewhere herein.

[0080] In addition, the service computing devices 104 may include hardware configurations and components similar to those discussed above for the analysis computing device(s) 102, but with different functional components and data, e.g., as discussed above with respect to FIG. 1. Thus, the service computing devices 104 may each include at least one or more processors 802, one or more computer-readable media 804, and one or more communication interfaces 806.

[0081] FIG. 9 illustrates select example components of an administrative device 110 according to some implementations. The administrative device 110 may include any of a number of different types of computing devices such as a desktop, laptop, tablet computing device, mobile device, smart phone, wearable device, terminal, workstation, server, and/or any other type of computing device able to send and receive data over a network.

[0082] In the example of FIG. 9, the administrative device 110 includes components such as at least one processor 902, one or more computer-readable media 904, one or more communication interfaces 906, and one or more input/output (I/O) devices 908. Each processor 902 may itself comprise one or more processors or processing cores. For example, the processor(s) 902 can be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, graphics processors, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. In some cases, the processor(s) 902 may be one or more hardware processors and/or logic circuits of any suitable type specifically programmed or configured to execute the algorithms and processes described herein. The processor(s) 902 can be configured to fetch and execute computer-readable processorexecutable instructions stored in the computer-readable media 904.

[0083] Depending on the configuration of the administrative device 110, the computer- readable media 904 may be an example of tangible non-transitory computer storage media and may include volatile and nonvolatile memory and/or removable and non-removable media implemented in any type of technology for storage of information such as computer-readable processor-executable instructions, data structures, program modules or other data. The computer- readable media 904 may include, but is not limited to, RAM, ROM, EEPROM, flash memory, solid-state storage, magnetic disk storage, optical storage, and/or other computer-readable media technology. Further, in some cases, the administrative device 110 may access external storage, such as storage arrays, network attached storage, storage area networks, cloud storage, or any other medium that can be used to store information and that can be accessed by the processor(s) 902 directly or through another computing device or network. Accordingly, the computer-readable media 904 may be computer storage media able to store instructions, modules or components that may be executed by the processor(s) 902. Further, when mentioned, non-transitory computer- readable media exclude media such as energy, carrier signals, electromagnetic waves, and signals per se.

[0084] The computer-readable media 904 may be used to store and maintain any number of functional components that are executable by the processor 902. In some implementations, these functional components comprise instructions or programs that are executable by the processor 902 and that, when executed, implement operational logic for performing the actions and services attributed above to the administrative device 110. Functional components of the administrative device 110 stored in the computer-readable media 904 may include the administrative application 122, as discussed above, which may enable the administrative device 110 to interact with the service computing devices 104 and/or the analysis computing device(s) 102.

[0085] In addition, the computer-readable media 904 may also store data, data structures and the like, that are used by the functional components. Depending on the type of the administrative device 110, the computer-readable media 904 may also optionally include other functional components and data, which may include applications, programs, drivers, etc., and the data used or generated by the functional components. Further, the administrative device 110 may include many other logical, programmatic and physical components, of which those described are merely examples that are related to the discussion herein.

[0086] The communication interface(s) 906 may include one or more interfaces and hardware components for enabling communication with various other devices, such as over the network(s) 106 or directly. For example, communication interface(s) 906 may enable communication through one or more of the Internet, cable networks, cellular networks, wireless networks (e.g., Wi-Fi) and wired networks, as well as close-range communications such as BLUETOOTH®, BLUETOOTH® low energy, and the like, as additionally enumerated elsewhere herein.

[0087] The administrative device 110 may further include the one or more I/O devices 908. The I/O devices 908 may include a display, which may include a touchscreen as an input device. The I/O devices 908 may further include speakers, a microphone, a camera, and various user controls (e.g., buttons, a joystick, a keyboard, a keypad, touchpad, mouse, etc.), a haptic output device, and so forth. Additionally, the administrative device 110 may include various other components that are not shown, examples of which include removable storage, a power source, such as a battery and power control unit, and so forth. Further, the client computing device(s) 108 may include hardware structures and components similar to those described for the administrative device 110, but with one or more different functional components, e.g., as discussed above with respect to FIG. 1.

[0088] The example processes described herein are only examples of processes provided for discussion purposes. Numerous other variations will be apparent to those of skill in the art in light of the disclosure herein. Additionally, while the disclosure herein sets forth several examples of suitable frameworks, architectures and environments for executing the processes, implementations herein are not limited to the particular examples shown and discussed. Furthermore, this disclosure provides various example implementations, as described and as illustrated in the drawings. However, this disclosure is not limited to the implementations described and illustrated herein, but can extend to other implementations, as would be known or as would become known to those skilled in the art.

[0089] Various instructions, methods, and techniques described herein may be considered in the general context of computer-executable instructions, such as computer programs and applications stored on computer-readable media, and executed by the processor(s) herein. Generally, the terms program and application may be used interchangeably, and may include instructions, routines, scripts, modules, objects, components, data structures, executable code, etc., for performing particular tasks or implementing particular data types. These programs, applications, and the like, may be executed as native code or may be downloaded and executed, such as in a virtual machine or other just-in-time compilation execution environment. Typically, the functionality of the programs and applications may be combined or distributed as desired in various implementations. An implementation of these programs, applications, and techniques may be stored on computer storage media or transmitted across some form of communication media.

[0090] Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described. Rather, the specific features and acts are disclosed as example forms of implementing the claims.

Claims

1. A system comprising: one or more processors configured by executable instructions to perform operations comprising: receiving, by the one or more processors, sampled metric data from a services system that includes a plurality of service computing devices configured to provide a plurality of services; generating, by the one or more processors, time series data based on the sampled metric data to detect an anomaly based on the sampled metric data; employing, by the one or more processors, clustering based at least on detecting the anomaly to determine a subset of services that are affected by the anomaly, wherein the subset of services is a subset of the plurality of services; and applying, by the one or more processors, at least one ranking algorithm to the subset of services for determining one or more services of the subset of services as a root cause of the anomaly.

2. The system as recited in claim 1, the operations further comprising: determining based at least on service-to-service call information, interdependencies between the services in the subset of services.

3. The system as recited in claim 2, the operations further comprising: generating a graph data structure corresponding to the interdependencies between the services in the subset of services; and selecting a service of the subset of services as the root cause of the anomaly based at least in part on the graph data structure.

4. The system as recited in claim 1, wherein the operation of applying at least one ranking algorithm comprises at least one of: applying a PageRank algorithm; or applying a degree centrality algorithm; or applying results of both the PageRank algorithm and the degree centrality algorithm.

5. The system as recited in claim 1, the operation of detecting the anomaly in the sampled metric data further comprising: determining a predicted value for a metric included in the metric data for a specified time; detecting the anomaly based on a difference between the predicted value and an actual value for the metric for the specified time.

6. The system as recited in claim 5, the operation of detecting the anomaly based on the difference between the predicted value and the actual value for the metric for the specified time further comprising determining a deviation score based on the predicted value and the actual value and comparing the deviation score to a threshold.

7. The system as recited in claim 6, the operations further comprising: determining attribute combinations based on the time series data; determining normal and anomalous sets of attribute combinations based at least in part on respective deviation scores; and applying clustering to the anomalous sets of attribute combinations to determine the subset of services.

8. The system as recited in claim 1, wherein the sampled metric data comprises at least one of: latency data, workload data, error message data, or processing capacity usage data.

9. The system as recited in claim 1, wherein the sampled metric data includes service-to- service call data and worker computing device-to- service association data.

10. The system as recited in claim 1, wherein the anomaly is a first anomaly, the subset is a first subset, and the root cause is a first root cause of the first anomaly, the operations further comprising: detecting a second anomaly based on the sampled metric data; employing the clustering to determine a second subset of services that are affected by the second anomaly; and applying the at least one ranking algorithm to the second subset of services to determiner at least one service that is the root cause of the second anomaly.

11. The system as recited in claim 1, wherein: at least some of the services of the plurality of services are provided by containerized applications executed on a plurality of service computing devices; and respective service computing devices of the plurality of service computing devices each execute one or more of the containerized applications, each containerized application providing a respective service of the plurality of services, and configured to communicate with other containerized applications on at least one of a same service computing device or on a different service computing device of the plurality of service computing devices.

12. A method comprising: receiving, by one or more processors, sampled metric data from a services system that includes a plurality of service computing devices configured to provide a plurality of services; generating, by the one or more processors, times series data based on the sampled metric data to detect an anomaly based on the sampled metric data; employing, by the one or more processors, clustering based at least on detecting the anomaly to determine a subset of services that are affected by the anomaly, wherein the subset of services is a subset of the plurality of services; and applying, by the one or more processors, at least one ranking algorithm to the subset of services for determining one or more services of the subset of services as a root cause of the anomaly.

13. The method as recited in claim 12, further comprising: determining based at least on services call information, interdependencies between the services in the subset of services; generating a graph data structure corresponding to the interdependencies between the services in the subset of services; and selecting a service of the subset of services as the root cause of the anomaly based at least in part on the graph data structure.

14. One or more non-transitory computer-readable media storing one or more programs executable by a computing device to configure the computing device to perform operations comprising: receiving sampled metric data from a services system that includes a plurality of service computing devices configured to provide a plurality of services; generating time series data based on the sampled metric data to detect an anomaly based on the sampled metric data; employing clustering based at least on detecting the anomaly to determine a subset of services that are affected by the anomaly, wherein the subset of services is a subset of the plurality of services; and applying at least one ranking algorithm to the subset of services to determine one or more services of the subset of services as a root cause of the anomaly.

15. The one or more non-transitory computer-readable media as recited in claim 14, the operations further comprising: determining based at least on services call information, interdependencies between the services in the subset of services; generating a graph data structure corresponding to the interdependencies between the services in the subset of services; and selecting a service of the subset of services as the root cause of the anomaly based at least in part on the graph data structure.