Blohowiak et al., 2016 - Google Patents

A platform for automating chaos experiments

Blohowiak et al., 2016

Document ID: 15044516274917100246
Author: Blohowiak A; Basiri A; Hochstein L; Rosenthal C
Publication year: 2016
Publication venue: 2016 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW)

External Links

Cited by

Snippet

The Netflix video streaming system is composed of many interacting services. In such a large system, failures in individual services are not uncommon. This paper describes the Chaos Automation Platform, a system for running failure injection experiments on the …

Continue reading at arxiv.org (PDF) (other versions)

238000004519 manufacturing process 0 abstract description 6

Classifications

- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Error detection; Error correction; Monitoring responding to the occurence of a fault, e.g. fault tolerance
- G06F11/16—Error detection or correction of the data by redundancy in hardware
- G06F11/20—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
- G06F11/2097—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements maintaining the standby controller/processing unit updated
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Error detection; Error correction; Monitoring responding to the occurence of a fault, e.g. fault tolerance
- G06F11/16—Error detection or correction of the data by redundancy in hardware
- G06F11/20—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
- G06F11/202—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
- G06F11/2023—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant details of failing over
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Error detection; Error correction; Monitoring responding to the occurence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0706—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
- G06F11/0715—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a system implementing multitasking
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Error detection; Error correction; Monitoring responding to the occurence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0793—Remedial or corrective actions
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Error detection; Error correction; Monitoring responding to the occurence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0751—Error or fault detection not based on redundancy
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Error detection; Error correction; Monitoring responding to the occurence of a fault, e.g. fault tolerance
- G06F11/08—Error detection or correction by redundancy in data representation, e.g. by using checking codes
- G06F11/10—Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/36—Preventing errors by testing or debugging software
- G06F11/3668—Software testing
- G06F11/3672—Test management
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Error detection; Error correction; Monitoring responding to the occurence of a fault, e.g. fault tolerance
- G06F11/14—Error detection or correction of the data by redundancy in operation
- G06F11/1402—Saving, restoring, recovering or retrying
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for programme control, e.g. control unit
- G06F9/06—Arrangements for programme control, e.g. control unit using stored programme, i.e. using internal store of processing equipment to receive and retain programme
- G06F9/46—Multiprogramming arrangements
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance or administration or management of packet switching networks
- H04L41/06—Arrangements for maintenance or administration or management of packet switching networks involving management of faults or events or alarms
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F2201/00—Indexing scheme relating to error detection, to error correction, and to monitoring
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L69/00—Application independent communication protocol aspects or techniques in packet data networks
- H04L69/40—Techniques for recovering from a failure of a protocol instance or entity, e.g. failover routines, service redundancy protocols, protocol state redundancy or protocol service redirection in case of a failure or disaster recovery
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L65/00—Network arrangements or protocols for real-time communications
- H04L65/80—QoS aspects

Similar Documents

Publication	Publication Date	Title
Blohowiak et al.	2016	A platform for automating chaos experiments
Gunawi et al.	2016	Why does the cloud stop computing? lessons from hundreds of service outages
AU2017336644B2 (en)	2020-01-30	Detecting service vulnerabilities in a distributed computing system
Huang et al.	2018	Capturing and enhancing in situ system observability for failure detection
US9842045B2 (en)	2017-12-12	Failure recovery testing framework for microservice-based applications
Chérèque et al.	1992	Active replication in Delta-4
US8812911B2 (en)	2014-08-19	Distributed testing of a software platform
US11182253B2 (en)	2021-11-23	Self-healing system for distributed services and applications
US10025678B2 (en)	2018-07-17	Method and system for automatically detecting and resolving infrastructure faults in cloud infrastructure
WO2021108452A2 (en)	2021-06-03	Systems and methods for enabling a highly available managed failover service
US20060271813A1 (en)	2006-11-30	Systems and methods for message handling among redunant application servers
US20060020858A1 (en)	2006-01-26	Method and system for minimizing loss in a computer application
US11366728B2 (en)	2022-06-21	Systems and methods for enabling a highly available managed failover service
Joshi et al.	2013	SETSUDŌ: Perturbation-based testing framework for scalable distributed systems
US20030217155A1 (en)	2003-11-20	Send of software tracer messages via IP from several sources to be stored by a remote server
Goundar et al.	2021	Efficient fault tolerance on cloud environments
Karn et al.	2022	Automated testing and resilience of microservice’s network-link using istio service mesh
Wu et al.	2018	An extensible fault tolerance testing framework for microservice-based cloud applications
CN110581785A (en)	2019-12-17	A reliability evaluation method and device
Leesatapornwongsa et al.	2014	The case for drill-ready cloud computing
US20180359317A1 (en)	2018-12-13	System and method for non-intrusive context correlation across cloud services
US20150205675A1 (en)	2015-07-23	Method and System for Improving Reliability of a Background Page
Kanso et al.	2017	Enhancing openstack fault tolerance for provisioning computing environments
Costa et al.	2017	Chrysaor: Fine-grained, fault-tolerant cloud-of-clouds mapreduce
Makhsous et al.	2016	High available deployment of cloud-based virtualized network functions