Blohowiak et al., 2016 - Google Patents
A platform for automating chaos experimentsBlohowiak et al., 2016
View PDF- Document ID
- 15044516274917100246
- Author
- Blohowiak A
- Basiri A
- Hochstein L
- Rosenthal C
- Publication year
- Publication venue
- 2016 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW)
External Links
Snippet
The Netflix video streaming system is composed of many interacting services. In such a large system, failures in individual services are not uncommon. This paper describes the Chaos Automation Platform, a system for running failure injection experiments on the …
- 238000004519 manufacturing process 0 abstract description 6
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Error detection; Error correction; Monitoring responding to the occurence of a fault, e.g. fault tolerance
- G06F11/16—Error detection or correction of the data by redundancy in hardware
- G06F11/20—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
- G06F11/2097—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements maintaining the standby controller/processing unit updated
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Error detection; Error correction; Monitoring responding to the occurence of a fault, e.g. fault tolerance
- G06F11/16—Error detection or correction of the data by redundancy in hardware
- G06F11/20—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
- G06F11/202—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
- G06F11/2023—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant details of failing over
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Error detection; Error correction; Monitoring responding to the occurence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0706—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
- G06F11/0715—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a system implementing multitasking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Error detection; Error correction; Monitoring responding to the occurence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0793—Remedial or corrective actions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Error detection; Error correction; Monitoring responding to the occurence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0751—Error or fault detection not based on redundancy
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Error detection; Error correction; Monitoring responding to the occurence of a fault, e.g. fault tolerance
- G06F11/08—Error detection or correction by redundancy in data representation, e.g. by using checking codes
- G06F11/10—Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/36—Preventing errors by testing or debugging software
- G06F11/3668—Software testing
- G06F11/3672—Test management
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Error detection; Error correction; Monitoring responding to the occurence of a fault, e.g. fault tolerance
- G06F11/14—Error detection or correction of the data by redundancy in operation
- G06F11/1402—Saving, restoring, recovering or retrying
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for programme control, e.g. control unit
- G06F9/06—Arrangements for programme control, e.g. control unit using stored programme, i.e. using internal store of processing equipment to receive and retain programme
- G06F9/46—Multiprogramming arrangements
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance or administration or management of packet switching networks
- H04L41/06—Arrangements for maintenance or administration or management of packet switching networks involving management of faults or events or alarms
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F2201/00—Indexing scheme relating to error detection, to error correction, and to monitoring
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L69/00—Application independent communication protocol aspects or techniques in packet data networks
- H04L69/40—Techniques for recovering from a failure of a protocol instance or entity, e.g. failover routines, service redundancy protocols, protocol state redundancy or protocol service redirection in case of a failure or disaster recovery
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L65/00—Network arrangements or protocols for real-time communications
- H04L65/80—QoS aspects
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Blohowiak et al. | A platform for automating chaos experiments | |
Gunawi et al. | Why does the cloud stop computing? lessons from hundreds of service outages | |
AU2017336644B2 (en) | Detecting service vulnerabilities in a distributed computing system | |
Huang et al. | Capturing and enhancing in situ system observability for failure detection | |
US9842045B2 (en) | Failure recovery testing framework for microservice-based applications | |
Chérèque et al. | Active replication in Delta-4 | |
US8812911B2 (en) | Distributed testing of a software platform | |
US11182253B2 (en) | Self-healing system for distributed services and applications | |
US10025678B2 (en) | Method and system for automatically detecting and resolving infrastructure faults in cloud infrastructure | |
WO2021108452A2 (en) | Systems and methods for enabling a highly available managed failover service | |
US20060271813A1 (en) | Systems and methods for message handling among redunant application servers | |
US20060020858A1 (en) | Method and system for minimizing loss in a computer application | |
US11366728B2 (en) | Systems and methods for enabling a highly available managed failover service | |
Joshi et al. | SETSUDŌ: Perturbation-based testing framework for scalable distributed systems | |
US20030217155A1 (en) | Send of software tracer messages via IP from several sources to be stored by a remote server | |
Goundar et al. | Efficient fault tolerance on cloud environments | |
Karn et al. | Automated testing and resilience of microservice’s network-link using istio service mesh | |
Wu et al. | An extensible fault tolerance testing framework for microservice-based cloud applications | |
CN110581785A (en) | A reliability evaluation method and device | |
Leesatapornwongsa et al. | The case for drill-ready cloud computing | |
US20180359317A1 (en) | System and method for non-intrusive context correlation across cloud services | |
US20150205675A1 (en) | Method and System for Improving Reliability of a Background Page | |
Kanso et al. | Enhancing openstack fault tolerance for provisioning computing environments | |
Costa et al. | Chrysaor: Fine-grained, fault-tolerant cloud-of-clouds mapreduce | |
Makhsous et al. | High available deployment of cloud-based virtualized network functions |