Chen et al., 2013 - Google Patents
Predicting job completion times using system logs in supercomputing clustersChen et al., 2013
- Document ID
- 11392063639617213899
- Author
- Chen X
- Lu C
- Pattabiraman K
- Publication year
- Publication venue
- 2013 43rd Annual IEEE/IFIP Conference on Dependable Systems and Networks Workshop (DSN-W)
External Links
Snippet
Most large systems such as HPC/cloud computing clusters and data centers are built from commercial off-the-shelf components. System logs are usually the main source of choice to gain insights into the system issues. Therefore, mining logs to diagnose anomalies has been …
- 238000004458 analytical method 0 abstract description 9
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for programme control, e.g. control unit
- G06F9/06—Arrangements for programme control, e.g. control unit using stored programme, i.e. using internal store of processing equipment to receive and retain programme
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Programme initiating; Programme switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
- G06F9/4881—Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/34—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
- G06F11/3409—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/34—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
- G06F11/3466—Performance evaluation by tracing or monitoring
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3003—Monitoring arrangements specially adapted to the computing system or computing system component being monitored
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/36—Preventing errors by testing or debugging software
- G06F11/3604—Software analysis for verifying properties of programs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Error detection; Error correction; Monitoring responding to the occurence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/36—Preventing errors by testing or debugging software
- G06F11/362—Software debugging
- G06F11/3636—Software debugging by tracing the execution of the program
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F2201/00—Indexing scheme relating to error detection, to error correction, and to monitoring
- G06F2201/86—Event-based monitoring
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06Q—DATA PROCESSING SYSTEMS OR METHODS, SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL, SUPERVISORY OR FORECASTING PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL, SUPERVISORY OR FORECASTING PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management, e.g. organising, planning, scheduling or allocating time, human or machine resources; Enterprise planning; Organisational models
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06N—COMPUTER SYSTEMS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N99/00—Subject matter not provided for in other groups of this subclass
- G06N99/005—Learning machines, i.e. computer in which a programme is changed according to experience gained by the machine itself during a complete run
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Chen et al. | Predicting job completion times using system logs in supercomputing clusters | |
Gao et al. | Task failure prediction in cloud data centers using deep learning | |
US10515002B2 (en) | Utilizing artificial intelligence to test cloud applications | |
CN105283848B (en) | Application Tracing with Distributed Objects | |
Salfner et al. | A survey of online failure prediction methods | |
CN105283866B (en) | Application tracking method and system including optimization analysis using similar frequencies | |
Islam et al. | Predicting application failure in cloud: A machine learning approach | |
Chen et al. | Failure prediction of jobs in compute clouds: A google cluster case study | |
US8862727B2 (en) | Problem determination and diagnosis in shared dynamic clouds | |
US8997063B2 (en) | Periodicity optimization in an automated tracing system | |
US8566803B2 (en) | Benchmark profiling for distributed systems | |
US20150347273A1 (en) | Deploying Trace Objectives Using Cost Analyses | |
Minet et al. | Analyzing traces from a google data center | |
EP4052125B1 (en) | Mitigating slow instances in large-scale streaming pipelines | |
Cremonesi et al. | Indirect estimation of service demands in the presence of structural changes | |
Maroulis et al. | A holistic energy-efficient real-time scheduler for mixed stream and batch processing workloads | |
Prats et al. | Automatic generation of workload profiles using unsupervised learning pipelines | |
CN111769974A (en) | A method for fault diagnosis of cloud system | |
Ouyang et al. | An approach for modeling and ranking node-level stragglers in cloud datacenters | |
Ouyang et al. | ML-NA: A machine learning based node performance analyzer utilizing straggler statistics | |
Rao et al. | Online measurement of the capacity of multi-tier websites using hardware performance counters | |
US20250036971A1 (en) | Managing data processing system failures using hidden knowledge from predictive models | |
Liang | Data Characterization and Anomaly Detection for HPC Datacenters Using Machine Learning | |
Caton et al. | Dynamic model evaluation to accelerate distributed machine learning | |
Yang et al. | An Improved Linux Priority Scheduling Method Based on XGBoost |