Averting excessive oopses
Posted Nov 29, 2022 11:27 UTC (Tue) by farnz (subscriber, #17727)
In reply to: Averting excessive oopses by jccleaver
Parent article: Averting excessive oopses
With a fleet of cattle, as opposed to a fleet of pets, you don't presume that all wear and tear is identical. You have monitoring that measures the state of each machine individually and acts on what it finds. For example, your monitoring detects that machine 10,994,421 is frequently throwing ECC corrected errors, so it migrates all jobs off that machine and sends it off for repair. Or it detects filesystem corruption, migrates everything off, and then sends that machine for repair.
The key is that automation handles everything if you have cattle, not pets. Instead of knowing that http10294 is a unique snowflake with its own particular problems and fixes, you have your automation look at the state of all hardware, then identify and deal with the machines' problems en masse.
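As a rough illustration of that kind of per-machine triage, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `MachineHealth` record, the `ECC_CORRECTED_PER_DAY_LIMIT` threshold, and the `triage` function are hypothetical names, and a real fleet would feed this from its own monitoring and hand the result to its own job scheduler.

```python
from dataclasses import dataclass

# Hypothetical threshold; a real fleet would tune this per hardware type.
ECC_CORRECTED_PER_DAY_LIMIT = 100


@dataclass
class MachineHealth:
    """Per-machine state as reported by (assumed) fleet monitoring."""
    machine_id: int
    ecc_corrected_per_day: int
    filesystem_corrupt: bool


def needs_repair(machine: MachineHealth) -> bool:
    """Apply the same rules to every machine; no per-host special cases."""
    return (
        machine.filesystem_corrupt
        or machine.ecc_corrected_per_day > ECC_CORRECTED_PER_DAY_LIMIT
    )


def triage(fleet: list[MachineHealth]) -> list[int]:
    """Return the ids of machines to drain of jobs and send for repair."""
    return [m.machine_id for m in fleet if needs_repair(m)]


# Made-up sample data, purely for illustration.
fleet = [
    MachineHealth(10_994_420, ecc_corrected_per_day=0, filesystem_corrupt=False),
    MachineHealth(10_994_421, ecc_corrected_per_day=2_500, filesystem_corrupt=False),
    MachineHealth(10_994_422, ecc_corrected_per_day=3, filesystem_corrupt=True),
]

to_repair = triage(fleet)
print(to_repair)
```

The point of the sketch is that the rules are applied uniformly to measured state, so nobody has to remember that one particular host is special.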