CI: Add [runner_system_failure] and [stuck_or_timeout_failure] retry handlers to the default section
Summary
This MR introduces a default retry policy at the global handler level to support the transition to GCP spot instances for runners with the gcp tag. The default retry configuration will automatically retry jobs that fail due to runner_system_failure and stuck_or_timeout_failure, which are common failure modes when using spot instances.
Changes
Default Retry Policy
Added default retry configuration that applies to all jobs:
- max: 2 retries
- Retry on: stuck_or_timeout_failure and runner_system_failure
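For reference, a default retry block of this shape could look roughly like the sketch below. It uses the standard GitLab CI `default` and `retry` keywords with the values listed in this description. The block actually emitted by the pipeline generator may be laid out differently.

```yaml
# Sketch of the default section described in this MR (layout assumed).
default:
  retry:
    max: 2                          # retry a failed job up to two more times
    when:
      - runner_system_failure       # e.g. the runner's spot instance was preempted
      - stuck_or_timeout_failure    # e.g. the job got stuck or timed out waiting
```

Jobs that do not define their own `retry` keyword pick up this policy automatically.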
Cleanup of Redundant Retry Configurations
- Removed redundant retry policies from individual CI jobs where they duplicate the new default behavior
- Retained specific retry configurations only where jobs need different retry behavior than the default
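As an illustration of a retained override, the sketch below shows a job that keeps its own retry policy. The job name, script, and retry values are hypothetical and are not taken from this MR. In GitLab CI, a job-level `retry` replaces the default for that job rather than merging with it, so any failure types from the default that should still apply must be listed again.

```yaml
# Hypothetical job that needs different retry behavior than the default.
flaky_integration_test:
  script:
    - ./run_integration_tests.sh    # placeholder command
  retry:
    max: 1                          # fewer attempts than the default
    when:
      - script_failure              # job-specific condition not covered by the default
      - runner_system_failure       # repeated here because the default is not merged in
```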
Rationale
With the transition to GCP spot instances for cost optimization, we need to handle the increased likelihood of infrastructure-related failures gracefully. Spot instances can be preempted at any time, leading to runner_system_failure scenarios. Additionally, stuck_or_timeout_failure can occur during spot instance provisioning delays.
Setting these retry policies at the default level provides:
- Consistency: All jobs automatically get appropriate retry behavior for spot instance failures
- Maintenance: Reduces the need to manually add retry policies to each job
- Resilience: Improves pipeline reliability during the GCP spot instance transition
- Clean Code: Eliminates redundant retry configurations throughout the CI configuration
Testing
- CI pipeline generates correctly with the new default configuration
- Jobs that previously had matching retry policies now inherit from default
- Jobs with specific retry requirements maintain their custom configurations
- Generated YAML file contains the expected default retry block
Edited by Neo