Batch SyncPolicyWorker with delays to prevent worker saturation
What does this MR do and why?
When a security policy is created, updated, or deleted in a large namespace with many projects, the worker immediately enqueues sync jobs for all affected projects. This causes a spike in queue load that can saturate the worker pool and degrade system performance.
This merge request adds batching and staggered delays to SyncPolicyWorker, so that sync jobs for a large namespace are spread over time instead of hitting the worker queue all at once.
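Delay Calculation Table (batches of 100 projects, each batch delayed 1 second longer than the previous one; Total Delay is the delay of the last batch):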
| Project Count | Batch Count | Total Delay |
|---|---|---|
| 1 | 1 | 1 second |
| 10 | 1 | 1 second |
| 100 | 1 | 1 second |
| 1,000 | 10 | 10 seconds |
| 10,000 | 100 | 100 seconds (1 min 40 sec) |
| 50,000 | 500 | 500 seconds (8 min 20 sec) |
| 100,000 | 1,000 | 1,000 seconds (16 min 40 sec) |
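The rows above can be reproduced with a small calculation. This is a minimal sketch, not the code from this MR: the constant names and the `batch_plan` helper are hypothetical, and the values (100 projects per batch, 1 second per batch) are inferred from the table.

```ruby
# Assumed values, inferred from the table: 100 projects per batch and
# one additional second of delay per batch. Names are hypothetical.
PROJECTS_PER_BATCH = 100
SECONDS_PER_BATCH = 1

# Returns [batch_count, delay_of_last_batch_in_seconds] for a project count.
def batch_plan(project_count)
  # Integer ceiling division: 1..100 projects => 1 batch, 101..200 => 2, ...
  batch_count = (project_count + PROJECTS_PER_BATCH - 1) / PROJECTS_PER_BATCH
  [batch_count, batch_count * SECONDS_PER_BATCH]
end

[1, 10, 100, 1_000, 10_000, 50_000, 100_000].each do |count|
  batches, delay = batch_plan(count)
  puts "#{count} projects: #{batches} batch(es), last batch delayed #{delay}s"
end
```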
Example Breakdown for 10,000 projects:
- Batch 1 (projects 1-100): Delayed by 1 second
- Batch 2 (projects 101-200): Delayed by 2 seconds
- Batch 3 (projects 201-300): Delayed by 3 seconds
- ...
- Batch 100 (projects 9,901-10,000): Delayed by 100 seconds
This staggered approach distributes the worker load over time, preventing queue saturation while ensuring all projects are eventually synced. Because batching adds delay to policy sync in large namespaces, this change is introduced behind the security_policies_batched_sync_delay feature flag.
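The staggering described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the implementation in this MR: `StaggeredSyncSketch`, its constants, and the `batched:` argument (standing in for the feature flag check) are all hypothetical, and the sketch only computes the schedule rather than enqueuing real jobs.

```ruby
# Hypothetical stand-in for the worker's scheduling logic.
module StaggeredSyncSketch
  BATCH_SIZE = 100        # projects per batch (assumed from the example)
  DELAY_STEP_SECONDS = 1  # extra delay added per batch (assumed)

  # Returns an array of [delay_in_seconds, project_ids] pairs.
  # `batched: false` models the feature flag being disabled: all projects
  # are scheduled immediately in a single group, as before this change.
  def self.schedule(project_ids, batched: true)
    return [[0, project_ids]] unless batched

    project_ids.each_slice(BATCH_SIZE).each_with_index.map do |batch, index|
      # Batch 1 waits 1 step, batch 2 waits 2 steps, and so on.
      [(index + 1) * DELAY_STEP_SECONDS, batch]
    end
  end
end

plan = StaggeredSyncSketch.schedule((1..10_000).to_a)
first_delay, first_batch = plan.first
last_delay, last_batch = plan.last
puts "batch 1: projects #{first_batch.first}-#{first_batch.last}, delay #{first_delay}s"
puts "batch #{plan.size}: projects #{last_batch.first}-#{last_batch.last}, delay #{last_delay}s"
```

In a real worker, each pair would be handed to the job queue with its delay (for example through Sidekiq's `perform_in`) instead of being returned as data.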
References
MR acceptance checklist
Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.
Related to #580036