Unset SidekiqStatus on job failure
What does this MR do and why?
This MR fixes a critical bug in `Gitlab::SidekiqStatus::ServerMiddleware` where failed jobs were not cleaning up their status keys in Redis, causing `SidekiqStatus.running?(job_id)` to incorrectly return `true` for up to 30 minutes after job failure.
The Problem: When Sidekiq jobs raised exceptions, the `SidekiqStatus.unset` call was never reached, leaving status keys orphaned in Redis until their TTL expired (30 minutes by default).
The Solution: Refactored the middleware to use an `ensure` block, guaranteeing status cleanup regardless of whether the job succeeds or fails.
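The shape of the change is roughly the following. This is a minimal illustrative sketch, not the exact diff; it assumes the middleware yields to the job and unsets the status by `jid`, and the real `Gitlab::SidekiqStatus::ServerMiddleware` carries more detail:

```ruby
# Illustrative sketch only, not the exact code in this MR.
module Gitlab
  module SidekiqStatus
    class ServerMiddleware
      # Before: `unset` only ran when `yield` returned normally, so a job that
      # raised left its status key in Redis until the TTL expired.
      #
      #   def call(_worker, job, _queue)
      #     ret = yield
      #     Gitlab::SidekiqStatus.unset(job['jid'])
      #     ret
      #   end

      # After: `ensure` guarantees the status key is removed whether the job
      # succeeds or raises.
      def call(_worker, job, _queue)
        yield
      ensure
        Gitlab::SidekiqStatus.unset(job['jid'])
      end
    end
  end
end
```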
References
Issue: #524761
How to set up and validate locally
- On `master`, raise an error in the `perform` method of any job that has a `status_expiration` Sidekiq option, e.g. in `RepositoryImportWorker`:
  ```ruby
  class RepositoryImportWorker # rubocop:disable Scalability/IdempotentWorker
    include ApplicationWorker

    data_consistency :always

    include ExceptionBacktrace
    include ProjectStartImport
    include Sidekiq::InterruptionsExhausted

    feature_category :importers
    worker_has_external_dependencies!

    # Do not retry on Import/Export until https://gitlab.com/gitlab-org/gitlab/-/issues/16812 is solved.
    sidekiq_options retry: false, dead: false
    sidekiq_options status_expiration: Gitlab::Import::StuckImportJob::IMPORT_JOBS_EXPIRATION
    worker_resource_boundary :memory

    sidekiq_interruptions_exhausted do |job|
      new.perform_failure(job['args'].first)
    end

    def perform(project_id)
      raise 'test'

      Gitlab::QueryLimiting.disable!('https://gitlab.com/gitlab-org/gitlab/-/issues/464677')
      ...
  ```
- Start a rails console
- Find a `Project` or create one:

  ```ruby
  project_id = Project.first.id
  ```
- Enqueue the job:
  ```ruby
  job_id = RepositoryImportWorker.perform_async(project_id)
  ```
- Observe that, after a few seconds, `Gitlab::SidekiqStatus.running?(job_id)` still returns `true` even though the job has failed.
- Switch to this MR's branch, and this time call `sleep` right before raising in the `perform` method:
  ```ruby
  def perform(project_id)
    sleep 10
    raise 'test'

    Gitlab::QueryLimiting.disable!('https://gitlab.com/gitlab-org/gitlab/-/issues/464677')
    ...
  ```
- Restart the `rails-background-jobs` service and the Rails console, then enqueue the job one more time.
- Observe that, immediately after enqueuing the job, `Gitlab::SidekiqStatus.running?(job_id)` returns `true`.
- Wait 10 seconds and call `Gitlab::SidekiqStatus.running?(job_id)` again. It should now return `false`, which is the new intended behaviour since the job failed (see the console sketch below).
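For reference, the end-to-end check in the Rails console looks roughly like this. It is a sketch that assumes the modified `RepositoryImportWorker` above (the `sleep 10` + `raise 'test'` variant) is being picked up by a running Sidekiq process:

```ruby
# Rails console sketch; timings are approximate.
project_id = Project.first.id
job_id = RepositoryImportWorker.perform_async(project_id)

Gitlab::SidekiqStatus.running?(job_id) # => true while the job is still sleeping

sleep 15 # give the job time to raise after its 10-second sleep

Gitlab::SidekiqStatus.running?(job_id)
# => false on this branch (the status key is unset in the `ensure` block)
# => true on master, until the status_expiration TTL (30 minutes by default) elapses
```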
MR acceptance checklist
Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.