[go: up one dir, main page]

Unset SidekiqStatus on job failure

What does this MR do and why?

This MR fixes a critical bug in Gitlab::SidekiqStatus::ServerMiddleware where failed jobs were not cleaning up their status keys in Redis, causing SidekiqStatus.running?(job_id) to incorrectly return true for up to 30 minutes after job failure.

The Problem: When Sidekiq jobs raised exceptions, the SidekiqStatus.unset call was never reached, leaving status keys orphaned in Redis until their TTL expired (30 minutes by default).

The Solution: Refactored the middleware to use an ensure block, guaranteeing status cleanup regardless of whether the job succeeds or fails.

References

Issue: #524761

How to set up and validate locally

  • On master, raise an error in the perform method of any job having a status_expiration sidekiq option. e.g in RepositoryImportWorker :
class RepositoryImportWorker # rubocop:disable Scalability/IdempotentWorker
  include ApplicationWorker

  data_consistency :always
  include ExceptionBacktrace
  include ProjectStartImport
  include Sidekiq::InterruptionsExhausted

  feature_category :importers
  worker_has_external_dependencies!
  # Do not retry on Import/Export until https://gitlab.com/gitlab-org/gitlab/-/issues/16812 is solved.
  sidekiq_options retry: false, dead: false
  sidekiq_options status_expiration: Gitlab::Import::StuckImportJob::IMPORT_JOBS_EXPIRATION
  worker_resource_boundary :memory

  sidekiq_interruptions_exhausted do |job|
    new.perform_failure(job['args'].first)
  end

  def perform(project_id)
    raise 'test'

    Gitlab::QueryLimiting.disable!('https://gitlab.com/gitlab-org/gitlab/-/issues/464677')
    
    ...
  • Start a rails console
  • Find a Project or create one:
project_id = Project.first.id
  • Enqueue the job:
job_id = RepositoryImportWorker.perform_async(project_id)
  • Observe after a few seconds that the Gitlab::SidekiqStatus.running?(job_id) will return true even though the job has failed.
  • Switch to that branch, and this time call the sleep method right before raising in the perform method:
  def perform(project_id)
    sleep 10
    raise 'test'

    Gitlab::QueryLimiting.disable!('https://gitlab.com/gitlab-org/gitlab/-/issues/464677')
    
    ...
  • Restart the rails-background-jobs service and the rails console, and enqueue the job one more time.
  • Observe immediately after calling the job that Gitlab::SidekiqStatus.running?(job_id) returns true.
  • Wait 10 seconds, and call Gitlab::SidekiqStatus.running?(job_id) again. It should now return false, which is the new intended behaviour as the job failed.

MR acceptance checklist

Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Victor Prêté

Merge request reports

Loading