Create low-priority maintenance queue for cleanup and audit tasks

Problem

Currently, maintenance and cleanup tasks are mixed with business-critical operations in various queues, particularly in the cron queue. These tasks include:

  • Data cleanup operations
  • Audit logging and verification
  • Non-critical synchronization
  • Archive operations
  • Test data cleanup
  • Weekly/monthly reports

Issues:

  1. Maintenance tasks can delay business-critical operations
  2. No clear separation between operational and maintenance work
  3. Difficult to schedule maintenance during low-traffic periods
  4. Can't easily throttle or pause maintenance work during incidents

Proposal

Create a dedicated low-priority queue for maintenance, cleanup, and audit tasks that can run when system resources are available.

New Queue: maintenance (weight 1-2)

Purpose: Non-urgent background tasks that improve system health but don't directly impact customers

Characteristics:

  • Lowest priority, or second-lowest if it sits just above the action_mailbox queues
  • Can be paused during incidents without customer impact
  • Ideal for running during off-peak hours
  • Should not block any customer-facing operations
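
To get a feel for what a weight of 1-2 means in practice, the sketch below computes each queue's expected share of first picks. It assumes the worker polls queues in weighted random order, so that the chance of a queue being checked first is proportional to its weight. The `critical` and `default` queues and their weights are hypothetical placeholders; only `maintenance` and the two action_mailbox queues come from this proposal.

```ruby
# Sketch only: "critical" and "default" (and their weights) are hypothetical
# stand-ins for whatever higher-priority queues already exist.
QUEUE_WEIGHTS = {
  "critical" => 10,
  "default" => 5,
  "maintenance" => 2,
  "action_mailbox_routing" => 1,
  "action_mailbox_incineration" => 1
}.freeze

# Returns each queue's expected share of "checked first" polls, assuming the
# chance of being checked first is proportional to the queue's weight.
def first_pick_share(weights)
  total = weights.values.sum.to_f
  weights.transform_values { |weight| (weight / total).round(3) }
end

# Print the shares as percentages, one queue per line.
first_pick_share(QUEUE_WEIGHTS).each do |queue, share|
  puts format("%-28s %5.1f%%", queue, share * 100)
end
```

Under these assumed weights, maintenance would be checked first on roughly one poll in ten. A low weight reduces how often the queue is favoured; it does not by itself stop maintenance jobs from running while higher-priority work is waiting.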

Jobs to Move Here

From cron queue:

  • Quality::TestAccountCleanupCronJob - Test data cleanup
  • Cron::Zuora::LocalCopyAuditJob - Data consistency audits
  • Cron::ErrorMonitorings::WeeklyReportJob - Weekly reporting
  • AuditProvisionsCronJob - Provision auditing

From other queues (if applicable):

  • Data archival jobs
  • Log cleanup operations
  • Stale record cleanup
  • Database maintenance tasks
  • Cache warming operations (non-critical)

Future additions:

  • Any new audit or cleanup jobs
  • Performance optimization tasks
  • Data quality checks
  • Metrics aggregation (non-real-time)

Benefits

  1. Better resource utilization: Maintenance runs when system has capacity
  2. Improved reliability: Critical operations no longer wait in the same queue behind cleanup tasks
  3. Easier incident management: Can pause maintenance queue during incidents
  4. Clear separation: Obvious distinction between operational and maintenance work
  5. Flexible scheduling: Can adjust maintenance queue processing based on load

Implementation Steps

  1. Identify all maintenance tasks:

    • Audit current cron jobs
    • Search for cleanup/audit jobs in codebase
    • Categorize by urgency and customer impact
  2. Create base job class:

    # app/jobs/maintenance/base_job.rb
    module Maintenance
      class BaseJob < ApplicationJob
        queue_as :maintenance
        
        # Common configuration for maintenance jobs
        # - Lower retry attempts
        # - Longer timeouts acceptable
        # - Can be safely discarded if queue too deep
      end
    end
  3. Update job classes:

    # Example: Test cleanup
    class Quality::TestAccountCleanupCronJob < Maintenance::BaseJob
      def perform
        # Cleanup logic
      end
    end
  4. Update config/sidekiq.yml:

    :queues:
      # ... higher priority queues ...
      - [maintenance, 2]  # or 1, depending on action_mailbox priority
      - [action_mailbox_routing, 1]
      - [action_mailbox_incineration, 1]
  5. Add queue management:

    # Ability to pause/resume maintenance queue
    # Useful during incidents or high-load periods
    module MaintenanceQueue
      def self.pause!
        # Pause processing
      end
      
      def self.resume!
        # Resume processing
      end
    end
  6. Document guidelines:

    • When to use maintenance queue
    • How to pause/resume during incidents
    • Expected SLAs (can be hours or days)
    • Examples of maintenance vs. operational jobs
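
As one possible way to fill in the `pause!`/`resume!` stubs from step 5, the sketch below keeps a pause flag in a pluggable store and lets a job check it before doing any work. Everything here is an assumption made for illustration: the key name, the `store` accessor, and the `run_maintenance_work` helper are hypothetical, and the in-memory Hash only works within a single process. A real deployment would need a store shared by all workers, or a worker-level queue-pausing feature if one is available.

```ruby
# Sketch only: a pause flag kept in a pluggable store. The Hash default works
# for a single process; a shared store would be needed across workers.
module MaintenanceQueue
  PAUSE_KEY = "maintenance_queue:paused" # hypothetical key name

  class << self
    # Any object responding to []=, [] and delete (a Hash by default).
    attr_writer :store

    def store
      @store ||= {}
    end

    def pause!
      store[PAUSE_KEY] = true
    end

    def resume!
      store.delete(PAUSE_KEY)
    end

    def paused?
      store[PAUSE_KEY] == true
    end
  end
end

# Hypothetical helper: runs the given block unless the queue is paused.
# Note that this skips the work instead of deferring it.
def run_maintenance_work
  return :skipped if MaintenanceQueue.paused?

  yield
  :done
end

MaintenanceQueue.pause!
run_maintenance_work { puts "cleaning up" } # returns :skipped, prints nothing
MaintenanceQueue.resume!
run_maintenance_work { puts "cleaning up" } # prints "cleaning up", returns :done
```

Skipping is the simplest behaviour to show; re-enqueueing the job for later, or stopping workers from fetching from the queue at all, would avoid losing work during a long pause.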

Queue Assignment Guidelines

Use maintenance queue for:

  • Data cleanup (old records, test data)
  • Audit and verification tasks
  • Non-critical synchronization
  • Report generation (weekly, monthly)
  • Archive operations
  • Performance optimization tasks
  • Data quality checks

Do NOT use maintenance queue for:

  • Customer-facing operations
  • Revenue-impacting tasks
  • Time-sensitive notifications
  • Real-time synchronization
  • Security-critical operations
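
The two lists above can be read as a simple rule: a job belongs in the maintenance queue only if none of the "do NOT use" criteria apply. A minimal sketch of that rule follows; the trait names and the `:operational` return value are hypothetical labels for illustration, not attributes of any existing job class or the name of a real queue.

```ruby
# Sketch only: these symbols are hypothetical labels for the
# "Do NOT use maintenance queue for" criteria listed above.
DISQUALIFYING_TRAITS = %i[
  customer_facing
  revenue_impacting
  time_sensitive
  real_time_sync
  security_critical
].freeze

# Returns :maintenance only when none of the disqualifying traits apply;
# :operational stands in for whichever existing queue fits the job.
def suggested_queue(traits)
  blockers = DISQUALIFYING_TRAITS & traits
  blockers.empty? ? :maintenance : :operational
end

suggested_queue([])                 # e.g. a weekly report with no blockers
suggested_queue(%i[time_sensitive]) # e.g. a notification that must go out promptly
```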

Monitoring and Alerting

Metrics to track:

  • Queue depth (alert if > 1000 jobs)
  • Job age (alert if oldest job > 7 days)
  • Failure rate (alert if > 10%)
  • Processing rate (jobs per hour)
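
The alert thresholds above could be checked with something like the following sketch. The thresholds come from the list; the method name, argument names, and alert symbols are hypothetical, and how the inputs are collected (queue size, age of the oldest job, success and failure counts) is left to whatever monitoring is already in place.

```ruby
# Thresholds taken from the metrics list; all names here are hypothetical.
MAX_QUEUE_DEPTH = 1000
MAX_OLDEST_JOB_AGE = 7 * 24 * 60 * 60 # 7 days, in seconds
MAX_FAILURE_RATE = 0.10

# Returns the alerts triggered by the given queue stats as a list of symbols.
def maintenance_alerts(depth:, oldest_job_age:, failed:, succeeded:)
  alerts = []
  alerts << :queue_depth if depth > MAX_QUEUE_DEPTH
  alerts << :job_age if oldest_job_age > MAX_OLDEST_JOB_AGE

  # Guard against division by zero when no jobs have been attempted yet.
  attempted = failed + succeeded
  failure_rate = attempted.zero? ? 0.0 : failed.to_f / attempted
  alerts << :failure_rate if failure_rate > MAX_FAILURE_RATE

  alerts
end

# Deep queue, but the oldest job is 3 days old and 5% of jobs failed.
maintenance_alerts(depth: 1500, oldest_job_age: 3 * 86_400, failed: 5, succeeded: 95)
# returns [:queue_depth]
```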

Acceptable delays:

  • Hours: Cleanup tasks can wait
  • Days: Audit tasks can be delayed
  • Weeks: Historical reports can be very delayed

Not acceptable:

  • Unbounded queue growth
  • Repeated job failures
  • Jobs that never complete (everything should finish eventually, within weeks)

Success Criteria

  • All maintenance tasks identified and moved to new queue
  • Maintenance queue has lowest priority (weight 1-2)
  • Can pause/resume maintenance queue without customer impact
  • Clear documentation for future maintenance jobs
  • No customer-facing operations in maintenance queue

Related

  • Parent epic: &19587
  • Related: #14269 (split cron queue)
  • Related: #14273 (default queue audit)