Create low-priority maintenance queue for cleanup and audit tasks
Problem
Currently, maintenance and cleanup tasks are mixed with business-critical operations in various queues, particularly in the cron queue. These tasks include:
- Data cleanup operations
- Audit logging and verification
- Non-critical synchronization
- Archive operations
- Test data cleanup
- Weekly/monthly reports
Issues:
- Maintenance tasks can delay business-critical operations
- No clear separation between operational and maintenance work
- Difficult to schedule maintenance during low-traffic periods
- Can't easily throttle or pause maintenance work during incidents
Proposal
Create a dedicated low-priority queue for maintenance, cleanup, and audit tasks that can run when system resources are available.
New Queue: maintenance (weight 1-2)
Purpose: Non-urgent background tasks that improve system health but don't directly impact customers
Characteristics:
- Lowest priority (or second-lowest after action_mailbox queues)
- Can be paused during incidents without customer impact
- Ideal for running during off-peak hours
- Should not block any customer-facing operations
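To make the effect of a weight of 1-2 concrete, here is a rough sketch of how much polling attention the queue would get. It assumes weighted random polling, where a queue is checked first in proportion to its weight. Only `maintenance` and the two action_mailbox queues come from this proposal; the other queue names and all weights above 2 are placeholders, as is the `polling_share` helper.

```ruby
# Hypothetical queue weights. Only `maintenance` and the action_mailbox
# queues appear in this proposal; the other entries are placeholders.
QUEUE_WEIGHTS = {
  "critical" => 10,                   # assumed
  "default" => 5,                     # assumed
  "cron" => 3,                        # assumed
  "maintenance" => 2,
  "action_mailbox_routing" => 1,
  "action_mailbox_incineration" => 1
}.freeze

# Approximate fraction of fetch attempts that check a queue first, assuming
# weighted random polling and that every queue has work waiting.
def polling_share(weights, queue)
  weights.fetch(queue).to_f / weights.values.sum
end

share = polling_share(QUEUE_WEIGHTS, "maintenance")
puts format("maintenance is checked first about %.1f%% of the time", share * 100)
```

A low weight reduces how often maintenance work is picked up, but it does not stop a long-running maintenance job from occupying a worker thread once it has started.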
Jobs to Move Here
From cron queue:
- Quality::TestAccountCleanupCronJob - Test data cleanup
- Cron::Zuora::LocalCopyAuditJob - Data consistency audits
- Cron::ErrorMonitorings::WeeklyReportJob - Weekly reporting
- AuditProvisionsCronJob - Provision auditing
From other queues (if applicable):
- Data archival jobs
- Log cleanup operations
- Stale record cleanup
- Database maintenance tasks
- Cache warming operations (non-critical)
Future additions:
- Any new audit or cleanup jobs
- Performance optimization tasks
- Data quality checks
- Metrics aggregation (non-real-time)
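To surface further candidates, a name-based heuristic could be a starting point before reviewing each job by hand. The sketch below is only that: the pattern list and the `maintenance_candidates` helper are assumptions, and `Billing::ChargeCustomerJob` is a made-up operational job used for contrast.

```ruby
# Name fragments that hint at maintenance work. This list is an assumption
# derived from the job categories above, not an exhaustive rule.
MAINTENANCE_HINTS = /Cleanup|Audit|Archiv|Report|Purge|Prune/.freeze

# Returns the class names that look like maintenance candidates.
def maintenance_candidates(job_class_names)
  job_class_names.grep(MAINTENANCE_HINTS)
end

jobs = [
  "Quality::TestAccountCleanupCronJob",
  "Cron::Zuora::LocalCopyAuditJob",
  "Cron::ErrorMonitorings::WeeklyReportJob",
  "AuditProvisionsCronJob",
  "Billing::ChargeCustomerJob" # hypothetical operational job
]

puts maintenance_candidates(jobs)
```

Name matching will produce false positives (for example a customer-facing report), so each hit still needs a check against the queue assignment guidelines.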
Benefits
- Better resource utilization: Maintenance runs when system has capacity
- Improved reliability: Critical operations are far less likely to be delayed by cleanup tasks
- Easier incident management: Can pause maintenance queue during incidents
- Clear separation: Obvious distinction between operational and maintenance work
- Flexible scheduling: Can adjust maintenance queue processing based on load
Implementation Steps
- Identify all maintenance tasks:
  - Audit current cron jobs
  - Search for cleanup/audit jobs in codebase
  - Categorize by urgency and customer impact
- Create base job class:

      # app/jobs/maintenance/base_job.rb
      module Maintenance
        class BaseJob < ApplicationJob
          queue_as :maintenance

          # Common configuration for maintenance jobs
          # - Lower retry attempts
          # - Longer timeouts acceptable
          # - Can be safely discarded if queue too deep
        end
      end
- Update job classes:

      # Example: Test cleanup
      class Quality::TestAccountCleanupCronJob < Maintenance::BaseJob
        def perform
          # Cleanup logic
        end
      end
- Update config/sidekiq.yml:

      :queues:
        # ... higher priority queues ...
        - [maintenance, 2] # or 1, depending on action_mailbox priority
        - [action_mailbox_routing, 1]
        - [action_mailbox_incineration, 1]
- Add queue management:

      # Ability to pause/resume maintenance queue
      # Useful during incidents or high-load periods
      module MaintenanceQueue
        def self.pause!
          # Pause processing
        end

        def self.resume!
          # Resume processing
        end
      end
- Document guidelines:
  - When to use maintenance queue
  - How to pause/resume during incidents
  - Expected SLAs (can be hours or days)
  - Examples of maintenance vs. operational jobs
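The pause/resume step leaves the mechanism open. One possible shape, sketched here under stated assumptions, is a shared flag that maintenance jobs consult before doing any work. `MemoryStore` is an in-memory stand-in for a shared store such as Redis, and the key name and the `paused?` and `run_unless_paused` helpers are hypothetical additions to the `MaintenanceQueue` module outlined above.

```ruby
# In-memory stand-in for a shared store such as Redis (assumption: a real
# deployment would need a store visible to every worker process).
class MemoryStore
  def initialize
    @data = {}
  end

  def set(key, value)
    @data[key] = value
  end

  def get(key)
    @data[key]
  end

  def delete(key)
    @data.delete(key)
  end
end

module MaintenanceQueue
  PAUSE_KEY = "maintenance_queue:paused" # hypothetical key name

  class << self
    attr_writer :store

    def store
      @store ||= MemoryStore.new
    end

    def pause!
      store.set(PAUSE_KEY, "1")
    end

    def resume!
      store.delete(PAUSE_KEY)
    end

    def paused?
      !store.get(PAUSE_KEY).nil?
    end

    # Runs the block only while the queue is not paused. Returns :skipped
    # otherwise so the caller can decide to re-enqueue for later.
    def run_unless_paused
      return :skipped if paused?

      yield
    end
  end
end

MaintenanceQueue.pause!
puts MaintenanceQueue.run_unless_paused { :ran }.inspect
MaintenanceQueue.resume!
puts MaintenanceQueue.run_unless_paused { :ran }.inspect
```

A flag like this makes jobs return early rather than stopping workers from fetching them, so paused jobs would need to be re-enqueued or rescheduled. If the queueing backend offers a native pause, that is likely the simpler option.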
Queue Assignment Guidelines
Use maintenance queue for:
- ✅ Data cleanup (old records, test data)
- ✅ Audit and verification tasks
- ✅ Non-critical synchronization
- ✅ Report generation (weekly, monthly)
- ✅ Archive operations
- ✅ Performance optimization tasks
- ✅ Data quality checks
Do NOT use maintenance queue for:
- ❌ Customer-facing operations
- ❌ Revenue-impacting tasks
- ❌ Time-sensitive notifications
- ❌ Real-time synchronization
- ❌ Security-critical operations
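The two lists above can be read as a simple rule: a job qualifies only if it matches at least one maintenance category and none of the exclusions. A minimal sketch of that rule follows; the tag names and the `maintenance_eligible?` helper are hypothetical and exist only to illustrate the decision.

```ruby
# Tags mirroring the "use for" list above (names are hypothetical).
MAINTENANCE_TAGS = %i[
  cleanup audit non_critical_sync report archive optimization data_quality
].freeze

# Tags mirroring the "do NOT use for" list above (names are hypothetical).
DISQUALIFYING_TAGS = %i[
  customer_facing revenue_impacting time_sensitive real_time_sync security_critical
].freeze

# A job is eligible when it has at least one maintenance tag and no
# disqualifying tag. Exclusions always win over inclusions.
def maintenance_eligible?(tags)
  (tags & MAINTENANCE_TAGS).any? && (tags & DISQUALIFYING_TAGS).empty?
end

puts maintenance_eligible?(%i[cleanup])
puts maintenance_eligible?(%i[report time_sensitive])
```

Letting exclusions win means a time-sensitive report stays on an operational queue even though report generation is normally maintenance work.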
Monitoring and Alerting
Metrics to track:
- Queue depth (alert if > 1000 jobs)
- Job age (alert if oldest job > 7 days)
- Failure rate (alert if > 10%)
- Processing rate (jobs per hour)
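The alert thresholds above could be checked along these lines. This is a sketch, not a monitoring integration: the `stats` hash and its keys are hypothetical, and how depth, job age, and failure counts are collected depends on the monitoring stack in use.

```ruby
# Thresholds taken from the list above.
MAX_QUEUE_DEPTH = 1000
MAX_OLDEST_JOB_AGE_SECONDS = 7 * 24 * 60 * 60 # 7 days
MAX_FAILURE_RATE = 0.10

# Returns the names of the thresholds that `stats` exceeds.
# `stats` is a hypothetical hash with :depth, :oldest_job_age_seconds,
# :succeeded and :failed keys.
def maintenance_alerts(stats)
  alerts = []
  alerts << :queue_depth if stats[:depth] > MAX_QUEUE_DEPTH
  alerts << :job_age if stats[:oldest_job_age_seconds] > MAX_OLDEST_JOB_AGE_SECONDS

  processed = stats[:succeeded] + stats[:failed]
  if processed.positive? && stats[:failed].to_f / processed > MAX_FAILURE_RATE
    alerts << :failure_rate
  end

  alerts
end

sample = { depth: 1200, oldest_job_age_seconds: 3600, succeeded: 95, failed: 5 }
puts maintenance_alerts(sample).inspect
```

The failure-rate check is skipped when nothing has been processed, which avoids a division by zero on an idle or paused queue.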
Acceptable delays:
- Hours: Cleanup tasks can wait
- Days: Audit tasks can be delayed
- Weeks: Historical reports can be very delayed
Not acceptable:
- Unbounded queue growth
- Jobs that fail repeatedly
- Jobs that never complete (everything should finish eventually, within weeks)
Success Criteria
- All maintenance tasks identified and moved to new queue
- Maintenance queue has lowest priority (weight 1-2)
- Maintenance queue can be paused and resumed without customer impact
- Clear documentation for future maintenance jobs
- No customer-facing operations in maintenance queue