Analyze job duration and adjust queue weights based on execution time
Problem
Queue weights should consider job execution time to prevent long-running jobs from starving other queues. Currently, we don't have a systematic approach to factoring job duration into weight assignments.
Potential issues:
- Long-running jobs with high weights can monopolize worker threads
- Quick jobs with low weights may experience unnecessary delays
- No documented relationship between job duration and appropriate weight
Proposal
Analyze job execution times across all queues and adjust weights to balance throughput and fairness.
Analysis Needed
For each queue, gather metrics on:
- Job duration (P50, P95, P99 percentiles)
- Queue depth during normal operations
- Job frequency (jobs per hour/day)
- Failure rates and retry patterns
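The duration percentiles listed above could be computed from raw samples along these lines. This is a minimal sketch, assuming durations have already been exported as an array of seconds per queue; the `percentile` and `duration_summary` helpers and the sample values are hypothetical and not part of any existing codebase.

```ruby
# Nearest-rank percentile over a list of job durations (in seconds).
# Assumption: samples were exported from whatever metrics store is in use.
def percentile(samples, pct)
  raise ArgumentError, "no samples" if samples.empty?

  sorted = samples.sort
  rank = (pct / 100.0 * sorted.size).ceil
  sorted[[rank, 1].max - 1]
end

# Summarize one queue's durations with the percentiles named in this issue.
def duration_summary(samples)
  {
    p50: percentile(samples, 50),
    p95: percentile(samples, 95),
    p99: percentile(samples, 99)
  }
end

# Hypothetical samples for a single queue: mostly fast, with two slow outliers.
durations = [0.2, 0.3, 0.25, 0.4, 0.35, 0.3, 0.28, 0.5, 4.0, 12.0]
summary = duration_summary(durations)
puts summary
```

With these sample values the P50 is 0.3 seconds while P95 and P99 are both 12.0 seconds, which illustrates why the tail percentiles matter more than the median when judging whether a queue can block others.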
Queues to Investigate
Potentially long-running (may need lower weights):
- usage_billing (weight 2): ClickHouse operations, data processing
  - Billing::Usage::ConsumptionJob
  - Billing::Usage::EnrichmentJob
  - ExportChDataToS3Job
- salesforce (weight 4): External API calls with potential timeouts
  - Salesforce::CreateOpportunityJob
  - Salesforce::CreateQuoteForReconciliationJob
- zuora (weight 4): Complex synchronization operations
  - Zuora::RefreshLocalSubscriptionsJob
  - Zuora::SyncResourceJob
Potentially quick (could have higher weights):
- mailers (weight 2): Email delivery (usually fast)
- expiration (weight 3): Simple status updates
- health_check (weight 4): Quick health checks
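To see why duration matters alongside weight, a rough model can help. The sketch below assumes that all queues are backlogged and that a worker picks the next queue with probability proportional to its weight. Under that simplification, the share of worker time a queue consumes is proportional to weight times average duration. The average durations used here are hypothetical placeholders, not measured values, and `worker_time_share` is an illustrative helper.

```ruby
# Simplified model (an assumption, not a measurement):
#   share_i = weight_i * avg_seconds_i / sum(weight_j * avg_seconds_j)
# It only holds while every queue has work waiting.
def worker_time_share(queues)
  total = queues.sum { |q| q[:weight] * q[:avg_seconds] }
  queues.to_h { |q| [q[:name], (q[:weight] * q[:avg_seconds] / total).round(3)] }
end

# Weights come from this issue; avg_seconds values are hypothetical.
queues = [
  { name: "usage_billing", weight: 2, avg_seconds: 30.0 },
  { name: "mailers",       weight: 2, avg_seconds: 0.5 },
  { name: "health_check",  weight: 4, avg_seconds: 0.1 }
]

shares = worker_time_share(queues)
shares.each { |name, share| puts "#{name}: #{(share * 100).round(1)}% of worker time" }
```

Under these made-up durations, usage_billing would take the large majority of worker time despite having the lowest weight, which is the effect the analysis should confirm or rule out with real data.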
Weight Assignment Guidelines
Based on analysis, establish guidelines like:
Quick jobs (< 1 second average):
- Can have higher weights (7-10) without blocking
- Examples: Health checks, simple notifications, status updates
Medium jobs (1-10 seconds average):
- Moderate weights (4-6) appropriate
- Examples: API calls, database operations, email sending
Long jobs (10-30 seconds average):
- Lower weights (2-3) to prevent starvation
- Examples: Bulk data processing, complex synchronization, report generation
Very long jobs (> 30 seconds average):
- Lowest weights (1-2) or consider breaking into smaller jobs
- Examples: Large data exports, comprehensive audits
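The draft tiers above could be encoded as a small lookup so that proposals stay consistent. This is only a sketch: `suggested_weights` is a hypothetical helper, the thresholds are the provisional ones from this issue, and it assumes the tiers do not overlap (anything over 30 seconds counts as very long, not long). Business priority still has to be weighed separately.

```ruby
# Map an average job duration (seconds) to the draft weight range.
# Thresholds mirror the provisional guidelines and may change after analysis.
def suggested_weights(avg_seconds)
  raise ArgumentError, "duration must be non-negative" if avg_seconds.negative?

  if avg_seconds < 1
    { tier: :quick, weights: 7..10 }
  elsif avg_seconds <= 10
    { tier: :medium, weights: 4..6 }
  elsif avg_seconds <= 30
    { tier: :long, weights: 2..3 }
  else
    { tier: :very_long, weights: 1..2 }
  end
end

# Hypothetical averages, for illustration only.
[0.4, 5, 12.0, 45].each do |avg|
  result = suggested_weights(avg)
  puts "#{avg}s -> #{result[:tier]} (weights #{result[:weights]})"
end
```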
Implementation Steps
1. Gather production metrics (last 30 days):
   # Example query for Sidekiq metrics
   # - Job duration by queue
   # - Queue depth over time
   # - Job throughput
2. Analyze patterns:
   - Identify queues with high variance in job duration
   - Find queues where long jobs block quick jobs
   - Look for correlation between queue depth and job duration
3. Propose weight adjustments:
   - Document current vs. proposed weights
   - Explain rationale based on metrics
   - Consider business priority alongside duration
4. Test in staging:
   - Simulate production load
   - Measure impact on queue latency
   - Verify no unintended consequences
5. Monitor after deployment:
   - Track queue depth changes
   - Monitor job latency (enqueue to execution time)
   - Watch for customer-reported issues
6. Document findings:
   - Create guidelines for future queue weight assignments
   - Include typical job durations for each queue
   - Establish process for periodic review
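For the "Analyze patterns" step, one possible way to flag queues with high variance in job duration is the coefficient of variation (standard deviation divided by mean). The sketch below assumes per-queue duration samples are already available as arrays of seconds. The helper names, the threshold of 1.0, and the sample data are all hypothetical choices to be tuned against real metrics.

```ruby
# Coefficient of variation: population standard deviation relative to the mean.
def coefficient_of_variation(samples)
  return 0.0 if samples.empty?

  mean = samples.sum / samples.size.to_f
  return 0.0 if mean.zero?

  variance = samples.sum { |s| (s - mean)**2 } / samples.size
  Math.sqrt(variance) / mean
end

# Return the names of queues whose durations vary more than the threshold.
# The default threshold is an arbitrary starting point, not a recommendation.
def high_variance_queues(durations_by_queue, threshold: 1.0)
  durations_by_queue
    .select { |_name, samples| samples.size > 1 && coefficient_of_variation(samples) > threshold }
    .keys
end

# Hypothetical samples: one steady queue, one with an occasional slow job.
samples = {
  "health_check" => [0.1, 0.12, 0.09, 0.11],
  "zuora"        => [0.5, 0.6, 45.0, 0.4]
}

flagged = high_variance_queues(samples)
puts "High-variance queues: #{flagged.join(', ')}"
```

A queue flagged this way mixes quick and slow jobs, so splitting it or breaking up the slow jobs may help more than changing its weight.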
Success Criteria
- All queues have documented job durations (average plus P50, P95, and P99)
- Weight assignments consider both business priority and execution time
- No queue experiences starvation due to long-running jobs in higher-priority queues
- Clear guidelines exist for assigning weights to new queues