Analyze job duration and adjust queue weights based on execution time

Problem

Queue weights should account for job execution time so that long-running jobs do not starve other queues. We currently have no systematic way to factor job duration into weight assignments.

Potential issues:

  • Long-running jobs with high weights can monopolize worker threads
  • Quick jobs with low weights may experience unnecessary delays
  • No documented relationship between job duration and appropriate weight

Proposal

Analyze job execution times across all queues and adjust weights to balance throughput and fairness.

Analysis Needed

For each queue, gather metrics on:

  1. Job duration (P50, P95, P99 percentiles)
  2. Queue depth during normal operations
  3. Job frequency (jobs per hour/day)
  4. Failure rates and retry patterns
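
The duration percentiles in item 1 could be computed from raw samples along these lines. This is a minimal sketch using the nearest-rank method; `percentile` and `duration_summary` are hypothetical helpers, and the input is assumed to be a hash of queue name to an array of durations in seconds, however those end up being collected.

```ruby
# Nearest-rank percentile: the smallest sample such that at least pct percent
# of all samples are less than or equal to it. Expects a sorted array.
def percentile(sorted, pct)
  return nil if sorted.empty?

  rank = (pct * sorted.size / 100.0).ceil
  sorted[[rank, 1].max - 1]
end

# durations_by_queue: { "queue_name" => [duration_in_seconds, ...] }
def duration_summary(durations_by_queue)
  durations_by_queue.transform_values do |durations|
    sorted = durations.sort
    {
      count: sorted.size,
      p50: percentile(sorted, 50),
      p95: percentile(sorted, 95),
      p99: percentile(sorted, 99)
    }
  end
end
```

Comparing P50 with P99 per queue is also a quick way to spot queues with high variance in job duration.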

Queues to Investigate

Potentially long-running (may need lower weights):

  • usage_billing (weight 2): ClickHouse operations, data processing
    • Billing::Usage::ConsumptionJob
    • Billing::Usage::EnrichmentJob
    • ExportChDataToS3Job
  • salesforce (weight 4): External API calls with potential timeouts
    • Salesforce::CreateOpportunityJob
    • Salesforce::CreateQuoteForReconciliationJob
  • zuora (weight 4): Complex synchronization operations
    • Zuora::RefreshLocalSubscriptionsJob
    • Zuora::SyncResourceJob

Potentially quick (could have higher weights):

  • mailers (weight 2): Email delivery (usually fast)
  • expiration (weight 3): Simple status updates
  • health_check (weight 4): Quick health checks
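
To illustrate why duration matters for the queues listed above, here is a rough Ruby sketch of how thread time divides between queues. It assumes Sidekiq's weighted mode, where a queue is chosen with probability proportional to its weight, and that every queue always has work waiting. The weights come from the list above; the mean durations are placeholders, not measurements, and `pick_share` and `thread_time_share` are hypothetical helper names.

```ruby
# Weights are taken from the queue list above.
# mean_s values are PLACEHOLDERS for illustration, not measured durations.
QUEUES = {
  "usage_billing" => { weight: 2, mean_s: 30.0 },
  "mailers"       => { weight: 2, mean_s: 0.5 },
  "health_check"  => { weight: 4, mean_s: 0.1 }
}.freeze

# Fraction of fetches that land on each queue (proportional to weight).
def pick_share(queues)
  total = queues.sum { |_, q| q[:weight] }.to_f
  queues.transform_values { |q| q[:weight] / total }
end

# Fraction of busy thread time each queue consumes when all queues are
# saturated: proportional to weight * mean duration.
def thread_time_share(queues)
  total = queues.sum { |_, q| q[:weight] * q[:mean_s] }
  queues.transform_values { |q| q[:weight] * q[:mean_s] / total }
end

if __FILE__ == $PROGRAM_NAME
  picks = pick_share(QUEUES)
  time  = thread_time_share(QUEUES)
  QUEUES.each_key do |name|
    puts format("%-14s picked %5.1f%%  thread time %5.1f%%",
                name, picks[name] * 100, time[name] * 100)
  end
end
```

With these placeholder numbers, usage_billing is picked on a quarter of fetches yet accounts for most of the busy thread time, which is the monopolization risk described in the problem statement.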

Weight Assignment Guidelines

Based on analysis, establish guidelines like:

Quick jobs (< 1 second average):

  • Can have higher weights (7-10) without blocking
  • Examples: Health checks, simple notifications, status updates

Medium jobs (1-10 seconds average):

  • Moderate weights (4-6) appropriate
  • Examples: API calls, database operations, email sending

Long jobs (10-30 seconds average):

  • Lower weights (2-3) to prevent starvation
  • Examples: Bulk data processing, complex synchronization, report generation

Very long jobs (> 30 seconds average):

  • Lowest weights (1-2), or consider breaking the work into smaller jobs
  • Examples: Large data exports, comprehensive audits
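
As a sketch, the tiers above can be written as a lookup from average duration to a suggested weight range. The boundary handling is an assumption: averages of exactly 1 second and exactly 10 seconds both count as medium, and anything above 30 seconds counts as very long. `WEIGHT_TIERS` and `suggested_tier` are hypothetical names, and the result is a starting point to be weighed against business priority rather than a final weight.

```ruby
# Tiers are checked in order; the first matching tier wins.
WEIGHT_TIERS = [
  { name: :quick,     matches: ->(avg_s) { avg_s < 1.0 },   weights: 7..10 },
  { name: :medium,    matches: ->(avg_s) { avg_s <= 10.0 }, weights: 4..6 },
  { name: :long,      matches: ->(avg_s) { avg_s <= 30.0 }, weights: 2..3 },
  { name: :very_long, matches: ->(_avg_s) { true },         weights: 1..2 }
].freeze

# Returns the tier hash for an average job duration given in seconds.
def suggested_tier(avg_duration_s)
  raise ArgumentError, "duration must be non-negative" if avg_duration_s.negative?

  WEIGHT_TIERS.find { |tier| tier[:matches].call(avg_duration_s) }
end

tier = suggested_tier(12.5)
puts "#{tier[:name]}: suggested weight #{tier[:weights].min}-#{tier[:weights].max}"
# prints "long: suggested weight 2-3"
```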

Implementation Steps

  1. Gather production metrics (last 30 days):

    # Example query for Sidekiq metrics
    # - Job duration by queue
    # - Queue depth over time
    # - Job throughput
  2. Analyze patterns:

    • Identify queues with high variance in job duration
    • Find queues where long jobs block quick jobs
    • Look for correlation between queue depth and job duration
  3. Propose weight adjustments:

    • Document current vs. proposed weights
    • Explain rationale based on metrics
    • Consider business priority alongside duration
  4. Test in staging:

    • Simulate production load
    • Measure impact on queue latency
    • Verify no unintended consequences
  5. Monitor after deployment:

    • Track queue depth changes
    • Monitor job latency (time from enqueue to start of execution)
    • Watch for customer-reported issues
  6. Document findings:

    • Create guidelines for future queue weight assignments
    • Include typical job durations for each queue
    • Establish process for periodic review
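
For step 1, one way to collect per-queue duration and latency is a Sidekiq server middleware. The sketch below is written to the `call(job_instance, payload, queue)` shape that Sidekiq server middleware uses, but it runs without the gem so it can be tried in isolation. `JobTimingRecorder` is a hypothetical class, samples are kept in memory purely for illustration, and `enqueued_at` is assumed to hold epoch seconds; check the payload format for your Sidekiq version before relying on the latency figure.

```ruby
class JobTimingRecorder
  Sample = Struct.new(:queue, :job_class, :duration_s, :latency_s)

  attr_reader :samples

  # wall_clock is injectable so the latency calculation can be tested.
  def initialize(wall_clock: -> { Time.now.to_f })
    @wall_clock = wall_clock
    @samples = []
    @mutex = Mutex.new
  end

  def call(_job_instance, payload, queue)
    # Latency: time the job waited between enqueue and start of execution.
    # ASSUMPTION: enqueued_at is epoch seconds.
    enqueued_at = payload["enqueued_at"]
    latency = enqueued_at ? @wall_clock.call - enqueued_at.to_f : nil
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
  ensure
    # Record the sample even when the job raises, so failed runs are counted.
    if started
      duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
      @mutex.synchronize do
        @samples << Sample.new(queue, payload["class"], duration, latency)
      end
    end
  end
end
```

In a real deployment the in-memory array would be replaced by a call to whatever metrics client the application uses, and the collected durations could then feed the percentile analysis and the weight proposals in steps 2 and 3.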

Success Criteria

  • All queues have documented job durations (average plus P50, P95, and P99)
  • Weight assignments consider both business priority and execution time
  • No queue experiences starvation due to long-running jobs in higher-priority queues
  • Clear guidelines exist for assigning weights to new queues

Related

  • Parent epic: &19587
  • Related: #14268 (weight granularity)
  • Related: #14270 (user-facing vs internal)