[go: up one dir, main page]

Skip to content

Add granular instrumentation to PostReceive worker for performance analysis

Part of Improve PostReceive Worker Performance to Meet ... (&19159)

Summary

To effectively optimize the PostReceive worker performance, we need detailed instrumentation to identify which specific components are causing performance bottlenecks. Current logging provides high-level metrics but lacks granular visibility into individual operations.

Problem

The PostReceive worker performs multiple tasks, but we cannot determine which specific operations are slow:

  • Cache expiration operations
  • Repository operations
  • CI pipeline creation (partially addressed)
  • Event processing
  • Merge request updates
  • Lock acquisition and holding

From the performance data, we see three main contributors:

  • redis_duration_s: Up to 9.8 seconds
  • db_duration_s: Variable database query times
  • cpu_s: Up to 11.6 seconds of CPU time

Proposal

We could add extra logging behind a feature flag and enable it for only a small percentage of requests. This should help us gather more information without significantly increasing log size.

from thread

Add detailed instrumentation to measure execution time for each major component:

  1. Cache Operations

    • repository.expire_caches_for_tags timing
    • Branch name cache operations
    • Repository cache operations
  2. Lock Operations

    • Lock acquisition wait time
    • Lock hold duration by operation type
    • Lock contention analysis
  3. Database Operations

    • Query-level timing for major operations
    • Transaction duration breakdown
  4. Event Processing

    • Individual event creation timing
    • Bulk operation performance
  5. Integration Points

    • External validation calls
    • Gitaly operation breakdown
    • Redis operation categorization

Implementation Details

  • Add instrumentation behind a feature flag (detailed_post_receive_instrumentation)
  • Enable for a small percentage of requests - 1%
  • Use structured logging with consistent field naming
  • Include correlation IDs for tracing related operations
  • Try using #log_hash_metadata_on_job_done to collect information on the number of refs processed.

Acceptance Criteria

  • Detailed timing logs for all major PostReceive components
  • Feature flag implementation for controlled rollout
  • Structured logging format for easy analysis
  • Validation on staging environment
Edited by Vasilii Iakliushin