Add granular instrumentation to PostReceive worker for performance analysis

Part of Improve PostReceive Worker Performance to Meet ... (&19159)

Summary

To effectively optimize the PostReceive worker performance, we need detailed instrumentation to identify which specific components are causing performance bottlenecks. Current logging provides high-level metrics but lacks granular visibility into individual operations.

Problem

The PostReceive worker performs multiple tasks, but we cannot determine which specific operations are slow:

Cache expiration operations
Repository operations
CI pipeline creation (partially addressed)
Event processing
Merge request updates
Lock acquisition and holding

From the performance data, we see three main contributors:

redis_duration_s: Up to 9.8 seconds
db_duration_s: Variable database query times
cpu_s: Up to 11.6 seconds of CPU time

Proposal

We could add extra logging behind a feature flag and enable it for only a small percentage of requests. This should help us gather more information without significantly increasing log size.

from thread

Add detailed instrumentation to measure execution time for each major component:

Cache Operations
- repository.expire_caches_for_tags timing
- Branch name cache operations
- Repository cache operations
Lock Operations
- Lock acquisition wait time
- Lock hold duration by operation type
- Lock contention analysis
Database Operations
- Query-level timing for major operations
- Transaction duration breakdown
Event Processing
- Individual event creation timing
- Bulk operation performance
Integration Points
- External validation calls
- Gitaly operation breakdown
- Redis operation categorization

Implementation Details

Add instrumentation behind a feature flag (detailed_post_receive_instrumentation)
Enable for a small percentage of requests - 1%
Use structured logging with consistent field naming
Include correlation IDs for tracing related operations
Try using #log_hash_metadata_on_job_done to collect information on the number of refs processed.

Acceptance Criteria

Detailed timing logs for all major PostReceive components
Feature flag implementation for controlled rollout
Structured logging format for easy analysis
Validation on staging environment

Edited Sep 02, 2025 by Vasilii Iakliushin