Add granular instrumentation to PostReceive worker for performance analysis
Part of Improve PostReceive Worker Performance to Meet ... (&19159)
Summary
To effectively optimize the PostReceive
worker performance, we need detailed instrumentation to identify which specific components are causing performance bottlenecks. Current logging provides high-level metrics but lacks granular visibility into individual operations.
Problem
The PostReceive
worker performs multiple tasks, but we cannot determine which specific operations are slow:
- Cache expiration operations
- Repository operations
- CI pipeline creation (partially addressed)
- Event processing
- Merge request updates
- Lock acquisition and holding
From the performance data, we see three main contributors:
-
redis_duration_s
: Up to 9.8 seconds -
db_duration_s
: Variable database query times -
cpu_s
: Up to 11.6 seconds of CPU time
Proposal
We could add extra logging behind a feature flag and enable it for only a small percentage of requests. This should help us gather more information without significantly increasing log size.
Add detailed instrumentation to measure execution time for each major component:
-
Cache Operations
-
repository.expire_caches_for_tags
timing - Branch name cache operations
- Repository cache operations
-
-
Lock Operations
- Lock acquisition wait time
- Lock hold duration by operation type
- Lock contention analysis
-
Database Operations
- Query-level timing for major operations
- Transaction duration breakdown
-
Event Processing
- Individual event creation timing
- Bulk operation performance
-
Integration Points
- External validation calls
- Gitaly operation breakdown
- Redis operation categorization
Implementation Details
- Add instrumentation behind a feature flag (
detailed_post_receive_instrumentation
) - Enable for a small percentage of requests - 1%
- Use structured logging with consistent field naming
- Include correlation IDs for tracing related operations
- Try using
#log_hash_metadata_on_job_done
to collect information on the number of refs processed.
Acceptance Criteria
-
Detailed timing logs for all major PostReceive components -
Feature flag implementation for controlled rollout -
Structured logging format for easy analysis -
Validation on staging environment