Fix race condition in load balancing sticking for concurrent requests
What does this MR do and why?
When a job transitions from created to pending state, it is inserted
into the ci_pending_builds table. UpdateBuildQueueService#tick_for
then calls runner.pick_build!, which sticks the runner to the primary
by setting a new sticking point (LSN) in Redis.
RegisterJobService#execute uses find_caught_up_replica to find a
replica that has caught up to this LSN, ensuring it can see the newly
pending build.
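To make the caught-up check concrete, here is a minimal sketch of the idea behind find_caught_up_replica. The names and the Struct are hypothetical, not GitLab's implementation: each replica reports the last WAL location it has replayed, and a replica is only usable if it has replayed at least the sticking point's LSN. PostgreSQL LSNs of the form "135/54EBC478" compare numerically by parsing both hex halves.

```ruby
# Illustrative sketch only (hypothetical names, not the GitLab code):
# pick a replica that has replayed at least the sticking point's LSN.
Replica = Struct.new(:name, :replayed_lsn)

# Convert an LSN like "135/54EBC478" into a comparable integer.
# Both halves are hexadecimal; the first is the high 32 bits.
def lsn_to_i(lsn)
  hi, lo = lsn.split('/')
  (hi.to_i(16) << 32) | lo.to_i(16)
end

# Return the first replica caught up to the sticking point, or nil
# if none has replayed that far yet.
def find_caught_up_replica(replicas, sticking_lsn)
  replicas.find { |r| lsn_to_i(r.replayed_lsn) >= lsn_to_i(sticking_lsn) }
end
```

When this returns nil, the caller has to fall back to the primary (or wait); when it returns a replica, the queue query is guaranteed to see the newly pending build.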
The race condition occurs when:
- RegisterJobService#execute reads the sticking point (LSN) and verifies all replicas are caught up, then calls unstick() to remove the LSN from Redis.
- Between the verification and the unstick, another job transitions to pending and UpdateBuildQueueService#tick_for sets a new sticking point.
- RegisterJobService#execute continues with a queue query using a random replica (since no sticking point exists).
- The random replica may not yet have caught up to the new LSN, so it sees the build in a different state (e.g. created) and removes it from the ci_pending_builds table.
The fix uses an atomic Lua script to only unstick if the sticking point hasn't changed since we read it. This prevents unsticking when a concurrent write has set a new LSN, ensuring the queue query uses a replica that is caught up to the latest state.
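The compare-and-unset semantics can be sketched in plain Ruby. This is an illustrative model only: the store is simulated with a Hash so the example is self-contained, whereas the real fix performs the compare-and-delete atomically server-side with the Lua script (a Hash has no concurrent writers, so no atomicity is needed here). The class and method names mirror the console session below but are simplified.

```ruby
# Illustrative model of compare-and-unset sticking (not the GitLab
# implementation): delete the sticking key only if it still holds the
# LSN we previously read. Returns 1 if removed, 0 if a concurrent
# writer already replaced the LSN, mirroring the Lua script's returns.
class Sticking
  def initialize(store = {})
    @store = store
  end

  def stick(namespace, id, lsn)
    @store[key(namespace, id)] = lsn
  end

  def unstick_if_caught_up(namespace, id, expected_lsn)
    k = key(namespace, id)
    if @store[k] == expected_lsn
      @store.delete(k)
      1
    else
      0
    end
  end

  private

  def key(namespace, id)
    "database-load-balancing/write-location/ci/#{namespace}/#{id}"
  end
end
```

The important property is that a stale unstick (one holding an LSN that has since been overwritten) becomes a no-op, so the new sticking point survives and the subsequent queue query is forced onto a caught-up replica.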
References
gitlab-com/gl-infra/production#20996
How to set up and validate locally
With actual replication
- On an Omnibus instance, I set up database load balancing with a streaming replica. I actually created an alias for 192.168.1.100 so I could run the replica from the same machine. An excerpt from my /etc/gitlab/gitlab.rb:
gitlab_rails['db_load_balancing'] = { 'hosts' => ['192.168.1.100', '192.168.1.101'], 'replica_check_interval' => 60 }
postgresql['listen_address'] = '127.0.0.1'
postgresql['port'] = 5433
postgresql['trust_auth_cidr_addresses'] = ['127.0.0.1/32', '192.168.1.100/24', '192.168.1.101/24']
postgresql['wal_level'] = "replica"
postgresql['max_wal_senders'] = 5
postgresql['max_replication_slots'] = 2
postgresql['sql_replication_password'] = '<some hashed password>'
- Launch a CI job. If you monitor the Redis CLI (e.g. redis-cli -p 6380 monitor), you can see the Lua script in action:
1765584702.442606 [0 127.0.0.1:35040] "eval" "local key = KEYS[1]\nlocal expected_location = ARGV[1]\nlocal current_location = redis.call('GET', key)\n\nif current_location == expected_location then\n redis.call('DEL', key)\n return 1\nelse\n return 0\nend\n" "1" "database-load-balancing/write-location/ci/build/1076" "135/54EBC478"
1765584702.442638 [0 lua] "GET" "database-load-balancing/write-location/ci/build/1076"
1765584702.442647 [0 lua] "DEL" "database-load-balancing/write-location/ci/build/1076"
From the Rails console
From the Rails console, you can also verify that the compare-and-unset method works:
gitlab(prod)> ::Ci::Runner.sticking.stick(:runner, 1)
=> true
gitlab(prod)> lsn = Gitlab::Redis::DbLoadBalancing.with { |redis| redis.get('database-load-balancing/write-location/ci/runner/1') }
=> "135/551002C0"
gitlab(prod)> ::Ci::Runner.sticking.send(:unstick_if_caught_up, :runner, 1, 'hello')
=> 0
gitlab(prod)> lsn = Gitlab::Redis::DbLoadBalancing.with { |redis| redis.get('database-load-balancing/write-location/ci/runner/1') }
=> "135/551002C0"
gitlab(prod)> ::Ci::Runner.sticking.send(:unstick_if_caught_up, :runner, 1, '135/551002C0')
=> 1
gitlab(prod)> lsn = Gitlab::Redis::DbLoadBalancing.with { |redis| redis.get('database-load-balancing/write-location/ci/runner/1') }
=> nil
MR acceptance checklist
Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.