
Fix race condition in load balancing sticking for concurrent requests

What does this MR do and why?

When a job transitions from created to pending state, it is inserted into the ci_pending_builds table. UpdateBuildQueueService#tick_for then calls runner.pick_build!, which sticks the runner to the primary by setting a new sticking point (LSN) in Redis.

RegisterJobService#execute uses find_caught_up_replica to find a replica that has caught up to this LSN, ensuring it can see the newly pending build.

The race condition occurs when:

  1. RegisterJobService#execute reads the sticking point (LSN) and verifies all replicas are caught up, then calls unstick() to remove the LSN from Redis.

  2. Between the verification and the unstick, another job transitions to pending and UpdateBuildQueueService#tick_for sets a new sticking point.

  3. RegisterJobService#execute continues with a queue query using a random replica (since no sticking point exists).

  4. The random replica may not yet have caught up to the new LSN, so it sees the build in a stale state (e.g. created) and removes it from the ci_pending_builds table.
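The lost update in steps 1–4 can be sketched in plain Ruby; a Hash stands in for Redis, the second LSN value is illustrative, and the interleaving of the two services is written out by hand:

```ruby
# Simulates the pre-fix race: an unconditional unstick deletes a sticking
# point that a concurrent tick_for has just refreshed. A plain Hash stands
# in for Redis, so this illustrates the interleaving, not real concurrency.
store = {}
key = 'database-load-balancing/write-location/ci/runner/1'

store[key] = '135/54EBC478'  # tick_for sticks the runner (first job pending)

expected_lsn = store[key]    # RegisterJobService reads the sticking point
# ...all replicas verified as caught up to expected_lsn...

store[key] = '135/55000000'  # concurrent tick_for: a second job turns pending

store.delete(key)            # pre-fix unstick deletes unconditionally

puts store[key].inspect      # prints nil: the new LSN is lost, so the
                             # queue query may run on a lagging replica
```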

The fix uses an atomic Lua script that unsticks only if the sticking point hasn't changed since it was read. This prevents unsticking when a concurrent write has set a new LSN, ensuring the queue query uses a replica that has caught up to the latest write location.
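The script (visible in the Redis monitor output below) and a sketch of how it could be invoked are shown here; the constant name, wrapper method, and the pure-Ruby model are illustrative, not GitLab's actual implementation:

```ruby
# The Lua script runs atomically inside Redis: the GET and the conditional
# DEL cannot be interleaved with a concurrent SET from tick_for.
UNSTICK_IF_UNCHANGED = <<~LUA
  local key = KEYS[1]
  local expected_location = ARGV[1]
  local current_location = redis.call('GET', key)

  if current_location == expected_location then
    redis.call('DEL', key)
    return 1
  else
    return 0
  end
LUA

# Illustrative wrapper; `redis` would be a connection obtained via
# Gitlab::Redis::DbLoadBalancing.with { |r| ... }.
def unstick_if_caught_up(redis, key, expected_lsn)
  redis.eval(UNSTICK_IF_UNCHANGED, keys: [key], argv: [expected_lsn])
end

# Pure-Ruby model of the script's compare-and-unset semantics (a Hash
# stands in for Redis), mirroring the Rails console validation below:
def compare_and_unset(store, key, expected_lsn)
  return 0 unless store[key] == expected_lsn

  store.delete(key)
  1
end

store = { 'lsn-key' => '135/551002C0' }
puts compare_and_unset(store, 'lsn-key', 'hello')        # prints 0; key kept
puts compare_and_unset(store, 'lsn-key', '135/551002C0') # prints 1; key deleted
```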

References

gitlab-com/gl-infra/production#20996

How to set up and validate locally

With actual replication

  1. On an Omnibus instance, I set up database load balancing with a streaming replica. I created an alias for 192.168.1.100 so that I could run the replica on the same machine. An excerpt from my /etc/gitlab/gitlab.rb:
gitlab_rails['db_load_balancing'] = { 'hosts' => ['192.168.1.100', '192.168.1.101'], 'replica_check_interval' => 60 }
postgresql['listen_address'] = '127.0.0.1'
postgresql['port'] = 5433
postgresql['trust_auth_cidr_addresses'] = ['127.0.0.1/32', '192.168.1.100/24', '192.168.1.101/24']
postgresql['wal_level'] = "replica"
postgresql['max_wal_senders'] = 5
postgresql['max_replication_slots'] = 2
postgresql['sql_replication_password'] = '<some hashed password>'
  2. Launch a CI job. If you monitor Redis commands (e.g. with redis-cli -p 6380 monitor), you can see the Lua script in action:
1765584702.442606 [0 127.0.0.1:35040] "eval" "local key = KEYS[1]\nlocal expected_location = ARGV[1]\nlocal current_location = redis.call('GET', key)\n\nif current_location == expected_location then\n  redis.call('DEL', key)\n  return 1\nelse\n  return 0\nend\n" "1" "database-load-balancing/write-location/ci/build/1076" "135/54EBC478"
1765584702.442638 [0 lua] "GET" "database-load-balancing/write-location/ci/build/1076"
1765584702.442647 [0 lua] "DEL" "database-load-balancing/write-location/ci/build/1076"

From the Rails console

You can also verify that the compare-and-unset method works:

gitlab(prod)> ::Ci::Runner.sticking.stick(:runner, 1)
=> true
gitlab(prod)> lsn = Gitlab::Redis::DbLoadBalancing.with { |redis| redis.get('database-load-balancing/write-location/ci/runner/1') }
=> "135/551002C0"
gitlab(prod)> ::Ci::Runner.sticking.send(:unstick_if_caught_up, :runner, 1, 'hello')
=> 0
gitlab(prod)> lsn = Gitlab::Redis::DbLoadBalancing.with { |redis| redis.get('database-load-balancing/write-location/ci/runner/1') }
=> "135/551002C0"
gitlab(prod)> ::Ci::Runner.sticking.send(:unstick_if_caught_up, :runner, 1, '135/551002C0')
=> 1
gitlab(prod)> lsn = Gitlab::Redis::DbLoadBalancing.with { |redis| redis.get('database-load-balancing/write-location/ci/runner/1') }
=> nil

MR acceptance checklist

Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Stan Hu
