Gitlab::RepositorySetCache#fetch is not atomic

Summary

Gitlab::RepositorySetCache#fetch is currently implemented as follows:

    def fetch(key, &block)
      full_key = cache_key(key)

      smembers, exists = with do |redis|
        redis.multi do |multi|
          multi.smembers(full_key)
          multi.exists?(full_key) # rubocop:disable CodeReuse/ActiveRecord
        end
      end

      return smembers if exists

      write(key, yield)
    end

That is:

  1. Atomic read and existence check
  2. Early return if key existed
  3. A possibly slow yield to a block, calculating a value
  4. A write (which is atomic)

This means that two concurrent invocations of #fetch can both read exists == false and then race to write a result.

This can result in stale reads and false negatives for checks such as branch_names_include?. For example, where X is an arbitrary web request or Sidekiq job invoking the cached #branch_names method:

  1. PostReceive 1 clears the branch_names cache
  2. X invokes branch_names, and reads exists == false, but the yield is slow
  3. PostReceive 1 writes the branch_names cache
  4. PostReceive 2 clears the branch_names cache for the push to a new branch
  5. X finally completes the yield and writes the branch names cache
  6. PostReceive 2's invocation of branch_names gets exists == true and a stale smembers that is missing the new branch from its #fetch, i.e. a false negative for branch_names_include?
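
The same interleaving can be replayed by hand against raw Redis commands. The snippet below is only a sketch: the key name and branch names are illustrative, and the plain redis-rb calls stand in for what #fetch and #write do internally (expiry handling omitted).

    require 'redis'

    redis = Redis.new
    key = 'branch_names:race-demo'  # illustrative key, not the real cache key format

    redis.del(key)                  # 1. PostReceive 1 clears the branch_names cache

    smembers = redis.smembers(key)  # 2. X's #fetch reads an empty set ...
    exists   = redis.exists?(key)   #    ... and exists == false, so it starts its slow yield

    redis.sadd(key, 'main')         # 3. PostReceive 1 writes the branch_names cache
    redis.del(key)                  # 4. PostReceive 2 clears it for the push to a new branch

    redis.sadd(key, 'main')         # 5. X completes its yield and writes the branch list
                                    #    it computed back at step 2

    p redis.smembers(key)           # => ["main"]  (the new branch is missing)
    p redis.sismember(key, 'new')   # => false     (step 6's false negative)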

Steps to reproduce

Injecting the following RSpec test under the `describe '#fetch'` block in spec/lib/gitlab/repository_set_cache_spec.rb reproduces the bug:

    it 'is atomic' do
      key = 'race'

      fiber_1 = Fiber.new do
        cache.fetch(key) do
          Fiber.yield
          ['main']
        end
      end

      fiber_2 = Fiber.new do
        cache.fetch(key) do
          %w[main new]
        end
      end

      # Start fiber_1, which will yield before writing
      fiber_1.resume

      # Run fiber_2 to completion
      result_2 = fiber_2.resume

      # Resume fiber_1 to complete its execution
      result_1 = fiber_1.resume

      # Both fibers should have different results, demonstrating the race condition
      expect(result_1).to eq(['main'])
      expect(result_2).to eq(%w[main new])

      # The cache now contains the result from fiber_1, which is stale.
      expect(cache.fetch(key) { ['something'] }).to eq(['main'])
    end

Warning: a passing test means the bug is present. When fixing the bug, change the test description and the final expectation to match whatever the desired behaviour is decided to be.

What is the current bug behavior?

The third cache.fetch(key) { ... } call returns the stale value ['main'] without invoking its block.

What is the expected correct behavior?

The third cache.fetch(key) { ... } should either invoke the block or return %w[main new].

Possible fixes

Implement optimistic locking by tracking a third value, lock_version, in addition to smembers and exists. This means the multi block in #write would need to be replaced with a Lua script, and possibly also the multi block in #fetch.
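
A minimal sketch of that idea, reusing the cache_key and with helpers shown in #fetch above. The method name fetch_with_lock_version, the :version key suffix and the Lua script are illustrative, not the actual GitLab implementation, and TTL handling is omitted. Note that cache invalidation (the clear issued by PostReceive) would also need to bump the version key for the compare-and-set to reject the stale write described above.

    CAS_WRITE_SCRIPT = <<~LUA
      -- KEYS[1] = set key, KEYS[2] = version key
      -- ARGV[1] = version observed at read time, ARGV[2..] = members to cache
      local current = redis.call('GET', KEYS[2]) or '0'
      if current ~= ARGV[1] then
        return 0 -- someone wrote or invalidated in the meantime: abort the stale write
      end
      redis.call('DEL', KEYS[1])
      for i = 2, #ARGV do
        redis.call('SADD', KEYS[1], ARGV[i])
      end
      redis.call('INCR', KEYS[2])
      return 1
    LUA

    def fetch_with_lock_version(key, &block)
      full_key = cache_key(key)
      version_key = "#{full_key}:version"

      smembers, exists, lock_version = with do |redis|
        redis.multi do |multi|
          multi.smembers(full_key)
          multi.exists?(full_key) # rubocop:disable CodeReuse/ActiveRecord
          multi.get(version_key)
        end
      end

      return smembers if exists

      value = yield

      with do |redis|
        # Compare-and-set: only write if nobody wrote or invalidated the key
        # since we observed lock_version in the MULTI above.
        redis.eval(CAS_WRITE_SCRIPT,
                   keys: [full_key, version_key],
                   argv: [lock_version || '0', *value])
      end

      value
    end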
