Gitlab::RepositorySetCache#fetch is not atomic
Everyone can contribute. Help move this issue forward while earning points, leveling up and collecting rewards.
Summary
Gitlab::RepositorySetCache#fetch
is currently implemented as follows:
def fetch(key, &block)
full_key = cache_key(key)
smembers, exists = with do |redis|
redis.multi do |multi|
multi.smembers(full_key)
multi.exists?(full_key) # rubocop:disable CodeReuse/ActiveRecord
end
end
return smembers if exists
write(key, yield)
end
I.e.
- Atomic read and existence check
- Early return if key existed
- A possibly slow
yield
to a block, calculating a value - A write (which is atomic)
This means that two concurrent invocations of #fetch
can both read exists == false
and then race to write a result.
This can result in stale reads and false negatives for checks such as branch_names_include?
. Example, where X
is be an arbitrary web request or sidekiq job invoking the cached #branch_names
method:
-
PostReceive 1
clears thebranch_names
cache -
X
invokesbranch_names
, and readsexists == false
, but theyield
is slow -
PostReceive 1
writes thebranch_names
cache -
PostReceive 2
clears thebranch_names
cache for the push to a new branch -
X
finally completes theyield
and writes the branch names cache -
PostReceive 2
's invocation ofbranch_names
getssmember == false
andexists == true
in its#fetch
, i.e. a false negative
Steps to reproduce
Injecting the following rspec
test under describe #fetch
in spec/lib/gitlab/repository_set_cache_spec.rb
reproduces the bug:
it 'is atomic' do
key = 'race'
fiber_1 = Fiber.new do
cache.fetch(key) do
Fiber.yield
['main']
end
end
fiber_2 = Fiber.new do
cache.fetch(key) do
%w[main new]
end
end
# Start fiber_1, which will yield before writing
fiber_1.resume
# Run fiber_2 to completion
result_2 = fiber_2.resume
# Resume fiber_1 to complete its execution
result_1 = fiber_1.resume
# Both fibers should have different results, demonstrating the race condition
expect(result_1).to eq(['main'])
expect(result_2).to eq(%w[main new])
# The cache now contains the result from fiber_1, which is stale.
expect(cache.fetch(key) { ['something'] }).to eq(['main'])
end
Warning: A passing test means the bug is present. When fixing the bug, change the description and final expectation to whatever the decided desired behaviour is.
Example Project
What is the current bug behavior?
The third cache.fetch(key)
the stale value ['main']
What is the expected correct behavior?
The third cache.fetch(key) { ... }
should either invoke the block or return %w[main new]
.
Relevant logs and/or screenshots
Output of checks
Results of GitLab environment info
Expand for output related to GitLab environment info
(For installations with omnibus-gitlab package run and paste the output of: `sudo gitlab-rake gitlab:env:info`) (For installations from source run and paste the output of: `sudo -u git -H bundle exec rake gitlab:env:info RAILS_ENV=production`)
Results of GitLab application Check
Expand for output related to the GitLab application check
(For installations with omnibus-gitlab package run and paste the output of:
sudo gitlab-rake gitlab:check SANITIZE=true
)(For installations from source run and paste the output of:
sudo -u git -H bundle exec rake gitlab:check RAILS_ENV=production SANITIZE=true
)(we will only investigate if the tests are passing)
Possible fixes
Implement optimistic locking using a third value, lock_version
, in addition to smember
and exists
. This means the multi
block in #write
would need to be replaced with a lua script, and possibly also the multi
block in #fetch
.