Repository pull mirror scheduling design
This is a design proposal to make the way we schedule pull mirror updates more efficient and scalable.
The summary is:
The current scheduling strategy is complex because:
- we use a batch-style cronjob, but actually need a higher frequency than once per minute
- there's no feedback loop when jobs finish, making it harder to use resources efficiently.
The proposal here is to use a simple Postgres-based queuing solution.
Background / Motivation
GitLab.com has about 110,000 mirrors set up. We aim to update each of these mirrors at least every 30 minutes. Additionally, we allow users to individually request an immediate update of their mirrors.
We are currently facing issues with Redis (https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7157) and with delays in pull mirroring specifically (https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7173).
Quoting from the root-cause analysis in https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7173:
- Why? New mirrors were becoming "ready to update" at a higher rate than we were updating mirrors that were already ready to update.
- Why? We were updating mirrors at a lower rate than our pullmirror nodes are capable of.
- Why? There were too few RepositoryUpdateMirrorWorker jobs in the repository_update_mirror queue, leaving pullmirror nodes sitting idly by when they could be updating mirrors.
- Why? UpdateAllMirrorsWorker couldn't schedule mirror updates fast enough to keep up with pullmirror nodes becoming ready for new work.
- Why? The LPUSH Redis operation used to push new jobs onto the queue took significantly more time to complete than it would and has under healthier Redis conditions.
We were not able to schedule mirror updates fast enough to keep up with mirrors becoming eligible for updates.
Current design
There is an awesome deep-dive video over here: https://www.youtube.com/watch?v=sSZq0fpdY-Y
With the current implementation, there are three types of jobs:
- UpdateAllMirrorsWorker
- ProjectImportScheduleWorker
- RepositoryUpdateMirrorWorker
Simplified, UpdateAllMirrorsWorker is a Sidekiq cronjob run every minute. It picks the next batch of jobs and enqueues them in Redis (as ProjectImportScheduleWorker jobs). It waits until the current batch has completed its jobs, effectively waiting for the slowest job of the batch (background). It then re-schedules itself eagerly, so that the scheduling job actually runs at a higher frequency than once per minute; this is necessary because the minimum cronjob frequency for Sidekiq is one minute.
The ProjectImportScheduleWorker job basically creates a state machine record (ProjectImportState) to follow the job's execution. It then schedules a RepositoryUpdateMirrorWorker job, which performs the actual work of updating the mirror.
The goal of UpdateAllMirrorsWorker is to schedule enough jobs that the Sidekiq workers always have something to work on. At the same time, it tries to keep the queue short, because we need to be able to respond to "Update now!" requests from users and work on those jobs first.
The scheduling logic in UpdateAllMirrorsWorker is quite complex. I believe this is because of an impedance mismatch: on the worker side, we want to maximize resource utilization and always keep workers busy. However, there's no feedback loop when a job has finished. It is up to the cronjob-style scheduling job to determine whether or not new jobs should be enqueued. This is a batch job which runs at fixed intervals, whereas we should be taking these decisions in a streaming fashion in order to avoid idle resources.
With >110k mirrors configured currently, we end up running >220k Sidekiq jobs (a ProjectImportScheduleWorker and a RepositoryUpdateMirrorWorker job per mirror), plus the UpdateAllMirrorsWorker runs, per 30 minutes. That translates to more than 120 Sidekiq jobs per second.
Abstract model
We have a large number (>110k) of jobs we want to execute at least every 30 minutes. This translates to an average rate of about 60 jobs per second (110,000 jobs / 1,800 s ≈ 61). In addition to regularly scheduled updates, users may request immediate updates for individual repositories ("Update now").
We can model this as a priority queue Q: a job has a desired next execution timestamp, and jobs are queued in order of that timestamp. A pop(Q) retrieves the next job to be executed and removes it from the queue. Once the job has finished, it is rescheduled to run in 30 minutes and put back into the queue.
We employ n worker processes to consume from this queue and run jobs.
The goals for this model are:
- Retrieving the next job from the queue with pop(Q) needs to be efficient, in O(1).
- In order to use resources efficiently, each of the n processes should always have work to do (unless the queue is empty).
- Ability to scale out, in order to provide updates to all mirrors every 30 minutes (or less).
Proposed design
The proposal here is to not use Sidekiq to handle these recurring jobs. Instead, we run n worker processes which consume from a Postgres-based queue implementation. There is no need for a centralized scheduler, as the workers coordinate amongst themselves without much overhead.
Queue implementation
In order to implement a queue in Postgres, the FOR UPDATE SKIP LOCKED pattern can be used to implement pop(Q) efficiently. Let's illustrate with this example:
CREATE TABLE jobs (id serial primary key, scheduled_for timestamp with time zone not null);
CREATE INDEX ON jobs (scheduled_for, id);
Jobs would reside in the jobs table with a timestamp scheduled_for. Once the current time has passed this timestamp, the job is eligible to be executed.
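For illustration, this is how jobs would enter the queue (a sketch; a real table would also carry a reference to the project):

-- enqueue a job that is due immediately (e.g. a newly configured mirror)
INSERT INTO jobs (scheduled_for) VALUES (NOW());
-- enqueue a job for a later point in time
INSERT INTO jobs (scheduled_for) VALUES (NOW() + INTERVAL '30 minutes');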
In order to retrieve the next job from the queue (pop(Q)), we do the following:
BEGIN;
-- pop(Q): take the next eligible job, skipping rows locked by other workers
SELECT * FROM jobs WHERE scheduled_for <= NOW() ORDER BY scheduled_for, id LIMIT 1 FOR UPDATE SKIP LOCKED;
-- execute job's work
-- reschedule: put the job back into the queue, due in 30 minutes
UPDATE jobs SET scheduled_for = NOW() + INTERVAL '30 minutes' WHERE id = ...;
COMMIT;
With an index in place, retrieving the next job is in O(1) (or rather in the order of the number of currently executing jobs, because we scan the index until we find the first record that hasn't been locked yet). More work needs to be done on the database side when the job is rescheduled: this requires an index update in O(log N) (N: number of jobs) and an update to the record itself (which produces dead tuples).
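We can sanity-check this with EXPLAIN: with the composite index in place, the planner should walk the index in order and stop at the first unlocked row. The index name below is Postgres' default naming and the plan shape is an assumption; actual output varies by version and statistics:

EXPLAIN SELECT * FROM jobs WHERE scheduled_for <= NOW() ORDER BY scheduled_for, id LIMIT 1 FOR UPDATE SKIP LOCKED;
-- expected plan shape (illustrative):
--  Limit
--    ->  LockRows
--          ->  Index Scan using jobs_scheduled_for_id_idx on jobs
--                Index Cond: (scheduled_for <= now())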
In order to limit transaction length, the job's work should be done outside the transaction, but I'll gloss over that for now (see the sketch below).
After the job has finished, we update the scheduled_for timestamp to the next desired execution timestamp, which effectively puts the job back into the queue.
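One way to keep the transaction short, sketched below under the assumption that a lease-style approach is acceptable: claim the job in a brief transaction by pushing its scheduled_for out by a lease interval, run the mirror update outside any transaction, then reschedule on success. The 5-minute lease is an illustrative value; if a worker crashes mid-job, the job simply becomes eligible again once the lease expires.

-- 1. claim: a short transaction that leases the next eligible job
BEGIN;
SELECT id FROM jobs WHERE scheduled_for <= NOW() ORDER BY scheduled_for, id LIMIT 1 FOR UPDATE SKIP LOCKED;
UPDATE jobs SET scheduled_for = NOW() + INTERVAL '5 minutes' WHERE id = ...; -- lease
COMMIT;
-- 2. run the mirror update outside any transaction
-- 3. on success, reschedule the regular update
UPDATE jobs SET scheduled_for = NOW() + INTERVAL '30 minutes' WHERE id = ...;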
Worker efficiency, scaling out, robustness
Since there's no batch scheduling anymore, a worker with free capacity goes directly to the queue and picks the next pending job. Unless the queue is empty, a worker never sits idle waiting for a job to be scheduled.
In order to scale out, we'd add more worker processes.
Update now, priorities
The "Update now!" functionality can be implemented by updating the scheduled_for
timestamp and set that to the current time for the particular project. However, the concept can easily be extended to support priorities, too: Add a priority integer
column and also add that to the index. pop(Q)
now orders by priority, scheduled_for, id
. By default, priority
is 0. Any lower priority values would directly promote the job and it gets executed before all other jobs with lower priority.
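A sketch of that extension (the priority value of -1 for promoted jobs is an illustrative assumption):

ALTER TABLE jobs ADD COLUMN priority integer NOT NULL DEFAULT 0;
CREATE INDEX ON jobs (priority, scheduled_for, id);
-- pop(Q) now orders by priority first
SELECT * FROM jobs WHERE scheduled_for <= NOW() ORDER BY priority, scheduled_for, id LIMIT 1 FOR UPDATE SKIP LOCKED;
-- "Update now!": promote the project's job and make it eligible immediately
UPDATE jobs SET priority = -1, scheduled_for = NOW() WHERE id = ...;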
Discussion
- No need for an explicit job to schedule mirror update jobs.
- Strong consistency with respect to job status - no need to synchronize state across Postgres and Redis.
- Reduced complexity for job handling/scheduling.
- It is easier to make sure we are using resources efficiently.
- We would have to ship a mechanism to install a long-running process (runit) for the worker processes.
For small(er) installations, this still works the same way. When there are no pending mirror updates, the worker simply sleeps.
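How long to sleep is not prescribed here; one simple option (an assumption, not part of the proposal) is to poll at a low frequency, or to peek at the earliest upcoming job and sleep until then:

-- when pop(Q) returns nothing, sleep until the next job becomes eligible
SELECT MIN(scheduled_for) FROM jobs;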