Repository pull mirror scheduling design
This is a design proposal to make the way we schedule pull mirror updates more efficient and scalable.
The summary is:
The current scheduling strategy is complex because:
- we use a batch-style cronjob, but actually need a higher frequency than once per minute
- there's no feedback loop when jobs finish, making it harder to use resources efficiently.
The proposal here is to use a simple Postgres-based queuing solution.
Background / Motivation
GitLab.com has about 110,000 mirrors set up. We aim to update each of these mirrors at least every 30 minutes. Additionally, we allow users to individually request an immediate update of their mirrors.
We are currently facing issues with Redis (https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7157) and with delays in pull mirroring specifically (https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7173).
Quoting from the root-cause analysis in https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7173:
- Why? New mirrors were becoming "ready to update" at a higher rate than we were updating mirrors that were already ready to update.
- Why? We were updating mirrors at a lower rate than our pullmirror nodes are capable of.
- Why? There were too few RepositoryUpdateMirrorWorker jobs in the repository_update_mirror queue, leaving pullmirror nodes sitting idly by when they could be updating mirrors.
- Why? UpdateAllMirrorsWorker couldn't schedule mirror updates fast enough to keep up with pullmirror nodes becoming ready for new work.
- Why? The LPUSH Redis operation used to push new jobs onto the queue took significantly more time to complete than it would and has under healthier Redis conditions.
We were not able to schedule mirror updates fast enough to keep up with mirrors becoming eligible for updates.
Current design
There is an awesome deep-dive video over here: https://www.youtube.com/watch?v=sSZq0fpdY-Y
With the current implementation, there are three types of jobs:
- UpdateAllMirrorsWorker
- ProjectImportScheduleWorker
- RepositoryUpdateMirrorWorker
Simplified, UpdateAllMirrorsWorker is a Sidekiq cronjob run every minute. It picks the next batch of jobs and enqueues them in Redis (as ProjectImportScheduleWorker jobs). It waits until the current batch has completed its jobs, effectively waiting for the slowest job of the batch (background). It then re-schedules itself eagerly, so that the scheduling job actually runs at a higher frequency than once per minute; this is necessary because the minimum cronjob frequency for Sidekiq is one minute.
The ProjectImportScheduleWorker job basically creates a state machine record (ProjectImportState) to follow the job's execution. It then schedules a RepositoryUpdateMirrorWorker job, which performs the actual work of updating the mirror.
The goal of UpdateAllMirrorsWorker is to schedule enough jobs that the Sidekiq workers always have something to work on. At the same time, it tries to keep the queue short, because we need to be able to respond to "Update now!" requests from users and work on those jobs first.
The scheduling logic in UpdateAllMirrorsWorker is quite complex. I believe this is because of an impedance mismatch: on the worker side, we want to maximize resource utilization and always keep workers busy. However, there's no feedback loop when a job has finished. It is up to the cronjob-style scheduling job to determine whether or not new jobs should be enqueued. This is a batch job which runs at fixed intervals, whereas we should be taking these decisions in a streaming fashion in order to avoid idle resources.
With >110k mirrors configured currently, we end up running >220k Sidekiq jobs (a ProjectImportScheduleWorker and a RepositoryUpdateMirrorWorker job per mirror), plus the UpdateAllMirrorsWorker runs, per 30 minutes. That translates to more than 120 Sidekiq jobs per second.
Abstract model
We have a large number (>110k) of jobs we want to execute at least every 30 minutes. This translates to an average rate of about 60 jobs per second (110,000 jobs / 1,800 s ≈ 61). In addition to regularly scheduled updates, users may request immediate updates for individual repositories ("Update now").
We can model this as a priority queue Q: a job has a desired next execution timestamp, and jobs are queued in order of that timestamp. A pop(Q) retrieves the next job to be executed and removes it from the queue. Once the job has finished, it is rescheduled to run in 30 minutes and put back into the queue.
We employ n worker processes to consume from this queue and run jobs.
The goals for this model are:
- Retrieving the next job from the queue with pop(Q) needs to be efficient, in O(1).
- In order to use resources efficiently, each of the n processes should always have work to do (unless the queue is empty).
- Ability to scale out, in order to provide updates to all mirrors every 30 minutes (or less).
Proposed design
The proposal here is to not use Sidekiq to handle these recurring jobs. Instead, we run n worker processes which consume from a Postgres-based queue implementation. There is no need for a centralized scheduler, as the workers coordinate amongst themselves without much overhead.
Queue implementation
In order to implement a queue in Postgres, the FOR UPDATE SKIP LOCKED pattern can be used to implement pop(Q) efficiently. Let's illustrate with this example:
CREATE TABLE jobs (id serial primary key, scheduled_for timestamp with time zone not null);
CREATE INDEX ON jobs (scheduled_for, id);
Jobs would reside in the jobs table with a timestamp scheduled_for. Once the current time has passed this timestamp, the job is eligible to be executed.
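For illustration, this is how jobs would enter the queue (a sketch; a real table would also carry a reference to the project):

-- enqueue a job that is due immediately (e.g. a newly configured mirror)
INSERT INTO jobs (scheduled_for) VALUES (NOW());
-- enqueue a job for a later point in time
INSERT INTO jobs (scheduled_for) VALUES (NOW() + INTERVAL '30 minutes');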
In order to retrieve the next job from the queue (pop(Q)), we do the following:
BEGIN;
-- pop(Q): take the next eligible job, skipping rows locked by other workers
SELECT * FROM jobs WHERE scheduled_for <= NOW() ORDER BY scheduled_for, id LIMIT 1 FOR UPDATE SKIP LOCKED;
-- execute job's work
-- reschedule: put the job back into the queue, due in 30 minutes
UPDATE jobs SET scheduled_for = NOW() + INTERVAL '30 minutes' WHERE id = ...;
COMMIT;
With an index in place, retrieving the next job is in O(1) (or rather in the order of the number of currently executing jobs, because we scan the index until we find the first record that hasn't been locked yet). More work needs to be done on the database side when the job is rescheduled: this requires an index update in O(log N) (N: number of jobs) and an update to the record itself (which produces dead tuples).
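We can sanity-check this with EXPLAIN: with the composite index in place, the planner should walk the index in order and stop at the first unlocked row. The index name below is Postgres' default naming and the plan shape is an assumption; actual output varies by version and statistics:

EXPLAIN SELECT * FROM jobs WHERE scheduled_for <= NOW() ORDER BY scheduled_for, id LIMIT 1 FOR UPDATE SKIP LOCKED;
-- expected plan shape (illustrative):
--  Limit
--    ->  LockRows
--          ->  Index Scan using jobs_scheduled_for_id_idx on jobs
--                Index Cond: (scheduled_for <= now())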
In order to limit transaction length, the job's work should be done outside the transaction, but I'll gloss over that for now (see the sketch below).
After the job has finished, we update the scheduled_for timestamp to the next desired execution timestamp, which effectively puts the job back into the queue.
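One way to keep the transaction short, sketched below under the assumption that a lease-style approach is acceptable: claim the job in a brief transaction by pushing its scheduled_for out by a lease interval, run the mirror update outside any transaction, then reschedule on success. The 5-minute lease is an illustrative value; if a worker crashes mid-job, the job simply becomes eligible again once the lease expires.

-- 1. claim: a short transaction that leases the next eligible job
BEGIN;
SELECT id FROM jobs WHERE scheduled_for <= NOW() ORDER BY scheduled_for, id LIMIT 1 FOR UPDATE SKIP LOCKED;
UPDATE jobs SET scheduled_for = NOW() + INTERVAL '5 minutes' WHERE id = ...; -- lease
COMMIT;
-- 2. run the mirror update outside any transaction
-- 3. on success, reschedule the regular update
UPDATE jobs SET scheduled_for = NOW() + INTERVAL '30 minutes' WHERE id = ...;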
Worker efficiency, scaling out, robustness
Since there's no batch scheduling anymore, a worker with free capacity goes directly to the queue and picks the next pending job. Unless the queue is empty, a worker never sits idle waiting for a job to be scheduled.
In order to scale out, we'd add more worker processes.
Update now, priorities
The "Update now!" functionality can be implemented by updating the scheduled_for
timestamp and set that to the current time for the particular project. However, the concept can easily be extended to support priorities, too: Add a priority integer
column and also add that to the index. pop(Q)
now orders by priority, scheduled_for, id
. By default, priority
is 0. Any lower priority values would directly promote the job and it gets executed before all other jobs with lower priority.
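A sketch of that extension (the priority value of -1 for promoted jobs is an illustrative assumption):

ALTER TABLE jobs ADD COLUMN priority integer NOT NULL DEFAULT 0;
CREATE INDEX ON jobs (priority, scheduled_for, id);
-- pop(Q) now orders by priority first
SELECT * FROM jobs WHERE scheduled_for <= NOW() ORDER BY priority, scheduled_for, id LIMIT 1 FOR UPDATE SKIP LOCKED;
-- "Update now!": promote the project's job and make it eligible immediately
UPDATE jobs SET priority = -1, scheduled_for = NOW() WHERE id = ...;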
Discussion
- No need for an explicit job to schedule mirror update jobs.
- Strong consistency with respect to job status - no need to synchronize state across Postgres and Redis.
- Reduced complexity for job handling/scheduling.
- It is easier to make sure we are using resources efficiently.
- We would have to ship a mechanism to install a long-running process (runit) for the worker processes.
For small(er) installations, this still works the same way. When there are no pending mirror updates, the worker simply sleeps.
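How long to sleep is not prescribed here; one simple option (an assumption, not part of the proposal) is to poll at a low frequency, or to peek at the earliest upcoming job and sleep until then:

-- when pop(Q) returns nothing, sleep until the next job becomes eligible
SELECT MIN(scheduled_for) FROM jobs;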