Published on: August 4, 2022
6 min read
This is the second in our three-part technical summary of the yearlong project to decompose GitLab's Postgres database.
This blog post is part 2 in a three-part series about decomposing the GitLab backend database. It focuses on the final migration process and highlights the results we achieved after the migration. If you want to read about the design and planning phase, check out part 1.
Early in the project, we thought the migration would need to be "zero downtime" or "near-zero downtime". The plan we came up with involved (in summary):
This approach (assuming the CI replicas were only delayed by a few seconds) would have meant that, at most, there would be a few seconds during which CI writes could fail and return errors or 500s to users. Many of those failures would have been retried automatically, since much of the CI write traffic goes through asynchronous (Sidekiq) jobs that retry on failure.
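To make the idea concrete, here is a minimal sketch of the kind of check such a cutover would hinge on: polling the CI replica until its replication lag drops to a few seconds before switching over. This is illustrative only, not our actual tooling; the connection string and the lag threshold are hypothetical, and it assumes psycopg2 is available.

```python
# Minimal sketch: wait until the CI replica's replication lag is only a few
# seconds before proceeding with the cutover. The DSN and threshold are
# hypothetical; this is illustrative, not GitLab's actual migration tooling.
import time

import psycopg2

MAX_ACCEPTABLE_LAG_SECONDS = 5  # assumption: "delayed by a few seconds"


def replica_lag_seconds(conn) -> float:
    """Seconds since the last transaction was replayed on the standby."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT COALESCE(EXTRACT(EPOCH FROM "
            "(now() - pg_last_xact_replay_timestamp())), 0)"
        )
        return float(cur.fetchone()[0])


def wait_for_ci_replica(dsn: str) -> None:
    conn = psycopg2.connect(dsn)
    try:
        while True:
            lag = replica_lag_seconds(conn)
            if lag <= MAX_ACCEPTABLE_LAG_SECONDS:
                print(f"CI replica lag is {lag:.1f}s -- safe to cut over")
                return
            print(f"CI replica lag is {lag:.1f}s -- waiting")
            time.sleep(1)
    finally:
        conn.close()
```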
In the end we didn't use this approach because:
The migration approach we ended up using required a downtime window of up to two hours. We stopped all GitLab services that could read from or write to the database, and we blocked user-level traffic at the CDN (Cloudflare) so that we could run automated and manual tests before opening traffic back up to users. This approach also allowed us to prepare a slightly more straightforward rollback procedure, which was:
Ultimately, having a simple rollback mechanism proved very useful across our many practice runs on staging.
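For readers who want a more concrete picture of the flow described above, the sketch below strings the cutover and rollback together as a single script. Every step and helper here (the CDN block, the smoke tests, and so on) is a hypothetical logging placeholder standing in for real operational tooling and runbook actions, not the procedure we actually ran.

```python
# Simplified, self-contained sketch of a downtime-based cutover with a simple
# rollback path. All steps are hypothetical placeholders, not GitLab's runbook.

def step(name: str) -> None:
    """Stand-in for a real operational action (runbook step, config change, etc.)."""
    print(f"[cutover] {name}")


def smoke_tests_passed() -> bool:
    """Stand-in for the automated and manual checks run before reopening traffic."""
    return True


def rollback() -> None:
    # Assumption for this sketch: with all writes stopped for the whole window,
    # rolling back is just re-pointing the application at the original cluster.
    step("stop GitLab services")
    step("restore the original database configuration")
    step("start GitLab services")


def run_cutover() -> None:
    step("block user-level traffic at the CDN (Cloudflare)")
    step("stop all GitLab services that read from or write to the database")
    step("switch CI traffic to the new database cluster")
    step("start GitLab services")
    if not smoke_tests_passed():
        rollback()
        return
    step("unblock user-level traffic at the CDN")


if __name__ == "__main__":
    run_cutover()
```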
Before the final migration on GitLab.com, we ran seven rehearsals with rollback, plus one final migration, on our staging environment. In these practice runs, we discovered many small issues that would likely have caused problems in the production environment.
These rehearsals also gave all the participants an opportunity to perfect their steps in the process to minimize delays in our production rollout. This practice ultimately allowed us to be quite confident in our timeline of at most two hours of downtime.
In the end, we finished the migration in 93 minutes, with a few small delays caused by surprises we did not see in staging.
The rehearsal process was very time-consuming and took considerable effort to execute in the context of GitLab, where we all work asynchronously and across different timezones. However, it proved essential to the success of this project.
One week before the final migration on production, we prepared a production readiness review issue for final approval from executives. This was a good opportunity to highlight all the preparation and validation we'd done to build confidence in the plan. It also pushed us to do extra validation in areas where we expected questions or concerns about the plan.
Some highlights from this review included:
After we completed the rollout, we tracked performance across the metrics we expected to improve.
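As an aside, a before-and-after comparison like this can be scripted against a Prometheus-compatible monitoring stack; the sketch below shows one way to do it. The endpoint, the metric name, and the timestamps are hypothetical placeholders, not the actual dashboards or figures behind our results.

```python
# Illustrative sketch: evaluate the same latency query before and after the
# migration via the Prometheus HTTP API. URL, metric name, and timestamps are
# hypothetical placeholders.
import requests

PROMETHEUS_URL = "http://prometheus.example.com"
QUERY = (
    "histogram_quantile(0.99, "
    "sum(rate(gitlab_sql_duration_seconds_bucket[1h])) by (le))"
)


def p99_at(timestamp: str) -> float:
    """Evaluate QUERY at a single point in time using an instant query."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": QUERY, "time": timestamp},
        timeout=10,
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1])


before = p99_at("2022-07-01T12:00:00Z")  # arbitrary point before the migration
after = p99_at("2022-08-01T12:00:00Z")   # arbitrary point after the migration
print(f"p99 SQL duration: {before:.3f}s before vs {after:.3f}s after")
```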
The data showed:
You can read more about some interesting technical challenges and surprises we had to deal with along the way in part 3.
Enjoyed reading this blog post or have questions or feedback? Share your thoughts by creating a new topic in the GitLab community forum.