diff --git a/general/post_deploy_migration/readme.md b/general/post_deploy_migration/readme.md
index bc5cef00e6360cb002a4240acf70e1c2e839ab7f..b3067cb936d1f8e61b173ab86f338798a60a54cc 100644
--- a/general/post_deploy_migration/readme.md
+++ b/general/post_deploy_migration/readme.md
@@ -80,6 +80,36 @@ shift.
 Take a look at the [Release manager requesting support guide] for more details on getting
 EOC and dev on-call involved.
 
+#### Check if the failure is due to conflict with an autovacuum process
+
+Check the job logs of the failed migration job. Search for the following text:
+```
+PG::LockNotAvailable: ERROR: canceling statement due to lock timeout
+```
+
+This can be caused, though not always, by an autovacuum process running on the same table
+at the same time. release-tools checks for running autovacuum processes and terminates
+the Post Deploy Migration pipeline if any are found. The conflict can still
+happen if the autovacuum process starts after release-tools performs the check but before the
+migrations complete.
+
+You can check for a running autovacuum process at the following link:
+[Thanos query](https://thanos-query.ops.gitlab.net/graph?g0.expr=pg_stat_activity_autovacuum_age_in_seconds%7Btype%3D%22patroni-ci%22%2C%20relname%3D~%22.*wraparound.*%22%7D&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D)
+
+If an autovacuum process is currently running, the graph will show a line that touches the right
+edge (current time) of the graph.
+
+If an autovacuum process is running at the current time, make a note of the `relname` label value. The `relname`
+value is the table on which the autovacuum process is running. Check if the failing migration is attempting to obtain a
+lock on the same table. If you are unsure about your findings, you can ping `#g_database` on Slack and ask someone to confirm.
+
+If the autovacuum process is running on the same table as the failing migration, you have two options:
+1. Wait for the autovacuum to complete. Autovacuum can take a varying amount of time on different tables.
+You can get a rough estimate by looking at how long the autovacuum process took on the same table in the last few weeks.
+For example, the autovacuum process on the `ci_pipelines` table usually took 2.5 hours at the time of writing.
+2. If the autovacuum process is taking too long, or you cannot wait for any reason, you can move to [Next steps](#next-steps)
+and determine if the failure should block deployments.
+
 #### Finding information about the post migration
 
 In order to provide the EOC, developers and DBREs the information the need to