Backup restore rake task isn't handling read-only errors

We have multiple reports from customers who hit problems when restoring a backup to a Gitaly + Praefect cluster.

During the restore task, multiple repositories fail to restore, raising the following error:

[Failed] restoring group/repo (@hashed/eb/0c/eb0c9cdcl33tl33tr3d4c73d798bae1162fe27f18d482c)

Error 9:repository is in read-only mode. debug_error_string:{"created":"@1616683445.784983667","description":"Error received from peer ipv4:x.x.x.x:2305","file":"src/core/lib/surface/call.cc","file_line":1055,"grpc_message":"repository is in read-only mode","grpc_status":9}

All of these errors are raised by Praefect, and all by the same RPC: CreateRepositoryFromBundle. The end result is an instance with a number of missing repositories.

Why the repositories go into read-only mode during the restore task is being investigated in a separate issue. Regardless of the cause, the restore rake task itself should be robust enough to deal with such failures, either by handling the error or by implementing a retry mechanism, so that it does not leave an incomplete end result.
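
As a rough illustration of the retry idea, here is a minimal Ruby sketch. It is not the actual backup code: `ReadOnlyRepositoryError`, `with_read_only_retry`, and `restore_all` are hypothetical names, and the error class is a stand-in for whatever the gRPC client raises for status 9 ("repository is in read-only mode"). The sketch retries each repository a few times with a growing delay, and collects the repositories that still fail so they can be reported at the end instead of being silently lost.

```ruby
# Hypothetical stand-in for the gRPC error with status 9
# ("repository is in read-only mode"), so the sketch runs without the grpc gem.
class ReadOnlyRepositoryError < StandardError; end

# Runs the block, retrying when it raises ReadOnlyRepositoryError.
# The delay doubles after each failed attempt (base_delay, 2x, 4x, ...).
# Re-raises the error once max_attempts is reached.
# `sleeper` is injectable so the waiting can be stubbed out.
def with_read_only_retry(max_attempts: 3, base_delay: 0.5, sleeper: ->(s) { sleep(s) })
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue ReadOnlyRepositoryError
    raise if attempts >= max_attempts

    sleeper.call(base_delay * (2**(attempts - 1)))
    retry
  end
end

# Calls the given block for every repository, with retries.
# Returns a list of [repository, error message] pairs for the
# repositories that could not be restored after all attempts.
def restore_all(repositories, **retry_opts)
  failed = []
  repositories.each do |repo|
    begin
      with_read_only_retry(**retry_opts) { |attempt| yield repo, attempt }
    rescue ReadOnlyRepositoryError => e
      failed << [repo, e.message]
    end
  end
  failed
end
```

In a real task, the block passed to `restore_all` would be the call that ends up issuing CreateRepositoryFromBundle, and a non-empty result could be used to print a summary and exit with a non-zero status. Whether retrying is enough depends on why the repository is read-only, which is outside the scope of this sketch.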

Workaround

Restore to a Praefect instance with a "clean" database. Do not do this on a functional production cluster. #3546 (comment 546966263)

Possible Solutions

Investigate clearing the Praefect DB from the restore task.
#3485 (closed) would address the root cause as a long-term solution. #3546 (comment 552349407)

Edited Apr 22, 2021 by Nick Nguyen