Potential Praefect single point of failure with its database connection?
While performance testing Gitaly Cluster and Praefect as part of the Reference Architectures, I noticed what could be a single point of failure in the currently recommended approach.
Specifically, the main connection between Praefect and its database does not appear to be fault tolerant (this is the main connection, not the additional one recently added for caching).
The current docs detail two ways to connect Praefect to its database - directly, or via PgBouncer. As far as I understand, both are a single point of failure: if the specific Postgres node Praefect is configured to use (either directly or behind PgBouncer) goes down, Praefect loses its database connection.
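For illustration, here's roughly what the connection settings look like in `/etc/gitlab/gitlab.rb` on the Praefect node (values are placeholders, key names as I understand them from the current docs):

```ruby
# Praefect is pointed at one fixed address. If that node (a Postgres node
# directly, or the PgBouncer in front of it) goes down, Praefect loses its
# database connection - nothing updates this address automatically.
praefect['database_host'] = '10.0.0.10'   # single static address
praefect['database_port'] = 5432          # or PgBouncer's port, e.g. 6432
praefect['database_user'] = 'praefect'
praefect['database_password'] = 'PRAEFECT_SQL_PASSWORD'
praefect['database_dbname'] = 'praefect_production'
```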
For the main GitLab database, automated failover is provided by a combination of PgBouncer and Consul, with the latter automatically updating PgBouncer if Postgres goes down:
> Consul agent - Watches the status of the PostgreSQL service definition on the Consul cluster. If that status changes, Consul runs a script which updates the PgBouncer configuration to point to the new PostgreSQL master node and reloads the PgBouncer service.
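For reference, the Omnibus configuration that wires this up for the main database on a PgBouncer node looks roughly like this (simplified sketch; see the HA docs for the full setup):

```ruby
# /etc/gitlab/gitlab.rb on a PgBouncer node for the main GitLab database.
pgbouncer['enable'] = true
pgbouncer['databases'] = {
  gitlabhq_production: {
    host: '127.0.0.1',               # rewritten by the Consul failover script
    user: 'pgbouncer',
    password: 'PGBOUNCER_PASSWORD_HASH'
  }
}

# The Consul agent watches the PostgreSQL service definition; on failover it
# runs a script that points PgBouncer at the new primary and reloads it.
consul['enable'] = true
consul['watchers'] = %w(postgresql)
```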
This isn't currently documented for Praefect, so it's unknown whether it's supported or how to achieve it with the Omnibus-bundled PgBouncer service (something most customers will want as a convenience).
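If the same mechanism could be applied to Praefect, I'd imagine the configuration would look something like the sketch below. To be clear, this is hypothetical - whether the Omnibus failover script would handle a `praefect_production` entry is exactly what's undocumented:

```ruby
# HYPOTHETICAL - not confirmed by the docs. A Consul-watched PgBouncer entry
# for the Praefect database, mirroring the gitlabhq_production setup:
pgbouncer['databases'] = {
  praefect_production: {
    host: '127.0.0.1',   # would need rewriting on failover, like the main DB
    user: 'pgbouncer',
    password: 'PGBOUNCER_PASSWORD_HASH'
  }
}

# Praefect would then point at the PgBouncer address rather than a fixed
# Postgres node:
praefect['database_host'] = 'PGBOUNCER_ADDRESS'
praefect['database_port'] = 6432
```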