
20251118:05 - Explore Barman for PostgreSQL Backup Strategy

Context

Related to customer ticket Zen: 672154, where backup operations on a PostgreSQL replica caused production issues (artifact uploads failing with 403 errors).

Objective

Evaluate and implement Barman (Backup and Recovery Manager) as a PostgreSQL backup solution for bare metal GitLab Omnibus installations.

Current Situation

  • Infrastructure: Bare metal servers with local SSD storage
  • PostgreSQL Setup: 5 replicas in an HA configuration is the eventual target; the initial rollout starts with a single server
  • Requirements: Disaster Recovery with monthly recovery testing

Why Barman?

  • Fills a gap: there is currently no backup solution for self-managed databases larger than 100 GB
  • Designed for bare metal PostgreSQL deployments
  • Continuous archiving without pausing replication
  • Point-in-time recovery (PITR) capabilities
  • Minimal impact on production
  • Built-in recovery testing features
  • Incremental backups

There's a great overview of Barman in this 35-minute video, which covers good general practices but does not go into detail on configuration.

Why not just use Geo? Geo is explicitly NOT a backup solution. From the Gitaly capabilities documentation: "Geo is not intended to replace other backup/restore solutions. Because of replication lag and the possibility of replicating bad data from a primary, customers should also take regular backups of their primary site and test the restore process."

Implementation Plan

1. Prerequisites

  • A dedicated backup server, or one of the 5 replicas designated for the role
  • Sufficient storage for 30-day retention + WAL archives
  • Network connectivity from primary to Barman server
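
A rough way to size the storage requirement above, as a sketch only. It assumes full, uncompressed base backups on a fixed interval and a measured daily WAL volume; the function name, the weekly default and the example figures are illustrative assumptions, not measurements from this environment.

```shell
#!/usr/bin/env bash
# estimate_barman_storage_gb DB_GB WAL_GB_PER_DAY [RETENTION_DAYS] [BACKUP_INTERVAL_DAYS]
# Hypothetical helper: rough upper bound in GB, ignoring compression.
estimate_barman_storage_gb() {
  local db_gb="$1" wal_gb_per_day="$2" retention="${3:-30}" interval="${4:-7}"
  # Backups kept: enough to cover the recovery window, plus one older
  # backup that anchors recovery to the start of the window.
  local backups=$(( (retention + interval - 1) / interval + 1 ))
  # WAL is kept from the oldest retained backup onward.
  echo $(( backups * db_gb + wal_gb_per_day * (retention + interval) ))
}
```

For example, estimate_barman_storage_gb 200 10 gives 1570 for a 200 GB database producing 10 GB of WAL per day with weekly backups and 30-day retention.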

2. User Setup

Create two users for Barman operations:

Streaming replication user (auto-created via gitlab.rb):

postgresql['sql_replication_user'] = "barman_streaming"
postgresql['sql_replication_password'] = "<secure-password>"

Important: after running gitlab-ctl reconfigure, reset the password manually, because the automatically created role may not have its password set correctly:

sudo gitlab-psql -d postgres -c "ALTER ROLE barman_streaming WITH PASSWORD '<secure-password>';"

Superuser (manual creation required):

sudo gitlab-psql -d gitlabhq_production
CREATE ROLE barman LOGIN PASSWORD '<secure-password>' SUPERUSER VALID UNTIL 'infinity';
\q

3. Primary PostgreSQL Configuration

Critical: GitLab's bundled PostgreSQL defaults to Unix sockets only. Network listening must be enabled for Barman.

# /etc/gitlab/gitlab.rb

# Enable network listening for Barman (REQUIRED)
postgresql['listen_address'] = '0.0.0.0'
postgresql['port'] = 5432

# Replication settings
postgresql['wal_level'] = "replica"
postgresql['max_wal_senders'] = 10
postgresql['max_replication_slots'] = 5

# Archive mode must be enabled (but we use streaming, not archive_command)
postgresql['archive_mode'] = 'on'
postgresql['archive_command'] = '/bin/true'  # No-op command

# Streaming replication user
postgresql['sql_replication_user'] = "barman_streaming"
postgresql['sql_replication_password'] = "<secure-password>"

# Access control for Barman
# IMPORTANT: Use scram-sha-256, not md5 (matches PostgreSQL password storage)
postgresql['custom_pg_hba_entries'] = {
  'BARMAN': [
    {
      type: 'host',
      database: 'all',
      user: 'barman',
      cidr: '<BARMAN_SERVER_IP>/32',
      method: 'scram-sha-256'
    },
    {
      type: 'host',
      database: 'replication',
      user: 'barman_streaming',
      cidr: '<BARMAN_SERVER_IP>/32',
      method: 'scram-sha-256'
    }
  ]
}

Two-step reconfigure process (required when enabling network listening):

# Stop all services
sudo gitlab-ctl stop

# Start only PostgreSQL with new config
sudo gitlab-ctl start postgresql

# Verify it's listening on network
sudo ss -tlnp | grep 5432
# Should show: LISTEN 0.0.0.0:5432

# Now reconfigure can succeed
sudo gitlab-ctl reconfigure

# Start remaining services
sudo gitlab-ctl start

4. Barman Server Setup

# Install Barman
sudo apt-get install barman postgresql-client

# Configure Barman
# /etc/barman.d/<SERVER_NAME>.conf (e.g., gitlab-gl3.conf)
[<SERVER_NAME>]
description = "GitLab Production Database"
conninfo = host=<GITLAB_SERVER_IP> user=barman password=<barman-password> dbname=gitlabhq_production
streaming_conninfo = host=<GITLAB_SERVER_IP> user=barman_streaming password=<streaming-password>
backup_method = postgres
streaming_archiver = on
slot_name = barman
retention_policy = RECOVERY WINDOW OF 30 DAYS

5. Security: Use .pgpass

# On Barman server, as barman user
# Once these entries exist, password= can be dropped from conninfo and streaming_conninfo
echo "<GITLAB_SERVER_IP>:5432:gitlabhq_production:barman:<barman-password>" >> ~/.pgpass
echo "<GITLAB_SERVER_IP>:5432:replication:barman_streaming:<streaming-password>" >> ~/.pgpass
chmod 600 ~/.pgpass
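
The password file treats ':' and '\' as special characters, so a password containing either has to be escaped with a backslash before it is written to ~/.pgpass. A minimal sketch of that escaping; pgpass_escape and pgpass_line are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Escape backslashes first, then colons, as the .pgpass format requires.
pgpass_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/:/\\:/g'
}

# Build one host:port:database:user:password line.
pgpass_line() {
  local host="$1" port="$2" db="$3" user="$4" password="$5"
  printf '%s:%s:%s:%s:%s\n' "$host" "$port" "$db" "$user" "$(pgpass_escape "$password")"
}
```

Usage: pgpass_line <GITLAB_SERVER_IP> 5432 gitlabhq_production barman '<barman-password>' >> ~/.pgpass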

6. Initialize Barman

# On Barman server
barman check <SERVER_NAME>

# Create replication slot
barman receive-wal --create-slot <SERVER_NAME>

# Start WAL streaming (run in background or via cron)
barman receive-wal <SERVER_NAME> &
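
barman check prints one line per check, ending in OK or FAILED. A small sketch that reduces that output to the failing checks so it can gate later steps; failed_checks is a hypothetical helper, not part of Barman.

```shell
#!/usr/bin/env bash
# Read `barman check` output on stdin and print only the failing checks,
# with leading whitespace removed. Returns 1 if anything failed.
failed_checks() {
  local failures
  failures=$(sed -n -e '/FAILED/{s/^[[:space:]]*//;p;}')
  if [ -n "$failures" ]; then
    printf '%s\n' "$failures"
    return 1
  fi
  return 0
}
```

Usage: barman check <SERVER_NAME> | failed_checks || echo "resolve the checks above before taking a backup"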

7. Automate with Cron

Set up barman cron to automatically manage receive-wal and other maintenance tasks:

# On Barman server, as barman user
sudo -u barman crontab -e

# Add this line to run every minute:
* * * * * /usr/bin/barman cron

The cron job will:

  • Automatically start/monitor receive-wal processes
  • Archive incoming WAL files
  • Enforce retention policies
  • Survive server restarts
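
barman cron handles WAL and retention but does not take base backups, so those need their own schedule. A sketch of a wrapper that a second crontab entry could call; the script path, the weekly schedule and the BARMAN_BIN override (there so the logic can be dry-run) are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper, e.g. /usr/local/bin/barman-base-backup.sh
# BARMAN_BIN defaults to the real binary; override it for a dry run.
BARMAN_BIN=${BARMAN_BIN:-/usr/bin/barman}

run_base_backup() {
  local server="$1"
  # Refuse to back up if any check fails (barman check exits non-zero).
  "$BARMAN_BIN" check "$server" || return 1
  # --wait blocks until the WAL needed for a consistent backup is archived.
  "$BARMAN_BIN" backup --wait "$server"
}

# Example crontab line for the barman user (weekly, Sunday 02:00):
# 0 2 * * 0 /usr/local/bin/barman-base-backup.sh <SERVER_NAME>
if [ "$#" -gt 0 ]; then
  run_base_backup "$1"
fi
```

Running it with BARMAN_BIN=echo prints the two commands instead of executing them, which is a cheap way to check the crontab entry before relying on it.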

8. Take Backups

# On Barman server
barman backup <SERVER_NAME>

# List backups
barman list-backup <SERVER_NAME>

# Show backup details
barman show-backup <SERVER_NAME> <BACKUP_ID>

9. Test Recovery

# Restore to local directory
barman recover <SERVER_NAME> <BACKUP_ID> /path/to/recovery

# Or restore to remote server (requires SSH keys)
barman recover <SERVER_NAME> <BACKUP_ID> /path/to/recovery --remote-ssh-command "ssh postgres@recovery-server"
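
For the monthly recovery test, the restore can be scripted so it always targets an empty scratch directory. A sketch under stated assumptions: recovery_test and the BARMAN_BIN override are hypothetical, "latest" is Barman's alias for the most recent backup, and starting PostgreSQL on the restored directory to validate it is left out.

```shell
#!/usr/bin/env bash
BARMAN_BIN=${BARMAN_BIN:-/usr/bin/barman}

# recovery_test SERVER DEST_DIR [TARGET_TIME]
# Restores the most recent backup into DEST_DIR. With TARGET_TIME
# (e.g. "2025-11-18 12:00:00") it requests point-in-time recovery; the
# target must be later than the end of the backup being restored.
recovery_test() {
  local server="$1" dest="$2" target="${3:-}"
  # Never restore over existing data.
  if [ -d "$dest" ] && [ -n "$(ls -A "$dest")" ]; then
    echo "refusing to restore into non-empty $dest" >&2
    return 1
  fi
  if [ -n "$target" ]; then
    "$BARMAN_BIN" recover --target-time "$target" "$server" latest "$dest"
  else
    "$BARMAN_BIN" recover "$server" latest "$dest"
  fi
}
```

Usage: recovery_test <SERVER_NAME> /path/to/recovery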

Lessons Learned

PostgreSQL Network Listening

  • GitLab's bundled PostgreSQL defaults to Unix sockets only
  • Must configure postgresql['listen_address'] to enable network connections
  • Requires two-step reconfigure process (stop, start PostgreSQL, reconfigure, start all)
  • GitLab Rails can continue using the Unix socket for better performance

Authentication Method

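  • PostgreSQL 14 and later store passwords as SCRAM-SHA-256 by default (password_encryption)
  • Use the scram-sha-256 method in pg_hba.conf rather than md5, so the method matches the stored password format
  • A mismatch between the stored hash and the pg_hba.conf method (for example, an md5-hashed password with the scram-sha-256 method) causes authentication failures even with the correct password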
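
To see which hash format a role actually has, the stored value can be read from pg_authid (superuser only). A sketch that classifies a rolpassword value by its prefix; password_hash_kind is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Classify a pg_authid.rolpassword value by its prefix.
password_hash_kind() {
  case "$1" in
    'SCRAM-SHA-256$'*) echo scram-sha-256 ;;
    md5*)              echo md5 ;;
    '')                echo none ;;
    *)                 echo unknown ;;
  esac
}
```

Usage: password_hash_kind "$(sudo gitlab-psql -d postgres -At -c "SELECT rolpassword FROM pg_authid WHERE rolname = 'barman'")"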

User Creation

  • GitLab's automatic user creation via postgresql['sql_replication_password'] may not set password correctly
  • Manually reset password after reconfigure using ALTER ROLE

WAL Archiving Methods

  • Streaming archiver (recommended): Barman pulls WAL via streaming replication
    • Simpler setup, no SSH keys or barman-cli needed on GitLab server
    • Configure with streaming_archiver = on in Barman config
  • Traditional archiver: PostgreSQL pushes WAL via archive_command
    • Requires barman-cli on GitLab server and SSH keys
    • More complex, not needed when using streaming
  • Don't use both methods - they're redundant and cause confusion

Directory Permissions

  • /var/lib/barman/ - Mode 0700 (barman:barman)
  • /etc/barman.d/ - Mode 0750 (root:barman)
  • /var/log/barman/ - Mode 0755 (barman:barman)
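
The modes and owners listed above can be verified with a short check. A sketch assuming GNU stat (the -c option); check_mode is a hypothetical helper.

```shell
#!/usr/bin/env bash
# check_mode PATH EXPECTED_MODE EXPECTED_OWNER:GROUP
# Accepts the mode with or without a leading zero (0700 or 700).
# Prints a line and returns 1 on mismatch.
check_mode() {
  local path="$1" want_mode="${2#0}" want_owner="$3" actual
  actual=$(stat -c '%a %U:%G' "$path") || return 1
  if [ "$actual" != "$want_mode $want_owner" ]; then
    echo "$path: expected $want_mode $want_owner, found $actual"
    return 1
  fi
}
```

Usage: check_mode /var/lib/barman 0700 barman:barman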

Backup Troubleshooting

  • If backup hangs with BackupWaitWalArchive, check archive_command configuration
  • Force WAL switch to help stuck backups: sudo gitlab-psql -c "SELECT pg_switch_wal();"
  • Check walsender status: sudo gitlab-psql -c "SELECT * FROM pg_stat_activity WHERE backend_type = 'walsender';"

Server Restarts

  • receive-wal process doesn't automatically restart after reboot
  • Use barman cron to automatically manage processes
  • Password authentication may fail after restart - needs investigation

GitLab Geo Compatibility

If using GitLab Geo, increase capacity:

postgresql['max_wal_senders'] = 15  # Accommodate both Barman and Geo
postgresql['max_replication_slots'] = 10
postgresql['max_slot_wal_keep_size'] = '10GB'  # PostgreSQL 13+

Barman and Geo use separate replication slots and don't interfere with each other.

Monitoring

-- Check replication slots
SELECT slot_name, slot_type, active, restart_lsn 
FROM pg_replication_slots;

-- Check for inactive slots (can cause disk space issues)
SELECT
    slot_name, active, 
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots 
WHERE NOT active;

-- Monitor WAL senders
SELECT * FROM pg_stat_replication;
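
The inactive-slot query can feed a simple threshold alert. A sketch that filters unaligned psql output for slots retaining more WAL than a limit; slots_over_limit and the 5 GiB threshold in the usage comment are illustrative assumptions.

```shell
#!/usr/bin/env bash
# Read "slot_name|retained_bytes" lines on stdin (psql -At output) and
# print the names of slots retaining more than LIMIT_BYTES.
slots_over_limit() {
  awk -F'|' -v limit="$1" '$2 + 0 > limit + 0 { print $1 }'
}

# Usage on the primary (5 GiB limit):
# sudo gitlab-psql -d postgres -At -c "SELECT slot_name,
#   pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
#   FROM pg_replication_slots WHERE NOT active" \
#   | slots_over_limit $((5 * 1024 * 1024 * 1024))
```

Any slot name printed is a candidate for investigation before retained WAL fills the primary's disk.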

Key Decisions

  • Dedicated Barman server (using separate VM)
  • Backup retention period: 30 days
  • Backup schedule: continuous WAL archiving plus periodic base backups via cron
  • Storage capacity: 3x database size (for testing)
  • WAL archiving method: Streaming (not traditional archive_command)

Alternative Options Considered

  1. pgBackRest - More complex but powerful compression/deduplication
  2. GitLab Geo - Provides DR but not point-in-time recovery
  3. pg_basebackup - Simpler but less feature-rich

Edited by Mike Lockhart | GitLab