How to properly manage pg_backup_start across Ansible tasks for consistent snapshots?

Hello!

I’m building a playbook that needs to perform a backup-and-restore style process using filesystem-level snapshots of a PostgreSQL server (EBS snapshots on AWS, in this case), and I want to ensure the database is in a consistent state using pg_backup_start.

My understanding is that pg_backup_start only keeps the database in backup mode for the duration of that specific session (or until pg_backup_stop is run - source: https://pgpedia.info/p/pg_backup_start.html). This is giving me trouble with automating this process in Ansible, as it appears each task runs in a new, separate session.

This means if I have one task to call pg_backup_start and a second task to take the snapshot, the backup mode seems to have been disabled before the snapshot is actually taken. This creates a potential race condition.
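For example, reproducing this manually with two separate psql invocations (container and user names are placeholders):

```shell
# Session 1: enter backup mode; backup mode ends when this psql session exits.
docker exec -e PGPASSWORD=... pg psql -U postgres \
  -c "SELECT pg_backup_start('snap');"

# Session 2: a brand-new session, so there is no backup in progress to stop.
docker exec -e PGPASSWORD=... pg psql -U postgres \
  -c "SELECT pg_backup_stop();"   # fails: no backup in progress
```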

Here’s a simplified look at my playbook structure:

- name: Snapshot Postgres
  hosts: <whatever>
  become: true

  tasks:
    - name: Enable PostgreSQL backup mode (pg_backup_start)
      # Executes the pg_backup_start function inside the PostgreSQL container.
      # This prepares the database for a consistent filesystem-level backup (snapshot).
      community.docker.docker_container_exec:
        container: "{{ augur_pg_container_name }}"
        command: "psql -U {{ augur_pg_container_user }} -c \"SELECT pg_backup_start('{{ backup_label }}');\""
        # psql connects as the configured user and runs the SQL function.
      environment:
        PGPASSWORD: "{{ augur_pg_password }}" # Pass the password securely via environment variable
      register: pg_backup_start_result
      # Check for successful output from psql. pg_backup_start returns a text string.
      failed_when: "'pg_backup_start' not in pg_backup_start_result.stdout"
      changed_when: true # This task changes the state of the database

    # Is the database still in backup mode here, after the previous task's session ended?

    - name: Take filesystem snapshot (e.g., EBS, LVM, ZFS)
      # ... snapshot module and parameters ...

    - name: Disable PostgreSQL backup mode (pg_backup_stop)
      # Executes the pg_backup_stop function inside the PostgreSQL container.
      # This finalizes the backup process and flushes any remaining WAL data.
      community.docker.docker_container_exec:
        container: "{{ augur_pg_container_name }}"
        command: "psql -U {{ augur_pg_container_user }} -c \"SELECT pg_backup_stop();\""
      environment:
        PGPASSWORD: "{{ augur_pg_password }}"
      register: pg_backup_stop_result
      # Check for successful output from psql. pg_backup_stop returns a text string.
      failed_when: "'pg_backup_stop' not in pg_backup_stop_result.stdout"
      changed_when: true # This task changes the state of the database
      # This task should be executed *after* the EBS snapshot has completed.

My main question is: What is the correct Ansible-native way to execute pg_backup_start, take a snapshot, and then run pg_backup_stop while ensuring the database remains in backup mode until the snapshot has been created?

I’m trying to avoid relying on PostgreSQL’s self-recovery on restore, as I’d prefer to guarantee the snapshot is perfectly consistent from the start.

Any Ansible or PostgreSQL magic that could help here would be greatly appreciated! In theory I could achieve this with a custom module, but this feels like a common enough use case that I suspect someone else has already solved it.


You are missing an important step: you also need all WAL files from the pg_backup_start() call until the pg_backup_stop() call, and the latter will also rotate the last WAL segment for you.

The snapshot you are taking will always be like a cold restart of the server: all ongoing transactions at the time of the snapshot are rolled back, and the database will run a full recovery upon restart.


Ah yep, I've been meaning to add that too. I just figured the bigger problem was that running the playbook makes the stop-backup step fail with a message saying no backup is in progress, which halts any later steps in the backup job (such as copying the WAL files).

Is the problem because I’m running these postgres commands through psql and/or docker when a better ansible module exists?

I suggest that you first try to understand the fundamentals of how the backup works on a local PostgreSQL installation: no Ansible, no automation.

It works like this, in very basic steps:

  • You need to have archive_command enabled, and copy the WAL files away
  • You run pg_backup_start() (pg_start_backup() before PostgreSQL 15)
  • You take a backup of the data directory (usually with a basebackup, but a snapshot works, with additional steps)
  • You run pg_backup_stop() (pg_stop_backup() before PostgreSQL 15)
  • You still copy the WAL files

When you restore, you need to set up restore_command, which will fetch the WAL files from your external location. When you start the database on the snapshot/basebackup, it will find that it’s basically a cold restart, and will use restore_command to fetch the WAL files needed from the start of the backup to the end of the backup (and further) and replay the transactions.
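For illustration, the restore side on the copy could look like this (the archive path is a placeholder):

```shell
# In postgresql.conf of the restored data directory:
#   restore_command = 'cp /backups/wal/%f "%p"'

# Signal recovery and start the server; it will fetch and replay the
# archived WAL from the start of the backup onwards.
touch /var/lib/postgresql/data/recovery.signal
pg_ctl -D /var/lib/postgresql/data start
```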

In your case you don’t have a copy of the WAL files, you don’t have archive_command, and you don’t have restore_command. You end up with a cold restart: the database will go back to the last CHECKPOINT and replay as many transactions from the WAL files in pg_wal as possible. Every transaction which is not committed when the recovery reaches the end of available WAL will be rolled back. That’s it.
