Ansible deployment gets stuck/ends prematurely on unreachable host, doesn't proceed to next host

Hi,

I’m running deployments with serial: 1 because I need to process hosts one by one. My goal is that if a host becomes stuck (e.g., due to an OOM error) or gets terminated (e.g., a spot instance is reclaimed) during a deployment, Ansible should:

  1. Mark that specific host as failed.
  2. Move on to the next host in the inventory.
  3. Not block the entire deployment pipeline.
  4. At the end of the play (after all hosts are attempted), a summary (like the standard PLAY RECAP) should clearly show which hosts succeeded and which failed, and post_tasks should run to determine the overall job status based on a custom threshold of failures.

I’m using the following relevant settings in my play:

  • serial: 1
  • ignore_unreachable: true
  • max_fail_percentage: 100
  • A play-level timeout: 300 (5 minutes) for the main task block.
  • A block/rescue/always structure for the main deployment tasks.
    • The rescue block is intended to catch failures/timeouts and record the failed host.
    • The always block has tasks like unattended-upgrades with ignore_errors: true and a short ansible_connect_timeout (e.g., 3 seconds) to ensure it doesn’t hang on a dead host.
  • post_tasks that run on localhost to summarize and determine the final job status.

The Problem:

When a host becomes unreachable during the execution of tasks within the main block (e.g., a task attempts an SSH connection and it times out, resulting in an UNREACHABLE error for that task):

  • The UNREACHABLE error is correctly ignored at the task level (due to ignore_unreachable: true at the play level).
  • The host is marked as failed=1 in the PLAY RECAP.
  • However, the rescue block associated with the main block does not seem to execute (rescued=0 in PLAY RECAP).
  • Consequently, the always block also doesn’t execute as part of that block/rescue/always structure.
  • Crucially, Ansible does not proceed to the next host in the inventory.
  • The post_tasks are also not executed.
  • The playbook effectively ends after the first host fails with an UNREACHABLE state, showing a PLAY RECAP for only that single host.

It seems like the UNREACHABLE status, even when ignore_unreachable: true is set, is causing the play to terminate its processing for the current serial: 1 batch prematurely, without triggering rescue/always or moving on to the next host and post_tasks.

If, instead, a task within the block fails due to a command error (while the host is still reachable), or if the play-level timeout: 300 is hit, the rescue and always blocks do trigger. However, the play can still stop short of the next host or the post_tasks if the job ends abruptly (e.g., if AWX itself times out the job, which I’ve ruled out by leaving the job template timeout unset).

What I’ve tried/verified:

  • Ensured no AWX Job Template timeout is interfering.
  • Set ansible_connect_timeout to a low value (e.g., 3-5 seconds) for tasks in always and globally via ansible_ssh_common_args.
  • The playbook works correctly and processes all hosts if no hosts fail or become unreachable.

Question: How can I ensure that when a host in a serial: 1 batch becomes UNREACHABLE:

  1. The rescue and always blocks for that host’s main tasks are reliably executed?
  2. Ansible then robustly moves to the next host in the inventory?
  3. The post_tasks are executed after all hosts have been attempted?

Is there a different approach or a known interaction with UNREACHABLE in serial: 1 plays that I might be missing?

Here’s a snippet of the relevant playbook structure:

- hosts: all
  gather_facts: no
  become: true
  max_fail_percentage: 100
  ignore_unreachable: true # At play level
  timeout: 300             # For the main tasks block
  serial: 1
  vars:
    max_allowed_failures: 1 # Example
  # ... pre_tasks to init a bad_hosts list ...

  tasks:
    - block:
        # ... multiple import_role tasks for deployment ...
        - name: "Import deploy role"
          import_role:
            name: deploy_awx # Example role where UNREACHABLE might occur

      rescue:
        - name: "Record failed host"
          set_fact:
            bad_hosts_accumulator: "{{ #... logic ... }}"
          delegate_to: localhost

      always:
        - name: "Final cleanup task (e.g., unattended-upgrades)"
          import_role:
            name: unattended-upgrades
          vars:
            ansible_connect_timeout: 3 # Short timeout for this
          ignore_errors: true

  post_tasks:
    # ... tasks on localhost to summarize and fail job if bad_hosts > max_allowed_failures ...
    - name: "Fail job if too many hosts failed"
      fail:
        msg: "Deployment failed on too many hosts: {{ final_bad_hosts_list }}"
      when: final_bad_hosts_list | length > max_allowed_failures
      run_once: true
      delegate_to: localhost
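
(The accumulation logic is elided above. As a rough sketch of the pattern, simplified and not exactly my real logic, the list can be kept on localhost with delegate_facts so it survives across serial: 1 batches:)

      rescue:
        - name: "Record failed host (simplified sketch)"
          set_fact:
            bad_hosts_accumulator: "{{ hostvars['localhost']['bad_hosts_accumulator'] | default([]) + [inventory_hostname] }}"
          delegate_to: localhost
          delegate_facts: true  # store the fact on localhost itself so later batches and post_tasks can read it

final_bad_hosts_list in post_tasks would then be read from hostvars['localhost']['bad_hosts_accumulator'] | default([]).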

Any insights or suggestions would be greatly appreciated!

Thanks!

  • Consequently, the always block also doesn’t execute as part of that block/rescue/always structure.
  • Crucially, Ansible does not proceed to the next host in the inventory.

That doesn’t seem right, but I wasn’t able to reproduce. What version of Ansible are you using? Unreachable hosts don’t run tasks in the rescue section (as documented), but the always section should run.

Could you have any_errors_fatal set somewhere?

Maybe you could use a fail task at the end of the role that runs if any of the tasks were unreachable, so the host is rescued. Or you could do it in the always section, if all the tasks in the role register their results with a common naming pattern.

- name: Fail unreachable hosts that haven't otherwise failed
  fail:
  when:
    - inventory_hostname not in ansible_failed_hosts
    - |
      task_1_result is unreachable
      or task_2_result is unreachable
      # ... etc
- name: Count unreachable hosts in the always
  ... some logic ...
  when:
    - inventory_hostname not in ansible_failed_hosts
    - q('vars', *q('varnames', 'deploy_*')) | select('unreachable') | length != 0

Edit: Correction - unreachable hosts are removed from the list of active hosts and do not execute always section tasks. Sorry, I had skimmed some ambiguous documentation without validating it first.

Hi again,

I’ve been running a lot of tests recently, but I’m still unable to get the behavior I’m aiming for. I’m not sure if what I want is fully achievable in Ansible as it stands.

Here’s a simplified example of what I’m trying:

- hosts: all
  gather_facts: no
  become: true
  serial: 1
  vars_files:
    - vars/<project-name>.yml
  collections:
    - <collection>

  tasks:
    - block:
        - name: "SSH Key"
          import_role:
            name: dynamic_ssh_key

        - name: "Import OS role"
          import_role:
            name: os

        - name: "Import aws_metadata role"
          import_role:
            name: aws_metadata

        - name: "Import nodejs role"
          import_role:
            name: geerlingguy_nodejs

        - name: "Import deploy role"
          import_role:
            name: deploy_awx

      always:
        - name: "Enable unattended-upgrades"
          import_role:
            name: unattended-upgrades
          vars:
            unattended_upgrade: "enabled"
          ignore_errors: true

At the play level, I’ve tried various combinations of:

max_fail_percentage: 100
ignore_unreachable: true
timeout: 300

And in the AWX job template, I’m using:

ansible_ssh_common_args: >
  -o ConnectTimeout=5
  -o ServerAliveInterval=15
  -o ServerAliveCountMax=2
ansible_ssh_reuse_connections: false

Despite all this, I still can’t get Ansible to reliably move on to the next host in the inventory when one becomes unreachable. Either the job stops entirely after the first unreachable host, or I configure it to ignore everything and the job falsely reports success for all hosts.

Ideally, I want the play to:

  • Handle unreachable hosts gracefully.
  • Mark them as failed or record them as bad.
  • Move on to the next host (since I’m using serial: 1).
  • Finally, reflect a failed job status in AWX if one or more hosts didn’t complete successfully.

Is there any known pattern or workaround that can enforce this kind of “host isolation” behavior, where one failing or unreachable host doesn’t prevent the rest from being processed?

Any insight would be really appreciated — thanks again!

P.S.: I’m not using any_errors_fatal.

If you’re not trying to abort the play early, serial: 1 (changing the batch size) seems like the wrong option. If a batch fails completely, the play stops instead of running subsequent batches. Have you tried throttle: 1 instead?
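
As a rough sketch of what I mean (untested, role name copied from your snippet): keep all hosts in a single batch and put throttle: 1 on the block, so the deployment tasks still run one host at a time but a dead host only removes itself from the play instead of ending it:

- hosts: all
  gather_facts: no
  become: true
  tasks:
    - block:
        - name: "Import deploy role"
          import_role:
            name: deploy_awx
      throttle: 1  # limit the block's tasks to one host at a time

With the default linear strategy this serializes each task rather than the whole play per host, so the ordering differs from serial: 1, but a failed or unreachable host no longer empties the batch.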

Unreachable errors are fatal with serial: 1 because unreachable hosts cannot bring themselves back into the play and there’s no non-failed host in the batch.

In your original post, you’re recalculating the play recap, and I don’t understand what’s being added that wouldn’t be part of the default play recap. However, assuming failed hosts aren’t rescued/ignored, unreachable hosts aren’t ignored, and there’s at least 1 successful host to perform the delegation, you could give a custom recap with a set_stats task like this:

  post_tasks:
    - name: Provide custom recap and extra vars for the next playbook in the workflow
      set_stats:
        data:
          failed_hosts: "{{ ansible_failed_hosts }}"
          unreachable_hosts: "{{ ansible_play_hosts_all | difference(ansible_current_hosts) | difference(ansible_failed_hosts) }}"
          successful_hosts: "{{ ansible_current_hosts }}"
      run_once: True
      delegate_to: localhost

The custom recap is displayed if show_custom_stats is configured, but the variables will be available to the next playbook in the workflow either way.
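
For reference, that setting lives in ansible.cfg (or the ANSIBLE_SHOW_CUSTOM_STATS environment variable):

[defaults]
show_custom_stats = True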