Hi,
I’m running deployments with serial: 1 because I need to process hosts one by one. My goal is that if a host becomes stuck (e.g., due to an OOM error) or gets terminated (e.g., a spot instance is reclaimed) during a deployment, Ansible should:
- Mark that specific host as failed.
- Move on to the next host in the inventory.
- Not block the entire deployment pipeline.
- At the end of the play (after all hosts have been attempted), show a summary (like the standard PLAY RECAP) that clearly identifies which hosts succeeded and which failed, and run post_tasks to determine the overall job status based on a custom failure threshold.
I’m using the following relevant settings in my play:
- serial: 1
- ignore_unreachable: true
- max_fail_percentage: 100
- A play-level timeout: 300 (5 minutes) for the main task block.
- A block/rescue/always structure for the main deployment tasks.
  - The rescue block is intended to catch failures/timeouts and record the failed host.
  - The always block has tasks like unattended-upgrades with ignore_errors: true and a short ansible_connect_timeout (e.g., 3 seconds) to ensure it doesn’t hang on a dead host.
- The post_tasks that run on localhost to summarize and determine the final job status.
The Problem:
When a host becomes unreachable during the execution of tasks within the main block (e.g., a task attempts an SSH connection and it times out, resulting in an UNREACHABLE error for that task):
- The UNREACHABLE error is correctly ignored at the task level (due to ignore_unreachable: true at the play level).
- The host is marked as failed=1 in the PLAY RECAP.
- However, the rescue block associated with the main block does not seem to execute (rescued=0 in PLAY RECAP).
- Consequently, the always block also doesn’t execute as part of that block/rescue/always structure.
- Crucially, Ansible does not proceed to the next host in the inventory.
- The post_tasks are also not executed.
- The playbook effectively ends after the first host fails with an UNREACHABLE state, showing a PLAY RECAP for only that single host.
It seems like the UNREACHABLE status, even when ignore_unreachable: true is set, is causing the play to terminate its processing for the current serial: 1 batch prematurely, without triggering rescue/always or moving to the next host or to post_tasks.
If, instead, a task within the block fails due to a command error (while the host is still reachable), or if the play-level timeout: 300 is hit, the rescue and always blocks do seem to trigger. Even then, Ansible may still fail to proceed to the next host or to post_tasks if the job ends abruptly, for example if AWX itself times out the job; I’ve ruled that cause out by setting no job template timeout.
What I’ve tried/verified:
- Ensured no AWX Job Template timeout is interfering.
- Set ansible_connect_timeout to a low value (e.g., 3-5 seconds) for tasks in always and globally via ansible_ssh_common_args.
- The playbook works correctly and processes all hosts if no hosts fail or become unreachable.
Question: How can I ensure that when a host in a serial: 1 batch becomes UNREACHABLE:
- The rescue and always blocks for that host’s main tasks are reliably executed?
- Ansible then robustly moves to the next host in the inventory?
- The post_tasks are executed after all hosts have been attempted?
Is there a different approach or a known interaction with UNREACHABLE in serial: 1 plays that I might be missing?
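One approach sometimes suggested for this kind of situation, shown here only as an untested sketch, is to probe the host with ansible.builtin.wait_for_connection as the first task of the block. The assumption is that this module reports a timeout as an ordinary failed task rather than as UNREACHABLE, so the failure would be eligible for rescue. The 30-second timeout and the debug message are illustrative values, not part of the original playbook, and the probe only covers a host that is already dead when the block starts, not one that disappears in the middle of the role.

```yaml
# Untested sketch: probe reachability inside the block so that a dead host
# produces a failed task (which rescue can catch) instead of UNREACHABLE.
- hosts: all
  gather_facts: false
  serial: 1
  tasks:
    - block:
        - name: "Probe the host before deploying"
          ansible.builtin.wait_for_connection:
            timeout: 30         # example value: give up after 30 s in total
            connect_timeout: 3  # per-attempt connection timeout in seconds

        - name: "Import deploy role"
          ansible.builtin.import_role:
            name: deploy_awx
      rescue:
        - name: "Report the host that failed or could not be reached"
          ansible.builtin.debug:
            msg: "{{ inventory_hostname }} failed or was unreachable"
```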
Here’s a snippet of the relevant playbook structure:
- hosts: all
  gather_facts: no
  become: true
  max_fail_percentage: 100
  ignore_unreachable: true # At play level
  timeout: 300 # For the main tasks block
  serial: 1
  vars:
    max_allowed_failures: 1 # Example
  # ... pre_tasks to init a bad_hosts list ...
  tasks:
    - block:
        # ... multiple import_role tasks for deployment ...
        - name: "Import deploy role"
          import_role:
            name: deploy_awx # Example role where UNREACHABLE might occur
      rescue:
        - name: "Record failed host"
          set_fact:
            bad_hosts_accumulator: "{{ #... logic ... }}"
          delegate_to: localhost
      always:
        - name: "Final cleanup task (e.g., unattended-upgrades)"
          import_role:
            name: unattended-upgrades
          vars:
            ansible_connect_timeout: 3 # Short timeout for this
          ignore_errors: true
  post_tasks:
    # ... tasks on localhost to summarize and fail job if bad_hosts > max_allowed_failures ...
    - name: "Fail job if too many hosts failed"
      fail:
        msg: "Deployment failed on too many hosts: {{ final_bad_hosts_list }}"
      when: final_bad_hosts_list | length > max_allowed_failures
      run_once: true
      delegate_to: localhost
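
According to the Ansible documentation on serial and run_once, a play with serial runs to completion on each batch, and run_once tasks run once per batch rather than once overall. With serial: 1, the post_tasks above would therefore be evaluated after every host, not once at the end. A possible alternative, sketched here under assumptions, is to move the summary into a second play on localhost. The fact name deploy_ok is hypothetical: the sketch assumes the first play sets it with set_fact as the last task of the deployment block, so that any host without it is counted as failed. It also assumes the run actually reaches the second play, which may not hold if the first play aborts.

```yaml
# Sketch only: `deploy_ok` is a hypothetical per-host fact, assumed to be set
# by the first play as the final task of a successful deployment block.
- name: "Summarize deployment results"
  hosts: localhost
  gather_facts: false
  vars:
    max_allowed_failures: 1 # Same example threshold as in the first play
  tasks:
    - name: "Collect hosts that never recorded a successful deployment"
      ansible.builtin.set_fact:
        final_bad_hosts_list: >-
          {{ groups['all']
             | map('extract', hostvars)
             | rejectattr('deploy_ok', 'defined')
             | map(attribute='inventory_hostname')
             | list }}

    - name: "Fail job if too many hosts failed"
      ansible.builtin.fail:
        msg: "Deployment failed on too many hosts: {{ final_bad_hosts_list }}"
      when: final_bad_hosts_list | length > max_allowed_failures
```

Counting hosts that lack a success marker, rather than hosts recorded by rescue, means a host that dropped out without ever reaching rescue is still included in the list.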
Any insights or suggestions would be greatly appreciated!
Thanks!