Hi,
I’m running deployments with `serial: 1` because I need to process hosts one by one. My goal is that if a host becomes stuck (e.g., due to an OOM error) or gets terminated (e.g., a spot instance is reclaimed) during a deployment, Ansible should:
- Mark that specific host as failed.
- Move on to the next host in the inventory.
- Not block the entire deployment pipeline.
- At the end of the play (after all hosts have been attempted), show a summary (like the standard PLAY RECAP) that clearly indicates which hosts succeeded and which failed, and run `post_tasks` to determine the overall job status based on a custom threshold of failures.
I’m using the following relevant settings in my play:
- `serial: 1`
- `ignore_unreachable: true`
- `max_fail_percentage: 100`
- A play-level `timeout: 300` (5 minutes) for the main task block.
- A `block`/`rescue`/`always` structure for the main deployment tasks.
  - The `rescue` block is intended to catch failures/timeouts and record the failed host.
  - The `always` block has tasks like `unattended-upgrades` with `ignore_errors: true` and a short `ansible_connect_timeout` (e.g., 3 seconds) to ensure it doesn’t hang on a dead host.
- The `post_tasks` that run on `localhost` to summarize and determine the final job status.
The Problem:
When a host becomes unreachable during the execution of tasks within the main `block` (e.g., a task attempts an SSH connection and it times out, resulting in an `UNREACHABLE` error for that task):
- The `UNREACHABLE` error is correctly ignored at the task level (due to `ignore_unreachable: true` at the play level).
- The host is marked as `failed=1` in the PLAY RECAP.
- However, the `rescue` block associated with the main `block` does not seem to execute (`rescued=0` in PLAY RECAP).
- Consequently, the `always` block also doesn’t execute as part of that `block`/`rescue`/`always` structure.
- Crucially, Ansible does not proceed to the next host in the inventory.
- The `post_tasks` are also not executed.
- The playbook effectively ends after the first host fails with an `UNREACHABLE` state, showing a PLAY RECAP for only that single host.
It seems like the `UNREACHABLE` status, even when `ignore_unreachable: true` is set, is causing the play to terminate its processing for the current `serial: 1` batch prematurely, without triggering `rescue`/`always` or moving to the next host/`post_tasks`.
If, instead, a task within the block fails due to a command error (while the host is still reachable) or if the play-level `timeout: 300` is hit, the `rescue` and `always` blocks do seem to trigger, but the issue of not proceeding to the next host or `post_tasks` can still occur if the job ends abruptly (e.g., if AWX itself times out the job, which I’ve ruled out by setting no job template timeout).
What I’ve tried/verified:
- Ensured no AWX Job Template timeout is interfering.
- Set `ansible_connect_timeout` to a low value (e.g., 3-5 seconds) for tasks in `always` and globally via `ansible_ssh_common_args`.
- The playbook works correctly and processes all hosts if no hosts fail or become unreachable.
Question: How can I ensure that when a host in a `serial: 1` batch becomes `UNREACHABLE`:
- The `rescue` and `always` blocks for that host’s main tasks are reliably executed?
- Ansible then robustly moves to the next host in the inventory?
- The `post_tasks` are executed after all hosts have been attempted?
Is there a different approach or a known interaction with `UNREACHABLE` in `serial: 1` plays that I might be missing?
Here’s a snippet of the relevant playbook structure:

    - hosts: all
      gather_facts: no
      become: true
      max_fail_percentage: 100
      ignore_unreachable: true # At play level
      timeout: 300 # For the main tasks block
      serial: 1
      vars:
        max_allowed_failures: 1 # Example
      # ... pre_tasks to init a bad_hosts list ...
      tasks:
        - block:
            # ... multiple import_role tasks for deployment ...
            - name: "Import deploy role"
              import_role:
                name: deploy_awx # Example role where UNREACHABLE might occur
          rescue:
            - name: "Record failed host"
              set_fact:
                bad_hosts_accumulator: "{{ #... logic ... }}"
              delegate_to: localhost
          always:
            - name: "Final cleanup task (e.g., unattended-upgrades)"
              import_role:
                name: unattended-upgrades
              vars:
                ansible_connect_timeout: 3 # Short timeout for this
              ignore_errors: true
      post_tasks:
        # ... tasks on localhost to summarize and fail job if bad_hosts > max_allowed_failures ...
        - name: "Fail job if too many hosts failed"
          fail:
            msg: "Deployment failed on too many hosts: {{ final_bad_hosts_list }}"
          when: final_bad_hosts_list | length > max_allowed_failures
          run_once: true
          delegate_to: localhost

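
To make the threshold check concrete, here is a minimal standalone sketch of just that final step. It is illustrative only: the host names and the hard-coded `final_bad_hosts_list` are made-up sample data standing in for whatever the real accumulator produces, and it does not reproduce my actual `post_tasks`.

```yaml
# Standalone sketch of the failure-threshold check, run against localhost only.
- name: Summarize failures and apply the threshold (sketch)
  hosts: localhost
  connection: local
  gather_facts: false
  vars:
    max_allowed_failures: 1
    # Hypothetical sample data; in the real play this list would be
    # built up by the rescue section as hosts fail.
    final_bad_hosts_list:
      - host-a.example
      - host-b.example
  tasks:
    - name: Report which hosts failed
      ansible.builtin.debug:
        msg: "{{ final_bad_hosts_list | length }} host(s) failed: {{ final_bad_hosts_list | join(', ') }}"

    # The job fails only when the number of failed hosts exceeds the threshold.
    - name: Fail job if too many hosts failed
      ansible.builtin.fail:
        msg: "Deployment failed on too many hosts: {{ final_bad_hosts_list }}"
      when: final_bad_hosts_list | length > max_allowed_failures
```

With the sample data above, two failed hosts exceed `max_allowed_failures: 1`, so the last task fails the job. That is the behaviour I want once every host has been attempted.
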
Any insights or suggestions would be greatly appreciated!
Thanks!