Hi Ansible experts!
I am working on a playbook that executes a long-running task (patching). If the session gets disconnected halfway through the execution, it can cause serious damage to the target server.
Using async with retries seems to be the right approach, I hope.
I am now trying to handle, and report, the scenario where an “unreachable” error occurs while the async job is running.
We could have intermittent LDAP and network issues. Thanks to async, the patch shell script won’t stop running, but I want to inform the engineers (email, dashboard, etc.) that the playbook has actually stopped due to an “unreachable” error.
But apparently ignore_unreachable does not work as I expected when used with retries.
Based on the feedback given in this issue, https://github.com/ansible/ansible/issues/78358, ignore_unreachable is not honored with retries.
I am not really looking for “keep retrying” functionality; I just need a failed task so that I can rescue properly afterwards.
This is an example of what I am trying to achieve:
- name: patching
  become: yes
  block:
    - name: run patching async
      async: 43200
      poll: 0
      shell: my_patch.sh
      register: patch_sleeper
    - name: wait for async job to end
      async_status:
        jid: '{{ patch_sleeper.ansible_job_id }}'
      register: job_result
      until: job_result.finished
      retries: 720
      delay: 1
      ignore_unreachable: true
    - name: error handling for unreachable in the middle of the run
      fail:
        msg: Detected unreachable host error. Forcing a fail to trigger any rescue.
      when: job_result.unreachable is defined
  rescue:
    - name: send message to Monitoring Dashboard
      my_method:
        message: there was an error during patching
And this is an extract of the output I get, which clearly shows that ignore_unreachable: true is ignored (no pun intended).
(ansible 2.12)
TASK [wait for async job to end] *************************************************************
FAILED - RETRYING: [mynode]: wait for async job to end (720 retries left).
…
FAILED - RETRYING: [mynode]: wait for async job to end (705 retries left).
fatal: [mynode]: UNREACHABLE! => changed=false
  msg: 'Failed to connect to the host via ssh: '
  skip_reason: Host mynode is unreachable
  unreachable: true
NO MORE HOSTS LEFT *************************************
I explored developing my own custom_async_status, callbacks, or action plugins, but none of those seem capable of changing the status from “unreachable” to “failed”.
Is there any way to convert the “unreachable” status into “failed” so that it can be rescued? Or to somehow make ignore_unreachable work with retries?
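To make the question concrete, here is a rough, untested sketch of the kind of behavior I am after. It polls once per pass without until/retries and loops through a recursive include instead. It assumes that ignore_unreachable is honored when no retries are set, as the issue above suggests. The file name poll_async.yml and the 60-second pause are placeholders of mine, not something taken from the Ansible docs.

```yaml
# poll_async.yml (hypothetical file name): included from inside the block,
# in place of the "wait for async job to end" task.
- name: check async job once (no until/retries on this task)
  async_status:
    jid: '{{ patch_sleeper.ansible_job_id }}'
  register: job_result
  ignore_unreachable: true   # assumption: honored here because there are no retries

- name: turn an unreachable result into a regular failure
  fail:
    msg: Detected unreachable host error. Forcing a fail to trigger any rescue.
  when: job_result.unreachable | default(false)

- name: wait before the next poll (interval is an arbitrary choice)
  pause:
    seconds: 60
  when: not (job_result.finished | default(0) | bool)

- name: poll again while the job is still running
  include_tasks: poll_async.yml
  when: not (job_result.finished | default(0) | bool)
```

The block would then call include_tasks: poll_async.yml once after starting the async job. I have only thought this through for a single host, and a long patch run means many nested includes, so I am not sure it is a sane approach.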
Thanks in advance,
FP