Hey everyone,
I have multiple playbooks that run on a schedule against lots of hosts, some of which are sometimes turned off to save costs.
Almost all jobs on AWX are marked as failed because at least one host is powered off. That is not very pleasant to look at, and it also makes it hard to tell when a job has actually failed on an important task on a host.
Another inconvenience is that the jobs take a long time to execute when lots of hosts are unreachable, because Ansible hangs on them while waiting for the connection. I tried decreasing the timeout setting in our ansible.cfg to 20 seconds, which helped a bit, but Ansible still hangs on the turned-off hosts for a long time before the tasks carry on with the other hosts.
My solution so far was to:
- Add a pre_task to each playbook that runs a wait_for_connection task
- Check whether it failed and, if so, end the play without proceeding, like so:
- hosts: all
  gather_facts: no
  pre_tasks:
    - name: Check host reachability
      wait_for_connection:
        timeout: "{{ ssh_timeout_wait_for | default(5) }}"
        sleep: 1
      ignore_errors: true
      ignore_unreachable: true
      register: host_is_reachable
    - name: End play if host is unreachable
      meta: end_play
      when: host_is_reachable.failed
  roles:
    - role: roles/somerole
This seems to fix my first problem of jobs being marked as failed when one host is unreachable.
But it does not fix my second problem, which is Ansible hanging on the unreachable hosts for so long.
In the wait_for_connection task I have set the timeout to 5 seconds, expecting Ansible to try to reach the host and, if it cannot do so within 5 seconds, end the play. But it does not do that.
Instead, Ansible hangs on the unreachable host for more than 2 minutes, then throws a warning like this:
[WARNING]: Unhandled error in Python interpreter discovery for host
172.12.23.34: Failed to connect to the host via ssh: ssh: connect to host
It then waits some more, and finally the output of the wait_for_connection task gets printed like so:
TASK [Check host reachability] *************************************************
fatal: [172.12.23.34]: FAILED! => {"changed": false,
"elapsed": 169, "msg": "timed out waiting for ping module test:
Data could not be sent to remote host "172.12.23.34".
Make sure this host can be reached over ssh: ssh:
connect to host 172.12.23.34 port 22: Connection timed out\r\n"}
...ignoring
As you can see in the task output, the wait_for_connection task alone waited for 169 seconds, even though I specified a much lower value.
Am I doing something wrong? Is this the default behavior?
Extra questions:
- Is this because Ansible tries to gather facts before even starting the wait_for_connection task? That was the reason I put wait_for_connection in a pre_task.
- Is the 169 seconds related to the default SSH timeout settings, rather than being random? I get a different value every time I run the playbook, so I don't think so.
- Please share any alternative approaches to fix the two problems above.
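On the last point, one possible alternative, shown only as an untested sketch: probe the SSH port from the controller with wait_for delegated to localhost, so that no SSH session is attempted against a powered-off host, and then drop that host with meta: end_host. The 5-second timeout, the ssh_probe variable name, and the fallback to port 22 are assumptions for illustration. end_host is used here instead of end_play because end_play ends the play for all hosts, while end_host only ends it for the current one.

```yaml
- hosts: all
  gather_facts: false
  pre_tasks:
    # Runs on the controller (delegate_to: localhost), so it only opens a
    # TCP connection to the target's SSH port and never starts an SSH session.
    - name: Probe the SSH port from the controller
      ansible.builtin.wait_for:
        host: "{{ ansible_host | default(inventory_hostname) }}"
        port: "{{ ansible_port | default(22) }}"  # assumption: SSH listens on 22 unless ansible_port is set
        timeout: 5                                 # assumption: 5 seconds, as in the original pre_task
      delegate_to: localhost
      register: ssh_probe                          # hypothetical variable name
      ignore_errors: true

    # end_host removes only the current host from the rest of the play,
    # without marking it as failed.
    - name: Skip this host if the SSH port did not answer
      ansible.builtin.meta: end_host
      when: ssh_probe is failed

  roles:
    - role: roles/somerole
```

Whether this actually shortens the wait depends on how quickly the TCP connection attempt fails in a given network, so it would need testing against a host that is really powered off.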
Any help would be appreciated.