wait_for_connection module waits longer than the specified timeout

Hey everyone,

I have multiple playbooks that run on a schedule against lots of hosts, some of which are sometimes turned off for cost saving.

Almost all jobs on AWX are marked as failed because there is at least one host that is powered off. That is not very aesthetically pleasing, and it also makes it hard to tell when a job has actually failed on an important task on a host.

Another inconvenience is that the jobs take a long time to execute when lots of hosts are unreachable, because Ansible hangs on them while waiting for the connection. I tried decreasing the timeout setting in our ansible.cfg to 20 seconds, which did help a bit, but Ansible still hangs on the (turned off) hosts for a long time before the tasks carry on with the other hosts.

The solution for me was:

  • Add a pre_task to each playbook that runs a wait_for_connection task
  • Check whether it fails, and if so end the play without proceeding, like so:

- hosts: all
  gather_facts: no
  pre_tasks:
    - name: Check host reachability
      wait_for_connection:
        timeout: "{{ ssh_timeout_wait_for | default(5) }}"
        sleep: 1
      ignore_errors: true
      ignore_unreachable: true
      register: host_is_reachable

    - name: End play if host is unreachable
      meta: end_play
      when: host_is_reachable.failed
  roles:
    - role: roles/somerole

This seems to fix my first problem of jobs being marked as failed if one host is unreachable.

But it does not fix my second problem, which is Ansible hanging on the unreachable hosts for so long.

In the wait_for_connection task I have set the timeout to 5 seconds, expecting that Ansible would try to reach the host and, if it failed to do so within 5 seconds, end the play. But it does not do that.

Instead, Ansible hangs on the unreachable host for more than 2 minutes and then throws a warning like this:

[WARNING]: Unhandled error in Python interpreter discovery for host
172.12.23.34: Failed to connect to the host via ssh: ssh: connect to host

And then waits some extra time and then the output of the wait_for_connection task gets printed like so:

TASK [Check host reachability] *************************************************
fatal: [172.12.23.34]: FAILED! => {"changed": false,
"elapsed": 169, "msg": "timed out waiting for ping module test:
Data could not be sent to remote host "172.12.23.34".
Make sure this host can be reached over ssh: ssh:
connect to host 172.12.23.34 port 22: Connection timed out\r\n"}
...ignoring

As you can see in the task output, the wait_for_connection task alone waited for 169 seconds even though I specified a much lower value.

Am I doing something wrong? Is this the default behavior?

Extra questions:

  • Is this because Ansible tries to gather facts before even starting the wait_for_connection task? That was the reason I put the wait_for_connection in a pre_task.
  • Are the 169 seconds not random, and do they have to do with the default SSH timeout settings? I get different values every time I run the playbook, so I don’t think so.
  • Please share any alternative approach that would fix the two problems above.

Any help would be appreciated :)

Several things going on.

#0: Your post/email is dated May 9, but I didn’t see this until Tuesday May 14. This is not related to what you were asking about, but the irony of a “timeout” question taking ~5 days to land is too delicious not to mention.

#1: I have no idea what your ssh_timeout_wait_for variable might be set to. Maybe it isn’t set, so the default(5) may be kicking in. I don’t think it matters in any case, though, because of #2.

#2: Of all the possible modules you could invoke for your “Check host reachability” task, wait_for_connection is perhaps the most opposite of what you’re trying to accomplish. The whole point of that module assumes the host in question is actually down - probably because you just rebooted it in a prior task - and you want to wait for it to come back up before proceeding. And that’s what it’s doing: waiting until the host comes back up or until the end of time (which, fortunately, comes sooner in the lifespan of this task than it does for us out here in the real world). Change this to ansible.builtin.ping (or almost anything else) to get the behavior you seek.
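As a rough sketch of that swap (untested against your inventory): the play below does the check with ansible.builtin.ping and uses meta: end_host, which ends the play only for the current host, where your version uses end_play. The variable name host_check is made up for illustration, and roles/somerole is copied from your playbook. How long an unreachable host blocks should then depend on the connection timeout you already lowered in ansible.cfg, not on a module-level timeout.

```yaml
- hosts: all
  gather_facts: false
  pre_tasks:
    - name: Check host reachability with a single ping attempt
      ansible.builtin.ping:
      ignore_errors: true
      ignore_unreachable: true
      register: host_check   # hypothetical variable name

    - name: Skip the rest of the play for hosts that did not answer
      ansible.builtin.meta: end_host
      when: (host_check.unreachable | default(false)) or (host_check.failed | default(false))
  roles:
    - role: roles/somerole
```

The default(false) filters are there because, as far as I know, an unreachable result carries an "unreachable" key while a failed one carries "failed", and neither key is guaranteed to be present in the other case.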

#3: “timeout” is one of the most horribly encapsulated concepts in Ansible (and a lot of other software). It’s used to describe aspects of

  • connections,
  • running times for
      • tasks
      • plays
      • playbooks
      • workflows
  • pauses
  • async task management
  • before iterations
  • between iterations
  • plus whatever crazy foo any particular plugin might want to do with a bit of spare time

I say “encapsulated concepts” because, while all those things are mentioned somewhere in the docs, there’s no single place you can look and see them all laid out side by side, compared and contrasted, with the interplay between them discussed. To be fair, none of the specific docs where some timeout is discussed should be the canonical home of such an overview. That “General Discussion of All Things Timing” page is yet to be written.
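To make that a bit more concrete, here is a small sketch that puts a handful of these timing knobs next to each other in one play. It is only an illustration: the commands under /usr/local/bin are hypothetical placeholders, and the task-level timeout keyword assumes a reasonably recent Ansible (2.10 or later, as far as I know).

```yaml
- hosts: all
  gather_facts: false
  tasks:
    - name: Task keyword timeout, the task is aborted after 30 seconds
      ansible.builtin.command: /usr/local/bin/slow_job   # hypothetical command
      timeout: 30

    - name: Async limit of 300 seconds, status polled every 10 seconds
      ansible.builtin.command: /usr/local/bin/slow_job   # hypothetical command
      async: 300
      poll: 10

    - name: Module-level timeout, wait up to 60 seconds for port 22 to open
      ansible.builtin.wait_for:
        port: 22
        timeout: 60

    - name: Retry loop, retries set to 5 with a 3 second delay between attempts
      ansible.builtin.command: /usr/local/bin/health_check   # hypothetical command
      register: health
      until: health.rc == 0
      retries: 5
      delay: 3

    - name: Plain pause of 10 seconds
      ansible.builtin.pause:
        seconds: 10
```

None of these settings touch the connection timeout, which is configured separately (for example in ansible.cfg), and that is part of why the word "timeout" gets confusing.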

To get a feel for ping vs wait_for_connection, consider this snippet of bash script. You’ll need to substitute actual host names for “reachable.host” and “unreachable.host”. The tl;dr (too long, didn’t run) upshot is: you don’t want to use wait_for_connection as a reachability test.

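# Run each module against one reachable and one unreachable host,
# once with a short and once with a longer connection_timeout, and time each run.
for module in ansible.builtin.ping ansible.builtin.wait_for_connection ; do
  for ct in 2 20 ; do
    printf "module: %s with connection_timeout: %d\n" "$module" "$ct"
    time ansible all -i reachable.host,unreachable.host, -m "$module" -e "connection_timeout=$ct" -v
  done
done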