Trying to speed up Ansible with serial, but it constantly hangs after a while

We have a few hundred servers I'm trying to run an Ansible playbook against. If I've understood correctly, it runs one host at a time by default, and with serial you can increase that. Whether I set serial to 100% or to 10, it just hangs at some point.

With serial: 10 I had 10 SSH connections constantly until it hung; after that I had only 1 connection and network usage spiked.

---
- name: "Add or remove users"
  hosts: tag_ansible_managed_true
  # serial:
  #  - 10
  become: true
  gather_facts: true
  strategy: free # noqa: run-once[play]
  roles:
    - amazon_user_accounts
...
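(As far as I understand, serial only sets how many hosts go into each batch; the actual parallelism inside a batch is still capped by the forks setting, which defaults to 5. A minimal ansible.cfg sketch for raising it to match the batch size:)

[defaults]
forks = 10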

Seems it's not serial-related, but it always hangs at the exact same position and ends up in a situation like in the screenshot.

(At the Ctrl+C point Ansible had been running for 37m 44s. Possibly some timeout at 35 min?)

It could be that you are experiencing network issues: the SSH connection is dead, but the ssh client hasn't recognized it yet. You may want to add -o ServerAliveInterval=60 (or some other value; note it takes a plain number of seconds) to your ssh_extra_args.
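For example, as an inventory variable (a sketch; the group_vars location is just illustrative):

# group_vars/all.yml (illustrative path)
ansible_ssh_extra_args: "-o ServerAliveInterval=60"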

It's hard to say without more info. You might try strace to see what the processes are actually doing.


Tough to say without more info, but one thing I've encountered in the past is a dead NFS or NAS mount on a target machine causing the job to hang during the gather_facts stage.
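If a dead mount turns out to be the culprit, one workaround (a sketch, reusing the play from the original post and assuming the hang is in the mount-related facts) is to skip the hardware fact subset at the play level:

- name: "Add or remove users"
  hosts: tag_ansible_managed_true
  become: true
  gather_facts: true
  gather_subset:
    - "!hardware"  # mount facts live in the hardware subset; a dead NFS mount can block them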

Yeah, I found some TCPKeepAlive and ServerAliveInterval options that I tried to add.
Currently ssh_args looks like this:

[ssh_connection]
ssh_args=-o ServerAliveInterval=30 -o ServerAliveCountMax=3 -o TCPKeepAlive=yes -o ControlMaster=auto -o ControlPersist=60s -o ControlPath=/tmp/ansible-ssh-%h-%p-%r

It always dies after the first of these two tasks completes, around 50 servers in:

TASK [amazon_user_accounts : Configure /etc/sudoers.d/] ***************************************************************************************************

TASK [amazon_user_accounts : Set authorized_key for the user (exclusive)] *********************************************************************************

Trying to strace the process; I'll report back if I find anything more.

Not sure what I'm looking for in strace, but when Ansible hangs, the process just spams this line nonstop:

20:08:06 clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, {tv_sec=28132, tv_nsec=516224116}, NULL) = 0

Depending on the batch size, it always stops after the same server's task completes, right when it's supposed to start the next task.

You might have to log into the affected server that ssh is still connected to and work out what the module execution is doing.

Ultimately it sounds like Ansible is waiting for a module result from a server, and that result is never coming. When that happens it's usually one of two things:

  1. The connection is dead and ssh hasn't noticed
  2. The module on the remote side is hung for some reason

If that result never comes, ansible will appear to just hang.
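If you can log into the host that still has the open SSH connection, something like this may show what the module is stuck on (a sketch; AnsiballZ_* is the name Ansible gives the module payload it copies to the target):

# On the stuck target host:
ps -ef | grep -i ansiball    # find the running module wrapper, e.g. AnsiballZ_setup.py
ls ~/.ansible/tmp/           # per-run temp dirs containing the module payload
strace -p <PID> -f           # attach to the hung module process to see what it blocks on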

You can check out this gist for help diagnosing why a module execution may be hanging on a target host: How to debug a hanging ansible module · GitHub


I found the issue :face_exhaling: : somewhere along the way, ssh tried to ask me whether I trust the IP address of a new server I hadn't yet connected to…
(I saw this after I redirected the Ansible output to a file instead of the console.)

Playbook run took 0 days, 0 hours, 11 minutes, 58 seconds with a serial batch size of 10 and 10 forks.

I think I'll try to add a loop before the main task that goes through all the servers and checks that I can connect to them; that should make this kind of problem more visible, I think :thinking:
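Something like this minimal pre-flight play is what I have in mind (a sketch, reusing the same host pattern; ansible.builtin.ping does a full SSH round trip without gathering facts):

---
- name: "Pre-flight connectivity check"
  hosts: tag_ansible_managed_true
  gather_facts: false
  tasks:
    - name: Verify each host answers over SSH
      ansible.builtin.ping:
...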

You can avoid that if you set
$ export ANSIBLE_HOST_KEY_CHECKING=False

https://docs.ansible.com/ansible/latest/inventory_guide/connection_details.html#managing-host-key-checking
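Or set it persistently in ansible.cfg, per that same docs page:

[defaults]
host_key_checking = False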