Trying to speed up Ansible with serial, but it constantly hangs after a while

We have a few hundred servers I'm trying to run an Ansible playbook against. If I've understood correctly, it runs one host at a time by default, and with serial you can increase that. Whether I set serial to 100% or to 10, it just hangs at some point.

With serial: 10 I had 10 SSH connections constantly until it hung; after that I had only 1 connection and network usage spiked.

---
- name: "Add or remove users"
  hosts: tag_ansible_managed_true
  # serial:
  #  - 10
  become: true
  gather_facts: true
  strategy: free # noqa: run-once[play]
  roles:
    - amazon_user_accounts
...
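(As far as I understand, serial only sets how many hosts go into each batch; the actual parallelism inside a batch is still capped by the forks setting, which defaults to 5. A minimal ansible.cfg sketch for raising it to match the batch size:)

[defaults]
forks = 10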

Seems it's not serial-related, but it always hangs at the exact same position and ends up in a situation like in the screenshot.

(At the Ctrl+C point Ansible had been running for 37m 44s. Possibly some timeout at 35 min?)

It could be that you are experiencing network issues: the SSH connection is dead, but the ssh client hasn't recognized it yet. You may want to add -o ServerAliveInterval=60 (or some other value; note it takes a plain number of seconds) to your ssh_extra_args.
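For example, as an inventory variable (a sketch; the group_vars location is just illustrative):

# group_vars/all.yml (illustrative path)
ansible_ssh_extra_args: "-o ServerAliveInterval=60"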

It's hard to say without more info. You might try strace to see what the processes are actually doing.


Tough to say without more info, but one thing I've encountered in the past is a dead NFS or NAS mount on a target machine causing the job to hang during the gather_facts stage.
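If a dead mount turns out to be the culprit, one workaround (a sketch, reusing the play from the original post and assuming the hang is in the mount-related facts) is to skip the hardware fact subset at the play level:

- name: "Add or remove users"
  hosts: tag_ansible_managed_true
  become: true
  gather_facts: true
  gather_subset:
    - "!hardware"  # mount facts live in the hardware subset; a dead NFS mount can block them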

Yeah, I found some TCPKeepAlive and ServerAliveInterval options that I tried to add.
Currently ssh_args looks like this:

[ssh_connection]
ssh_args=-o ServerAliveInterval=30 -o ServerAliveCountMax=3 -o TCPKeepAlive=yes -o ControlMaster=auto -o ControlPersist=60s -o ControlPath=/tmp/ansible-ssh-%h-%p-%r

It always dies after the first of these two tasks completes, around 50 servers in:

TASK [amazon_user_accounts : Configure /etc/sudoers.d/] ***************************************************************************************************

TASK [amazon_user_accounts : Set authorized_key for the user (exclusive)] *********************************************************************************

Trying to strace the process; I'll report back if I find anything more.

Not sure what I'm looking for in strace, but when Ansible hangs, the process just spams this line nonstop:

20:08:06 clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, {tv_sec=28132, tv_nsec=516224116}, NULL) = 0

Depending on the batch size, it always stops after the same server's task completes, right when it's supposed to start the next task.

You might have to log into the affected server that ssh is still connected to and work out what the module execution is doing.

Ultimately it sounds like Ansible is waiting for a module result from a server, and that result is never coming. When that happens it's usually one of two things:

  1. The connection is dead and ssh hasn't noticed
  2. The module on the remote side is hung for some reason

If that result never comes, ansible will appear to just hang.
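If you can log into the host that still has the open SSH connection, something like this may show what the module is stuck on (a sketch; AnsiballZ_* is the name Ansible gives the module payload it copies to the target):

# On the stuck target host:
ps -ef | grep -i ansiball    # find the running module wrapper, e.g. AnsiballZ_setup.py
ls ~/.ansible/tmp/           # per-run temp dirs containing the module payload
strace -p <PID> -f           # attach to the hung module process to see what it blocks on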

You can check out this gist for help diagnosing why a module execution may be hanging on a target host: How to debug a hanging ansible module · GitHub


I found the issue :face_exhaling: : somewhere along the way, ssh tried to ask me whether I trust the IP address of a new server I hadn't yet connected to…
(I saw this after I redirected the Ansible output to a file instead of the console.)

Playbook run took 0 days, 0 hours, 11 minutes, 58 seconds with a serial batch size of 10 and 10 forks.

I think I'll try to add a loop before the main task that goes through all the servers and checks that I can connect to them; that should make this kind of problem more visible, I think :thinking:
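Something like this minimal pre-flight play is what I have in mind (a sketch, reusing the same host pattern; ansible.builtin.ping does a full SSH round trip without gathering facts):

---
- name: "Pre-flight connectivity check"
  hosts: tag_ansible_managed_true
  gather_facts: false
  tasks:
    - name: Verify each host answers over SSH
      ansible.builtin.ping:
...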

You can avoid that if you set
$ export ANSIBLE_HOST_KEY_CHECKING=False

https://docs.ansible.com/ansible/latest/inventory_guide/connection_details.html#managing-host-key-checking
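Or set it persistently in ansible.cfg, per that same docs page:

[defaults]
host_key_checking = False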