We have around a hundred servers that I’m trying to run an Ansible playbook against. If I’ve understood correctly, it runs one host at a time by default, and serial lets you increase this. Whether I set serial to 100% or to 10, it just crashes at some point.
With serial: 10 I had 10 SSH connections open constantly until it hung, after which I had only 1 and network usage spiked.
It could be that you are experiencing network issues, and although the SSH connection is dead, the ssh client hasn’t recognized it. You may want to add -o ServerAliveInterval=60s or some other value to your ssh_extra_args.
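As a rough sketch of the keepalive suggestion, the option could be set through the `ansible_ssh_extra_args` inventory variable. The file location and the specific values are assumptions chosen for illustration, not something from this thread:

```yaml
# group_vars/all.yml (hypothetical location; any inventory variable scope should work)
# Ask the ssh client to probe the server every 60 seconds and to give up
# after 3 unanswered probes, so a dead connection is noticed rather than
# waited on forever. The values 60 and 3 are illustrative assumptions.
ansible_ssh_extra_args: "-o ServerAliveInterval=60 -o ServerAliveCountMax=3"
```

With these values a silently dropped connection should be torn down after roughly three minutes, and Ansible can then report the host as unreachable.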
It’s hard to say without more info. You might try using strace to see if you can determine what activity is happening in the processes.
Tough to say without more info, but one thing I’ve encountered in the past is a dead NFS or NAS mount on a target machine causing my job to hang during the gather_facts stage.
It always dies after the first of these two tasks completes, about 50 servers in:
TASK [amazon_user_accounts : Configure /etc/sudoers.d/] ***************************************************************************************************
TASK [amazon_user_accounts : Set authorized_key for the user (exclusive)] *********************************************************************************
I’m trying to strace the process; I’ll report back if I find anything more.
You might have to log into the affected server that ssh is still open to, and determine what the module execution is doing.
Ultimately it sounds like ansible is waiting for a module result that is never coming back from one of the servers. When that happens it’s usually 1 of 2 things:
The connection is dead and ssh hasn’t noticed.
The module on the remote side is hung for some reason.
If that result never comes, ansible will appear to just hang.
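One way to keep a stuck task from hanging the whole run is the task-level `timeout` keyword (available in Ansible 2.10 and later). The snippet below is only a sketch: the task name, the command path, and the 120-second limit are hypothetical and not taken from the playbook discussed here.

```yaml
# Hypothetical task; the command path and the limit are assumptions.
- name: Run a step that might hang on the remote side
  ansible.builtin.command: /usr/local/bin/slow-step
  # Fail this task for the host if it has not finished within 120 seconds,
  # instead of waiting indefinitely for a result.
  timeout: 120
```

A failed task at least names the host it was stuck on, which narrows down where to look.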
I found the issue: at some point ssh asked me whether I trust the IP address of a new server I hadn’t connected to yet…
(I only saw this after I redirected the Ansible output to a file instead of the console.)
Playbook run took 0 days, 0 hours, 11 minutes, 58 seconds with a serial batch size of 10 and 10 forks.
I think I’ll try adding a loop before the main task that goes through all the servers and checks that I can connect to them; that should make this kind of problem easier to spot.
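A pre-flight check along those lines might look like the sketch below, run as the first play in the playbook. It assumes the standard `ansible.builtin.ping` module and relies on ssh’s `BatchMode=yes` option, which disables interactive prompts such as host key confirmation, so an unknown host should show up as unreachable instead of waiting for input. The play name and the `hosts: all` pattern are placeholders.

```yaml
# Hypothetical pre-flight play; adjust the host pattern to match the main play.
- name: Pre-flight connectivity check
  hosts: all
  gather_facts: false          # skip fact gathering so only the connection is tested
  vars:
    # BatchMode=yes makes ssh fail rather than prompt (for example on an
    # unknown host key), so problem hosts are reported instead of hanging.
    ansible_ssh_extra_args: "-o BatchMode=yes"
  tasks:
    - name: Confirm each host answers over SSH
      ansible.builtin.ping:
```

Any host that cannot be reached without a prompt should then appear as unreachable in the play recap before the main tasks start.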