This is a two-part question. I sort of worked around the problem I asked about yesterday (third task crashes on the third of 4 machines) by splitting my playbook into 3 parts. Ugly, but for now it works, at least on four machines at a time. I'm testing a bigger inventory now.
Possibly pertinent information: We deploy into a closed network (behind a VPN) which is composed of over 500 VMs on hardware spread across the US. The ssh connection to these machines is SLOW due to a config problem that I am not, at the moment, allowed to correct. As such, it can take as long as two-plus minutes from when I request an ssh connection to when it actually establishes. Again, I know what the issue is and how to fix it, but am not allowed to at the moment.
What I’ve noticed: as ansible moves through the playbook, it moves up and down the inventory list. In other words, it will start task one on machine one (of 5, let’s say) and move sequentially (seemingly in pairs) through to machine 5, then the next task starts on machine 5 and moves back up the list from 4 to 3 to 2 to 1. I’ve noticed that each task executes quickly on the first two machines it touches, and that those were the last two machines in the previous task. I’ve also noted that each task executes quite quickly on a consecutive pair of machines (regardless of geographic location, btw), but then there is a lonnnnnnng delay before the same task executes on the next pair.
Questions: This leads me to wonder: is there a timeout occurring in ansible? Does it actually open ssh tunnels in pairs?
I can’t fix the built-in delay in the VMs right now, so I am hoping to make changes in ansible until this can be fixed on the hosts. Right now my suspicion is that something is timing out: the pair of machines that starts one task is not touched again until it is the end pair of the next task, and that long gap may be what causes the failure.
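For reference, these are the ansible.cfg settings that look relevant to this theory. This is only a sketch of what I would try: the setting names are real ansible.cfg options, but the values are untested guesses on my part, and I am assuming (not certain) that the stock ssh arguments keep a connection open for only 60 seconds.

```ini
[defaults]
# Seconds to wait for a connection to establish. The default is short
# (10, as far as I know), far less than our two-plus minute ssh setup.
timeout = 300
# Number of hosts worked on in parallel. Illustrative value only.
forks = 10

[ssh_connection]
# Keep the ssh control connection alive for 30 minutes so a host that
# was used early in one task is still connected late in the next one.
ssh_args = -o ControlMaster=auto -o ControlPersist=30m
# Fewer ssh operations per task; needs requiretty disabled in sudoers.
pipelining = True
```

If the persisted connection really does expire between tasks, that would fit the pattern of the last pair from one task being fast at the start of the next.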
Then again my theory could be hooey and we’re just cursed.
regards, Richard
PS: while I wrote the above I was running a playbook that makes a directory, then moves and untars a 14.5 MB file using unarchive. This worked fine on a 4-machine inventory. It crashed on the unarchive task after the third machine.