For quite some time we’ve been seeing put_file hang in the ssh connection plugin, but haven’t had the opportunity to really dig into it.
We have seen up to 5 minutes between the PUT and the execution of the recently uploaded module file.
It turns out that the hang was on p.communicate() in put_file. It seems that put_file is susceptible to the same issues that we see in exec_command with ControlMaster:
“# We can’t use p.communicate here because the ControlMaster may have stdout open as well”
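The blocking behaviour behind that comment can be reproduced without ssh at all. The sketch below is only an illustration under assumptions: the function name time_communicate is hypothetical, and a backgrounded sleep stands in for the ControlMaster process that keeps stdout open after the command we launched has already exited. It assumes a POSIX sh is available.

```python
import subprocess
import time


def time_communicate(hold_seconds):
    """Return (stdout, elapsed seconds) for a child whose pipes outlive it."""
    # The shell exits right after `echo`, but the backgrounded `sleep`
    # inherits stdout/stderr and keeps the write ends of the pipes open.
    # The sleep is a stand-in (an assumption) for the ControlMaster.
    cmd = ["sh", "-c", "sleep %d & echo uploaded" % hold_seconds]
    start = time.time()
    p = subprocess.Popen(cmd, stdin=subprocess.PIPE,
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    # communicate() reads until EOF, and EOF only arrives once every
    # holder of the pipe has closed it, so this waits for the sleep.
    stdout, _ = p.communicate()
    return stdout, time.time() - start


if __name__ == "__main__":
    out, elapsed = time_communicate(2)
    print(out, "after %.1f seconds" % elapsed)
```

The output is available almost immediately, yet the call does not return until the background process lets go of the pipe, which matches the delay we saw between the PUT and the module execution.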
We run into scenarios where pipelining doesn’t work with password prompting, so we generally have it disabled.
I’ve resolved the issue by moving the subprocess.Popen and p.communicate handling into two new methods, _run and _communicate, and calling them from both exec_command and put_file.
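The shape of that refactoring looks roughly like the following. This is a simplified sketch and not the actual plugin code: the class name SketchConnection is hypothetical, commands run locally through sh, and a plain cp stands in for the scp/sftp call that the real put_file makes. Only the method names _run, _communicate, exec_command and put_file come from the change itself.

```python
import os
import select
import subprocess


class SketchConnection(object):
    """Hypothetical local stand-in for the ssh connection plugin."""

    def _run(self, cmd):
        # Single place that spawns the child with all three pipes attached.
        return subprocess.Popen(cmd, stdin=subprocess.PIPE,
                                stdout=subprocess.PIPE, stderr=subprocess.PIPE)

    def _communicate(self, p, indata=None, poll_interval=0.1):
        # Send any (small) input, then close stdin so the child sees EOF.
        if indata:
            p.stdin.write(indata)
        p.stdin.close()
        stdout, stderr = b"", b""
        rpipes = [p.stdout, p.stderr]
        while rpipes:
            # Check for exit before select, so anything written just
            # before the exit is still picked up by the select below.
            exited = p.poll() is not None
            readable, _, _ = select.select(rpipes, [], [], poll_interval)
            for pipe in readable:
                chunk = os.read(pipe.fileno(), 9000)
                if not chunk:
                    rpipes.remove(pipe)  # EOF on this pipe
                elif pipe is p.stdout:
                    stdout += chunk
                else:
                    stderr += chunk
            # Stop once the child has exited and nothing is left to read,
            # even if another process still holds the pipes open.
            if exited and not readable:
                break
        p.wait()
        p.stdout.close()
        p.stderr.close()
        return stdout, stderr

    def exec_command(self, cmd):
        p = self._run(["sh", "-c", cmd])
        stdout, stderr = self._communicate(p)
        return p.returncode, stdout, stderr

    def put_file(self, in_path, out_path):
        # Local `cp` is an assumption standing in for the scp/sftp call.
        p = self._run(["cp", in_path, out_path])
        stdout, stderr = self._communicate(p)
        if p.returncode != 0:
            raise RuntimeError("failed to transfer %s: %r" % (in_path, stderr))
```

Because both callers go through the same loop, put_file returns as soon as the transfer process exits instead of waiting for every holder of the pipe to close it.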
I can’t quantify this very well, but I have a few tasks that upload files and folders to remote servers (cached rpms, a git folder, etc.) and they almost always hang on the first run when the ControlMaster options are set. This happens with local vagrant images as well as remote Rackspace cloud images.
If I kill ansible and run it again, it always succeeds.
This was basically one of those perfect storm bugs, where when the stars align, we end up with a strange issue.
In any case, it was due largely to ‘sshpass’, which, going by our testing, is terribly slow and a bit inefficient. If possible, stay away from sshpass and use SSH keys. It is a much better way to go.
When sshpass was combined with a bastion (jump host), ControlPersist, and ‘ssh -qa nc %h %p’ as the ProxyCommand, we noticed that stdin failed to close properly and p.communicate() would just sit there waiting for the fd to close. The fd would eventually close when the ServerAliveInterval or the wait timeout configured for netcat expired, and things would move along. The ServerAliveInterval explains the variance in the “hang” times, as people had it set anywhere from 30 to 300 seconds.
In the end, all of the code needed to handle this exact scenario was already in exec_command (although it was written for a slightly different root problem).
We did some performance testing and found that the code changes did not affect performance in any discernible way.
Thank you to everyone who helped me track this down and get everything working!