put_file hangs with ControlMaster

We’ve been experiencing an issue for quite some time where put_file in the ssh connection plugin hangs, but we haven’t had the opportunity to really dig into it.

We have seen up to 5 minutes between the PUT and the execution of the recently uploaded module file.

It turns out that the hang was on p.communicate() in put_file. It seems that put_file is susceptible to the same issues that we see in exec_command with ControlMaster:

“# We can’t use p.communicate here because the ControlMaster may have stdout open as well”
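
Roughly, the pattern in put_file that hangs looks like this. This is a simplified sketch rather than the actual plugin code, and the host, options, and paths are made up, but it shows why communicate() is a problem: the backgrounded ControlMaster inherits the pipe file descriptors, so communicate() may not see EOF until the master (or a keepalive/timeout) finally lets go of them.

    import subprocess

    # Simplified sketch of how put_file shells out to sftp over a
    # persistent connection. Host, options, and paths are illustrative.
    cmd = ["sftp", "-o", "ControlMaster=auto", "-o", "ControlPersist=60s",
           "user@example.host"]

    p = subprocess.Popen(cmd,
                         stdin=subprocess.PIPE,
                         stdout=subprocess.PIPE,
                         stderr=subprocess.PIPE)

    # The forked ControlMaster keeps a copy of our pipe fds, so even after
    # the transfer finishes, communicate() can sit here waiting for EOF.
    stdout, stderr = p.communicate(b"put /tmp/src /tmp/dest\n")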

We run into scenarios where pipelining doesn’t work with password prompting, so we generally have it disabled.
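
For context, that means something like this in our ansible.cfg (a minimal sketch; section and option names as in Ansible 1.x):

    # ansible.cfg -- pipelining doesn't play well with password prompting
    # in our environment, so we leave it off.
    [ssh_connection]
    pipelining = False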

I’ve resolved the issue by moving the subprocess.Popen and p.communicate() code into two new methods, _run and _communicate, and calling them from both exec_command and put_file.

You can see the changes at https://github.com/sivel/ansible/compare/put-file-control-master
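
For anyone who doesn’t want to click through, the shape of the change is roughly this (a simplified sketch, not the exact code in the branch): the Popen setup moves into _run, and the select()-based read loop that exec_command already used because of the ControlMaster caveat above moves into _communicate, so put_file can share it instead of calling p.communicate() directly.

    import os
    import select
    import subprocess

    def _run(cmd):
        """Start the ssh/sftp child process (simplified; the real method
        also deals with sshpass and pty details)."""
        p = subprocess.Popen(cmd,
                             stdin=subprocess.PIPE,
                             stdout=subprocess.PIPE,
                             stderr=subprocess.PIPE)
        return p, p.stdin

    def _communicate(p, stdin, indata=None):
        """Drain the child's output without p.communicate(), so a lingering
        ControlMaster holding the pipes open can't stall us indefinitely."""
        if indata is not None:
            stdin.write(indata)
        stdin.close()  # let the remote side see EOF on its input

        stdout, stderr = b'', b''
        rpipes = [p.stdout, p.stderr]
        while rpipes:
            rfd, _, _ = select.select(rpipes, [], [], 1)
            if p.stdout in rfd:
                data = os.read(p.stdout.fileno(), 9000)
                stdout += data
                if not data:
                    rpipes.remove(p.stdout)
            if p.stderr in rfd:
                data = os.read(p.stderr.fileno(), 9000)
                stderr += data
                if not data:
                    rpipes.remove(p.stderr)
            # Stop once the child has exited and nothing was ready to read,
            # even if the ControlMaster still has the pipes open.
            if not rfd and p.poll() is not None:
                break
        p.wait()
        return p.returncode, stdout, stderr

put_file then builds its sftp command, calls _run, and pushes the “put …” batch line through _communicate rather than p.communicate(), just as exec_command does.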

I just wanted to send this out to get feedback and to raise visibility.

With this change, what was taking upwards of 5 minutes now completes in about 1 second.

I’ll probably open an issue soonish about this, as it may be easier to track than an email thread.

Very very curious.

Possible network issue?

I can’t quantify this very well, but I have a few tasks that upload files/folders to remote servers (cached rpms, a git folder, etc.), and they almost always hang on the first run with the ControlMaster stuff set. This happens with both local Vagrant images and remote Rackspace cloud images.
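
For what it’s worth, by “the ControlMaster stuff” I mean ssh_args along these lines in ansible.cfg (example values, not necessarily exactly what I have set):

    # ansible.cfg (example values)
    [ssh_connection]
    ssh_args = -o ControlMaster=auto -o ControlPersist=60s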

If I kill ansible and run it again, it always succeeds.

Any debugging info you can provide would be appreciated.

I just wanted to follow up here to give everyone some insight into the issue and the ultimate resolution.

First, for anyone interested, I’ll point to the pull request (which has been merged into devel); it has a bit more info.

https://github.com/ansible/ansible/pull/6708

This was basically one of those perfect storm bugs where, when the stars align, you end up with a strange issue.

In any case, it was due largely to sshpass, which, I will point out from testing, is terribly slow and a bit inefficient. If possible, stay away from sshpass and use SSH keys; they are a much better way to go.
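
For anyone wanting to make that switch, it can be as simple as pointing Ansible at a key instead of prompting for a password (the path here is hypothetical):

    # ansible.cfg -- use key auth instead of sshpass / --ask-pass
    [defaults]
    private_key_file = /home/deploy/.ssh/id_rsa
    ask_pass = False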

In combination with sshpass, when using a bastion (jump host) with ControlPersist and ‘ssh -qa nc %h %p’ as a ProxyCommand, we noticed that stdin failed to close properly, and p.communicate() would just sit there waiting for the fd to close. The fd would eventually be closed when the ServerAliveInterval or netcat’s wait timeout expired, and things would move along. The ServerAliveInterval was the cause of the variance in the “hang” times, as people had the value set anywhere from 30 to 300 seconds.
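
To make the combination concrete, the ssh side of the setup looked roughly like this. The bastion hostname, host pattern, and timeout values below are placeholders, not our actual settings:

    # ~/.ssh/config (illustrative values)
    Host 10.0.*
        ProxyCommand ssh -qa bastion.example.com nc -w 300 %h %p
        ControlMaster auto
        ControlPersist 60s
        ServerAliveInterval 300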

In the end, all of the code needed to handle this exact scenario was already in exec_command (although it had been added for a slightly different root problem).

We did some performance testing, and found that the code changes did not impact performance in any discernible way.

Thank you to everyone who helped me track this down and get everything working!