put_file hangs with ControlMaster

sivel · March 21, 2014, 9:09pm

We’ve been experiencing an issue for quite some time with put_file with the ssh connection plugin hanging, but haven’t had the opportunity to really dig into it.

We have seen up to 5 minutes between the PUT and the execution of the recently uploaded module file.

It turns out that the hang was on p.communicate() in put_file. It seems that put_file is susceptible to the same issues that we see in exec_command with ControlMaster:

“# We can’t use p.communicate here because the ControlMaster may have stdout open as well”

We run into scenarios where pipelining doesn’t work with password prompting, so we generally have it disabled.

I’ve resolved the issues by moving some code for subprocess.Popen and p.communicate code into 2 new methods _run and _communicate and call them from exec_command and put_file.

You can see the changes at https://github.com/sivel/ansible/compare/put-file-control-master

I just wanted to send this out to get feedback or whatnot and just to raise visibility.

With this change, what was taking upwards of 5 minutes now completes in about 1 second.

sivel · March 21, 2014, 10:00pm

I’ll probably open an issue soonish about this, as it may be easier to track than an email thread.

Michael_DeHaan1 · March 21, 2014, 11:39pm

Very very curious.

Possible network issue?

Christopher_H_Laco · March 22, 2014, 12:18am

I can’t quantify this very well, but I have a few tasks that upload files/folders to remote servers (cached rpms, git folder, etc) and they almost always hang on the first run with the ControlMaster stuff set. This happens both to local vagrant images as well as remote Rackspace cloud images.

If I kill ansible and run it again, it always succeeds.

Michael_DeHaan1 · March 22, 2014, 12:19am

Any debugging info you can provide would be appreciated.

sivel · March 26, 2014, 7:23pm

I just wanted to follow up here to give everyone some insight into the issue and the ultimate resolution.

First, I’ll point to the pull request (that has been merged into devel) for anyone interested, and has a bit more info.

https://github.com/ansible/ansible/pull/6708

This was basically one of those perfect storm bugs, where when the stars align, we end up with a strange issue.

In any case, it was due largely to ‘sshpass’, which I will point out from testing is terribly slow and a bit inefficient. If possible stay away from sshpass and use SSH Keys. It is a much better way to go.

In combination with sshpass, when using a bastion (jump host), with ControlPersist, and ‘ssh -qa nc %h %p’ for a ProxyCommand we noticed that stdin failed to close properly and p.communicate() would just sit there waiting for the fd to close. This would eventually happen at the ServerAliveInterval or the wait timeout configuration for netcat and things would move along. The ServerAliveInterval was the cause to the variance in the “hang” times, as people had the value set from anywhere between 30 to 300 seconds.

In the end, all of the code was already in exec_command to handle this exact scenario (although from a slightly different root problem).

We did some performance testing, and found that the code changes did not impact performance in any discernible way.

Thank you to everyone who helped me track this down and get everything working!

Topic		Replies	Views
Usage of python's subprocess.Popen.communicate() Ansible Developer	14	9	January 16, 2014
module hangs at "getting output with communicate()" Ansible Developer	0	17	November 16, 2018
Initial experiences: apparent hangs. Ansible Project	1	2	August 31, 2013
file module hangs Ansible Project ubuntu	2	3	February 16, 2015
Windows put_file performance (please test!) Ansible Developer windows , aws	27	16	May 11, 2016

put_file hangs with ControlMaster

Related topics