Usage of Python's subprocess.Popen.communicate()

Hi, I wasn’t sure whether to report this as a bug or not.

The issue was this: I have an rsync task that I use for efficiently copying directory hierarchies to remote machines. The symptom is that it would sometimes hang during the rsync task. ps would reveal that the rsync process was done but was being held onto by the parent process (defunct). If I killed ansible and re-ran the playbook, it would usually work.

After a bunch of troubleshooting, I finally realized what the issue was. I have this in my ssh configuration (among other stuff):

ControlPath ~/.ssh-control/%r@%h:%p
ControlMaster auto

When there isn’t already a control connection open, ssh (invoked by rsync) will create a new one. This new process outlives the rsync/ssh that spawned it, and holds onto stderr. Ansible is using Popen.communicate() from python’s subprocess module, which not only blocks until the process finishes (that is, until wait() returns) but also blocks until stdout and stderr from the command are closed.
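
Here’s a tiny sketch of what I mean (assuming a POSIX shell; the backgrounded sleep stands in for the ssh ControlMaster process):

    import subprocess

    # The shell exits immediately, but the backgrounded sleep inherits the
    # stdout/stderr pipes and keeps their write ends open, so communicate()
    # blocks for the full 30 seconds even though the direct child is long gone.
    p = subprocess.Popen(
        ["sh", "-c", "sleep 30 & echo done"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    out, err = p.communicate()  # hangs until the sleep exits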

I think this is bad behavior – instead, ansible should only wait() for the process to finish. (I mean the wait syscall – the Popen.wait() documentation lists some caveats about deadlocks, so the changes required probably involve reading the output buffers directly or something. I’m not sure of the details because I’m not really a Python user, but in other languages you can certainly call another process and wait for it to finish without waiting for its stdout/stderr to be closed, so I’m sure it’s possible in Python as well.)
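
Something roughly like this is what I have in mind (just a rough sketch, not a patch against ansible’s actual code; the helper name is made up):

    import os
    import select
    import subprocess

    def run_without_waiting_for_pipes(cmd):
        # Hypothetical helper: drain whatever output is readable, but return
        # as soon as the child itself has exited, instead of waiting for EOF
        # on its pipes, which a long-lived grandchild (e.g. an ssh
        # ControlMaster) can hold open indefinitely.
        p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        out, err = b"", b""
        watched = [p.stdout, p.stderr]
        while True:
            if not watched:          # both pipes hit EOF; just collect the exit status
                p.wait()
                break
            exited = p.poll() is not None
            rfds, _, _ = select.select(watched, [], [], 0.1)
            for f in rfds:
                chunk = os.read(f.fileno(), 4096)
                if not chunk:        # writer side closed; stop watching this pipe
                    watched.remove(f)
                elif f is p.stdout:
                    out += chunk
                else:
                    err += chunk
            # return once the process has exited and nothing more is readable
            if exited and not rfds:
                break
        return p.returncode, out, err

The tradeoff is that anything a leftover background process writes after the command itself has exited gets dropped, which seems fine for this case.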

I also think ssh’s behavior itself is wonky, and I sent the openssh dev ML (openssh-unix-dev@mindrot.org) a note as well. (You can read the January archive here and search for my name to see the email).

Thoughts about this?

Thanks!
-Caleb

First, please let us know what version of Ansible you are using.

I assume you may be referring to this, but it would have been helpful if you referenced line numbers. (Please do?)

This is from ssh_alt.py actually, on the devel branch:


    # only break out if we've emptied the pipes, or there is nothing to
    # read from and the process has finished.
    if (not rpipes or not rfd) and p.poll() is not None:
        break

It’s actually pretty vital that that check stays in there, though I’d welcome experimentation that keeps equivalent behavior.

There can still be output left to read after the process has finished, which we’ve observed empirically.

Apologies; I actually forgot several pieces of information I meant to give.

ansible --version says ‘ansible 1.5’ (we’re using a particular pinned commit that has some post-1.4 fixes).

The code I’m looking at is in the ~/.ansible/tmp/[RANDOM ID THING]/command file that’s created when I run an ansible command. It appears to be copied from lib/ansible/module_utils/basic.py:

https://github.com/ansible/ansible/blob/devel/lib/ansible/module_utils/basic.py#L1007

That’s the line where it calls Popen.communicate().

I can actually reproduce this completely locally if I add localhost to my $ANSIBLE_HOSTS file (and have passwordless ssh allowed to localhost):

# First kill any open ssh control connection to localhost
$ ps -ef | grep "ss[h]: .*localhost" | awk '{print $2}' | xargs kill 2>/dev/null
$ ansible localhost -a 'ssh localhost ls' # hangs – eventually I ctrl-c
$ ansible localhost -a 'ssh localhost ls' # works the second time because there's already a control connection running

As I noted in my original email, I have ControlMaster auto in my ~/.ssh/config. If I turn it to off, then the problem does not exist.

Ok, so that’s not asking about the connection. Invoking ssh localhost is pretty bizarre ansible usage and the module_utils function is a pretty standard communicate call pattern.

I suspect what’s happening is your command really isn’t exiting, but I’d be curious how you could make it better.

The correct Ansible patterns here would be to do this:

  • local_action: shell ls

Or if remote:

  • shell: ls

And I know lots of people did “local_action: rsync” fine before the synchronize module existed, and pretty much everyone on Ubuntu is using ControlMaster auto by default.

See the defaults in ansible.cfg.


#ssh_args = -o ControlMaster=auto -o ControlPersist=60s

Ok, so that's not asking about the connection. Invoking ssh localhost is
pretty bizarre ansible usage and the module_utils function is a pretty
standard communicate call pattern.

I think my original email was pretty clear. The problem is when calling
rsync, but just calling ssh directly reproduces the issue and skips the
middleman. Obviously I'm not calling ssh in my actual deployment code.

I suspect what's happening is your command really isn't exiting, but I'd
be curious how you could make it better.

My command absolutely is exiting.

# First kill any open ssh control connection to localhost
$ ps -ef | grep "ss[h]: .*localhost" | awk '{print $2}' | xargs kill 2>/dev/null
$ ansible localhost -a 'ssh localhost ls && echo done >> /tmp/foobar'

Now it hangs. You can look in /tmp/foobar and see that it's done.
Furthermore, you can ps -ef | grep ssh and see the process:

caleb 15918 15917 0 20:01 pts/1 00:00:00 [ssh] <defunct>

It's defunct because it has exited and the parent process (the ansible
program in ~/.ansible/tmp/...) is still holding onto it.

I can prove that it's stuck on the communicate() line waiting for stderr to
close:

# First kill any open ssh control connection to localhost
$ ps -ef | grep "ss[h]: .*localhost" | awk '{print $2}' | xargs kill 2>/dev/null
$ ansible localhost -a 'bash -c "exec 2>&- ssh localhost ls"' # does NOT hang

(2>&- is shell syntax for closing stderr)

The correct Ansible patterns here would be to do this:

- local_action: shell ls

Yes, my real code where I discovered this issue does

- local_action: rsync ...

Or if remote:

- shell: ls

And I know lots of people did "local_action: rsync" fine before the
synchronize module existed, and pretty much everyone on Ubuntu is using
ControlMaster auto by default.

Ubuntu does not use ControlMaster by default afaik. ControlMaster is 'no'
by default (see man ssh_config) and /etc/ssh/ssh_config on my Ubuntu 13.10
machine does not set ControlMaster. If you are referring to Ansible's own
ssh invocation, see below...

See the defaults in ansible.cfg.

#ssh_args = -o ControlMaster=auto -o ControlPersist=60s

Ansible's ssh settings have nothing to do with this. What matters is the
system's ssh configuration, because we're invoking ssh (through rsync) by
shelling out.
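
To illustrate (a simplified sketch; the hostname and paths are made up, and the real task builds its command line differently), the rsync we spawn runs the system ssh, which reads ~/.ssh/config, so Ansible's own ssh_args never enter the picture:

    import subprocess

    # rsync shells out to the system ssh, which honors ~/.ssh/config
    # (ControlMaster/ControlPath included); ansible.cfg's ssh_args only
    # affect Ansible's own connection, not this child process.
    p = subprocess.Popen(
        ["rsync", "-az", "-e", "ssh", "src/", "remotehost:/dest/"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    out, err = p.communicate()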

By the way, thanks for pointing out the synchronize module. Last I checked (when I wrote my own rsync task which spawned this thread) it wasn’t merged. I’ll be glad to use that instead of our custom code.

That module is doing essentially the same thing as my rsync task, and I can reproduce the hang with it as well (in the same manner as I’ve been describing, with ControlMaster auto in my ssh configuration and killing the ssh connection beforehand.)

“I think my original email was pretty clear.”

Alas, but I didn’t.

So in your case, you’re initiating rsync from the remote, and not using local_action.

Can you look into whether synchronize works for you? Assuming you are rsyncing files from a given host to another, and one is not localhost, please see this example:

# Synchronization of src on delegate host to dest on the current inventory host
synchronize: >
    src=some/relative/path dest=/some/absolute/path
delegate_to: delegate.host

Which is linked here: http://docs.ansible.com/synchronize_module.html

"I think my original email was pretty clear."

Alas, but I didn't.

So in your case, you're initiating rsync from the remote, and not using
local_action.

Oh I see the problem. Yes, my repro is running ssh as a remote action. This
incidentally does reproduce the behavior, but I see why that's confusing.

Let me try again to explain what's going on.

Here is my rsync task that hangs: https://gist.github.com/cespare/8438518

Here is a minimal playbook that reproduces the issue using ssh directly
instead of using rsync: https://gist.github.com/cespare/8438767

We'll use localhost as the remote host as well, but it can be anything.
First, make sure localhost appears somewhere in the ANSIBLE_HOSTS file.
Next, kill any existing open ssh connections to localhost:

$ ps -ef | grep "ss[h]: .*localhost" | awk '{print $2}' | xargs kill 2>/dev/null

Run the first test:

$ ansible-playbook hang_playbook.yml -e 'hosts=localhost task_to_run=1'

Kill the connections again, and then run the second test:

$ ansible-playbook hang_playbook.yml -e 'hosts=localhost task_to_run=2'

The first test hangs, but the second does not. The reason is the same as
I've been explaining: Popen.communicate() is waiting for stderr to close.

Can you look into whether synchronize works for you, assuming you are
rsyncing files from a given host to another, and one is not local host,
please see this example:

I think you missed my follow-up email above. I *have tried* the synchronize
module, and it's doing the same thing as my own rsync task (running rsync
as a local_action), and it also has the hanging issue that I've been
describing.

Yeah, to be clear, I don’t think it’s illegal to run ssh as a remote action, but the likelihood of SSH going into some kind of prompt mode when you are doing that is quite high (host key not added yet, needs password, credentials not forwarded, etc.).

So I wanted to be sure to rule that out.

So do we agree that this is an issue? Should I file an issue and/or send a pull request? My fix would involve using subprocess in a different way (not using communicate) so that we wait for the process to finish but not for it to close stdout/stderr.

I don’t agree it’s an issue yet as I haven’t had time to reproduce it, though I’m not saying it’s not – but it hasn’t been something that’s been reported before (in two years).

(It sounds like there’s a high chance ssh is going interactive and asking you a question, if I had to bet – but I could be wrong)

If you can find a way to tweak the run code that solves your problem, I’d be interested in seeing it, so would you like to poke around?

I don't agree it's an issue yet as I haven't had time to reproduce it,
though I'm not saying it's not -- but it hasn't been something that's been
reported before (in two years).

Most folks don't use ControlMaster auto in their ~/.ssh/config.

(It sounds like there's a high chance ssh is going interactive and asking
you a question, if I had to bet -- but I could be wrong)

I have provided a lot of evidence that this is not the case, and it's the
communicate() thing I described.

   - The ssh process (or sh process, if run through a shell) is defunct -- i.e. not running
   - Things done after ssh in the command execute (see the '&& echo done >> /tmp/foobar' example)
   - Under the same conditions, redirecting stderr to /dev/null fixes the issue

If you can find a way to tweak the run code that solves your problem, I'd
be interested in seeing it, so would you like to poke around?

Sure, I'll send over a PR when I get a chance to look into it further.

I’ve been using control master for a long time, I was using it with ansible even before it was incorporated as a default option (most used paramiko because it was faster w/o it) and I’ve never encountered these kind of issues w/o it being caused by a command or script keeping a tty resource open (stdin/out/err).

I've been using control master for a long time, I was using it with
ansible even before it was incorporated as a default option (most used
paramiko because it was faster w/o it) and I've never encountered these
kind of issues w/o it being caused by a command or script keeping a tty
resource open (stdin/out/err).

That's what's happening. ssh is keeping stderr open.

That use case is one of the reasons why OpenSSH's BatchMode ("-oBatchMode=yes") exists.