Hi, I wasn’t sure whether to report this as a bug or not.
The issue is this: I have an rsync task that I use to copy directory hierarchies to remote machines efficiently. The symptom is that the playbook would sometimes hang during the rsync task. ps would show that the rsync process had finished but had not yet been reaped by the parent, leaving it defunct (a zombie). If I killed ansible and re-ran the playbook, it would usually work.
After a bunch of troubleshooting, I finally realized what the issue was. I have this in my ssh configuration (among other stuff):
    ControlPath ~/.ssh-control/%r@%h:%p
    ControlMaster auto
When there isn’t already a control connection open, ssh (invoked by rsync) creates a new master process. This master outlives the rsync/ssh that spawned it and keeps the inherited stderr pipe open. Ansible uses Popen.communicate() from Python’s subprocess module, which not only blocks until the process finishes (that is, until wait() returns) but also blocks until the command’s stdout and stderr are closed.
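Here’s a minimal reproduction of the difference (no Ansible or ssh needed) — a backgrounded sleep stands in for the ControlMaster process, since it inherits the pipes and keeps them open after the immediate child exits. This is just an illustrative sketch, not Ansible’s actual code:

```python
import subprocess
import time

# The child (sh) exits immediately, but it leaves behind a background
# grandchild (sleep) that inherits the stdout/stderr pipes and keeps
# them open -- just like ssh's ControlMaster does.
p = subprocess.Popen(
    ["sh", "-c", "sleep 2 & echo done"],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)

t0 = time.time()
p.wait()  # returns as soon as sh itself exits
wait_elapsed = time.time() - t0

t0 = time.time()
# communicate() keeps reading until EOF on both pipes, which only
# happens once the backgrounded sleep exits (~2 seconds later).
p.communicate()
comm_elapsed = time.time() - t0
```

On my understanding, wait() returns almost instantly here, while communicate() blocks for the full two seconds — which is exactly the hang I was seeing, except that a ControlMaster can live far longer than two seconds.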
I think this is bad behavior – instead, ansible should only wait() for the process itself to finish. (I mean the wait syscall. The Popen.wait() documentation lists some caveats about deadlocks, so the change probably involves ansible reading the output buffers itself or something similar. I’m not sure of the details because I’m not really a Python user, but in other languages you can certainly run another process and wait for it to finish without also waiting for its stdout/stderr to be closed, so I’m sure it’s possible with Python as well.)
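For what it’s worth, here is a rough sketch of the kind of thing I mean: drain the pipes with select() while the child is alive, but stop once it has exited rather than waiting for EOF. The function name and structure are mine, purely hypothetical — I’m not claiming this is how Ansible should implement it:

```python
import os
import select
import subprocess


def run_and_wait(cmd, poll_interval=0.1):
    """Run cmd, draining stdout/stderr as it runs, but return once the
    process itself has exited -- even if a grandchild (e.g. an ssh
    ControlMaster) still holds the pipes open.  Hypothetical sketch."""
    p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    bufs = {p.stdout.fileno(): bytearray(), p.stderr.fileno(): bytearray()}
    open_fds = set(bufs)
    while open_fds:
        ready, _, _ = select.select(list(open_fds), [], [], poll_interval)
        for fd in ready:
            chunk = os.read(fd, 4096)
            if chunk:
                bufs[fd].extend(chunk)
            else:
                open_fds.discard(fd)  # genuine EOF on this pipe
        if not ready and p.poll() is not None:
            break  # process exited and no more data pending; don't wait for EOF
    rc = p.wait()
    return rc, bytes(bufs[p.stdout.fileno()]), bytes(bufs[p.stderr.fileno()])
```

With the same sleep-in-the-background test case as above, this returns in a fraction of a second with the output intact, instead of blocking until the grandchild dies.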
I also think ssh’s behavior itself is wonky, and I sent the openssh dev ML (openssh-unix-dev@mindrot.org) a note as well. (You can read the January archive here and search for my name to see the email).
Thoughts about this?
Thanks!
-Caleb