Long-Running Commands Fail

I’m stuck on getting a long-running command to work.

I am running a devel checkout from yesterday morning:

commit b90c2356c38685b969c6dc42e4f0cee583fdf431
Merge: 461ba57 c362a2e
Author: Michael DeHaan <michael.dehaan@gmail.com>
Date: Thu May 10 05:05:50 2012 -0700

I am trying to run the Plesk installer, which takes about 8 or 9 minutes to complete. I ran it with the shell module so I could redirect the roughly 300 KB of output to a file, and even tried async:1200 with poll:5, but got nowhere. Run synchronously, the playbook hangs: it never returns an error and never moves on to the next task. With async:1200 and poll:5, the playbook counts down to 880 seconds remaining and then hangs.

Ideas?
-- Art Z.

Sounds to me like your Plesk installer went interactive or is otherwise fouled up.

Michael,

Nope; it does not go interactive. It just runs a long time. I even
tested once more just to be sure. Everything is running as expected (and
as we have been seeing it work for the last couple of years).

This is a 100% test machine. If you like, I can install your ssh public
key in root's .ssh/authorized_keys file and send you the playbook and
let you give it a try. Even better, it is a cloud server so I can
re-initialize it from an image in about 2 minutes, which makes repeated
testing reasonably time-efficient.

    -- Art Z.

Hmm... I'm still suspicious.

There isn’t a command timeout anymore.

Perhaps the lack of output for some time caused a general timeout in sshd?

I would be curious whether the same problem exists on the master branch (with sudo mode disabled, because the
non-sudo code on master is a LOT different).

Anyway, async completely daemonizes the process, so it should not behave the way a non-async'd process does.
A lack of returned output shouldn't be an issue there.

Thanks very much for the offer of access to the test machine, but I am not going to take it up for various (time/legal) reasons.

I would recommend trying to debug further.

I've seen this before, and I haven't fully grokked it, but I think the
problem is lots of output, not long-running commands. Try:

ansible machine -a 'cat /usr/share/dict/words'

which hangs, vs.

ansible machine -a 'cat /etc/issue'

which does not.

The hang is in runner._exec_command() reading the stderr "file" from
Paramiko. I think the problem is that there isn't really a stderr
(since we're using a pty), so lots of stdout fills up the channel's
buffer and blocks stderr waiting for you to read stdout. Not sure if
that's an issue with Paramiko or not, but I have a one-line fix: just
make "stderr" an empty string.

https://github.com/ansible/ansible/pull/365

The rest of the code seems prepared to handle this, and the tests
pass.

-John
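
For anyone following along, the blocking pattern John describes can be reproduced without Ansible or Paramiko. The sketch below is only an analogy and rests on an assumption: it uses ordinary OS pipes from Python's subprocess module in place of a Paramiko channel. A child process writes far more to stdout than a pipe buffer holds, and the parent tries to read stderr to the end first, so both sides wait on each other. The names CHILD, read_stderr_first, and read_stdout_only are made up for illustration; none of this is taken from Ansible's runner code or from the pull request.

```python
import subprocess
import sys
import threading

# Hypothetical child: writes ~1 MB to stdout and nothing to stderr.
# That is far more than a typical OS pipe buffer (often around 64 KiB on Linux).
CHILD = [sys.executable, "-c", "import sys; sys.stdout.write('x' * 1000000)"]


def read_stderr_first(timeout=2.0):
    """Read stderr to EOF before touching stdout. Return True if that hung."""
    proc = subprocess.Popen(CHILD, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    done = threading.Event()

    def reader():
        # stderr only reaches EOF when the child exits, but the child is
        # blocked writing to a full stdout pipe that nobody is draining.
        proc.stderr.read()
        done.set()

    thread = threading.Thread(target=reader, daemon=True)
    thread.start()
    hung = not done.wait(timeout)  # True if the read did not finish in time

    # Clean up: killing the child closes its pipe ends and unblocks the reader.
    proc.kill()
    thread.join()
    proc.stdout.close()
    proc.stderr.close()
    proc.wait()
    return hung


def read_stdout_only():
    """Merge stderr into stdout, drain stdout, and report stderr as empty."""
    proc = subprocess.Popen(CHILD, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    out = proc.stdout.read()  # one reader on one stream, so nothing backs up
    proc.stdout.close()
    proc.wait()
    return out, b""


if __name__ == "__main__":
    print("stderr-first read hung:", read_stderr_first())
    out, err = read_stdout_only()
    print("stdout bytes:", len(out), "stderr bytes:", len(err))
```

The second function loosely mirrors the idea behind the one-line fix: when the two streams are effectively combined (as with a pty), read the one stream that carries data and treat stderr as empty. For separate pipes, proc.communicate() is the usual way to drain both without blocking.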

FWIW, I've seen this behavior as well (devel from yesterday). I had
mistakenly coded "tar xzvf" instead of "tar xzf". The verbose one
hung.

I just verified that jkleint's patch fixes this for me.

matt

Excellent, thx guys, will merge shortly.

-- Michael

Great news, guys. I will be patient and try again when you get the patch
merged and pushed instead of continuing to look for a workaround.

Cheers,
-- Art Z.

merged!

Tested. Success! Thank you, gentlemen! :)

    -- Art Z.