async timeout/polling not picking up finished process?

I have a relatively simple action that, as best I can tell after some testing, is never detected as complete: the task runs until the async timeout expires and then reports a “fail” status, even though the command itself completed successfully. Here’s the action from my playbook:

  - name: Run the installer
    action: shell chdir=/${homedir}/${copydir} ./setup.sh autoInstall-mitch.xml > installoutput.txt
    async: 1800
    poll: 15

A few things to note about this action:

  • It is a long running process, probably takes about 15-20 solid minutes to complete.
  • Without the async setting, the remote system times me out after about 10 minutes. At least, I think that’s what is happening: without async, the play just hangs even after the command on the endpoint has finished.

I’m going to try it as a “command” call instead of “shell” and see if that makes a difference, and if not I’ll try it without redirecting the output. Hopefully the latter isn’t necessary as that’s how I get the results of the installation locally in order to review as necessary on a variety of endpoints.

Just wondering if anyone had any immediate thoughts that came to mind.

Thanks!

Yep, async is definitely there as a way to get around SSH timeouts and things like that.

Command vs shell shouldn’t make a difference.

I suspect the failure is from it returning a non-zero exit code.

Thanks Michael - that’s what I thought. I started a test run last night before leaving the office and will check it when I get in; I’m guessing the results will confirm your belief that shell vs command doesn’t matter.

Something weird is going on, however. The setup.sh script that’s being run actually kicks off a Java installer (run headless) and ultimately returns zero, I think, in all cases (we have a bug filed for this with the dev team). However, the script and the associated Java processes quit long before Ansible’s async polling stops - I’ve logged into the target machine via ssh and watched to check on this. Ansible continues to poll even though the processes that it kicked off have exited.

How does the async process determine if the process is still active? Does it look for a process ID when it polls, or is there something else going on? Could the fact that I’m redirecting output from the script to a file have some effect?

Finding a minimal reproducible example and sharing it on GitHub may be the best way to track down what is going on here, but it seems most likely that at least one of the processes hasn’t returned.

I won’t say that there could not be some bugs lurking around the async implementation though. Help digging in on your end would be greatly appreciated as I don’t think we could replicate with a simple example on github – if you can though, that would be great!

–Michael
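
One way to dig into this kind of problem is to start the task without waiting for it and then check the job explicitly, so the play output shows what Ansible believes about the job on every check. The following is only a sketch, and it assumes a newer Ansible release than the one used in this thread (one with {{ }} templating, until/retries loops, and the async_status module). The registered variable names installer_job and job_result are made up for illustration; the command and paths are the ones from the original post.

```yaml
# Sketch only: assumes a newer Ansible than the ${var} syntax used above.
- name: Start the installer without waiting for it
  shell: ./setup.sh autoInstall-mitch.xml > installoutput.txt
  args:
    chdir: "/{{ homedir }}/{{ copydir }}"
  async: 1800          # maximum runtime in seconds
  poll: 0              # do not poll here; return immediately
  register: installer_job   # hypothetical variable name

- name: Check on the installer job until it reports finished
  async_status:
    jid: "{{ installer_job.ansible_job_id }}"
  register: job_result      # hypothetical variable name
  until: job_result.finished
  retries: 120         # 120 checks x 15 seconds = 1800 seconds
  delay: 15
```

If the second task keeps retrying after setup.sh and its Java processes have exited on the target, that points to the job never being recorded as finished rather than to a non-zero exit code.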

Thanks Michael. I fiddled around with it today and it ended up working, but I need to replicate the original problem and report back as to what fixed it. Compared with the original post, I’ve since removed the redirected output portion of the command (for whatever reason that appeared to spawn two shell calls on the endpoint - I’ll see if I can replicate) and also increased the async time (I doubt that was the problem - there was plenty of time before). It’s running again now, but honestly it takes about 30-45 minutes to see if it works or not. I’ll see if I can reproduce it with a simple example that you can more easily run on your end.

Michael - I wrote a quick test playbook and script but couldn’t replicate. On top of that, I started rolling some of the changes back into the original playbook that I was having problems with and the problem isn’t appearing anymore either. I’ll watch for this again and let you know if it crops up, and try to replicate then. My favorite type of problem - one that disappears! :)