Ansible 1.7 shell module doesn’t wait for background jobs to finish

Hi,

I’m using Ansible to install Oracle, and it worked great on 1.6. After upgrading to 1.7, some tasks that use the shell module started behaving differently, specifically those that put jobs in the background.
The Oracle installer (runInstaller) is a shell script that kicks off a Java process in the background, which then performs the actual installation. In 1.6 the play waited for the background job to finish before moving on to the next task, but as of 1.7 it only waits for the ‘kickoff’ script to return and then moves on, so the play fails.

I’m not sure if the old behaviour is the correct one, but I certainly hope so.

I’ve got a small test case that exactly mimics the behaviour I’m seeing. Gist is here (2 shell scripts & a playbook)

kickoff.sh: Starts another script (sleep.sh) in the background
sleep.sh: Does a few echoes with a sleep in between
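
For readers without access to the gist, here is a rough sketch of what the two scripts might look like, reconstructed from the output shown in this post. The temporary directory, the `demo_dir` variable and the `SLEEP_SECS` override are assumptions added for illustration; the originals live in /tmp and sleep for 30 seconds.

```shell
#!/bin/sh
# Hypothetical reconstruction of the test case; the real scripts are in the gist.
demo_dir=$(mktemp -d)

cat > "$demo_dir/sleep.sh" <<'EOF'
#!/bin/sh
# Stand-in for the long-running installer process.
secs=${SLEEP_SECS:-30}   # assumed override so the demo can run quickly
echo "Starting $0 at $(date)"
echo "Sleeping $secs seconds"
sleep "$secs"
echo "$0 Woke up"
echo "Sleeping another $secs seconds"
sleep "$secs"
echo "$0 Done. Exiting $0 at $(date)"
EOF

cat > "$demo_dir/kickoff.sh" <<'EOF'
#!/bin/sh
# Stand-in for runInstaller: start the worker in the background and return.
echo "Kicking off other script at $(date)"
"$(dirname "$0")/sleep.sh" &
echo "All finished. Returned from other script at $(date)"
EOF

chmod +x "$demo_dir/sleep.sh" "$demo_dir/kickoff.sh"
echo "Scripts written to $demo_dir"
```

Run from a terminal, `SLEEP_SECS=1 "$demo_dir/kickoff.sh"` returns to the prompt right away while sleep.sh keeps printing in the background.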

1.6 behaviour

```
[miksan@ponderstibbons ansible]$ ansible --version
ansible 1.6.10
[miksan@ponderstibbons ansible]$ time ansible-playbook background.yml

PLAY [localhost] **************************************************************

TASK: [run shellscript] *******************************************************
changed: [localhost]

TASK: [debug var=sleep.stdout_lines] ******************************************
ok: [localhost] => {
    "sleep.stdout_lines": [
        "Kicking off other script at Thu Sep 25 10:54:38 CEST 2014",
        "All finished. Returned from other script at Thu Sep 25 10:54:38 CEST 2014", # <-- kickoff.sh finishes, but waits for sleep.sh to finish
        "Starting /tmp/sleep.sh at Thu Sep 25 10:54:38 CEST 2014", # <-- sleep.sh starts (in the background)
        "Sleeping 30 seconds",
        "/tmp/sleep.sh Woke up",
        "Sleeping another 30 seconds",
        "/tmp/sleep.sh Done. Exiting /tmp/sleep.sh at Thu Sep 25 10:55:38 CEST 2014" # <-- sleep.sh finishes
    ]
}

PLAY RECAP ********************************************************************
localhost : ok=2 changed=1 unreachable=0 failed=0

real 1m0.288s
user 0m0.147s
sys 0m0.039s
```

1.7 behaviour

```
[miksan@ponderstibbons ansible]$ ansible --version
ansible 1.7.2
[miksan@ponderstibbons ansible]$ time ansible-playbook background.yml

PLAY [localhost] **************************************************************

TASK: [run shellscript] *******************************************************
changed: [localhost]

TASK: [debug var=sleep.stdout_lines] ******************************************
ok: [localhost] => {
    "sleep.stdout_lines": [
        "Kicking off other script at Thu Sep 25 10:45:15 CEST 2014",
        "All finished. Returned from other script at Thu Sep 25 10:45:15 CEST 2014", # <-- kickoff.sh finishes but doesn't wait for sleep.sh to finish
        "Starting /tmp/sleep.sh at Thu Sep 25 10:45:15 CEST 2014", # <-- sleep.sh starts (in the background) but never gets to finish
        "Sleeping 30 seconds"
    ]
}

PLAY RECAP ********************************************************************
localhost : ok=2 changed=1 unreachable=0 failed=0

real 0m1.291s
user 0m0.148s
sys 0m0.034s
```

Is this a bug in 1.7 (or 1.6)? How should I approach this?

regards
/Micke

Forgot to mention that I know how to work around this with async & polling, but I still would like to know if this is a bug or not.

/Micke

Hi Mikael,

This is not a bug. The module can only assume that the task is finished when the script returns; it has no way of knowing whether the script started background or child processes. And even if it did, it would not know whether it should wait for those to exit or not (think of a script which starts a daemonized process).

So for your situation, I would say to modify the script to wait until its tasks are complete or to use async, as you noted.

Thanks!
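
The first suggestion above (having the script wait for its own background work) can be sketched in plain shell with the `wait` builtin. The function name `run_and_wait` is made up for this example, and the approach only helps when the background job is a direct child of the calling shell; it cannot see processes that an installer forks off on its own.

```shell
#!/bin/sh
# Hypothetical wrapper: start a command in the background, then block until it exits.
run_and_wait() {
    "$@" &                # start the worker in the background
    worker_pid=$!         # remember its PID
    wait "$worker_pid"    # block until it exits; wait returns the worker's exit status
}
```

A kickoff script could then call something like `run_and_wait /tmp/sleep.sh` instead of `/tmp/sleep.sh &`, so it does not return until the worker is done and the worker's exit status is passed through.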

Ok,

So this was a bug in 1.6 then?

/M

Actually, yes. Let me look into this with your example gist. I had not looked at it yet, so I’m not sure what caused this to work in 1.6.

Yes.

(To correct something in the subject of this ticket, Ansible’s shell module definitely waits on jobs to finish)

What happened in 1.6 is that, in some rare cases (bad installers :)), a file descriptor left open by the daemonized process kept Ansible from returning on time, even though the process had already daemonized itself.
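
The general mechanism can be reproduced in plain shell, independent of Ansible. This is only an illustration of how a reader on a pipe behaves, not Ansible's actual code: the reader sees end-of-file only once every process holding the write end has closed it. The function names are invented for the demo, and `date +%s` is assumed to be available (it is on GNU and BSD systems).

```shell
#!/bin/sh
# Background job inherits stdout (the pipe): the reader blocks until the job exits.
inherits_fd() {
    (sleep 2; echo "late") &
    echo "returned"
}

# Background job has its output redirected: the pipe closes as soon as the caller exits.
detaches_fd() {
    (sleep 2; echo "late") >/dev/null 2>&1 &
    echo "returned"
}

start=$(date +%s); out1=$(inherits_fd); t1=$(( $(date +%s) - start ))
start=$(date +%s); out2=$(detaches_fd); t2=$(( $(date +%s) - start ))

echo "inherited stdout: waited ${t1}s, got: $(echo "$out1" | tr '\n' ' ')"
echo "redirected stdout: waited ${t2}s, got: $out2"
```

The first call behaves like the 1.6 run (the caller ends up waiting for the background job), the second like the 1.7 run (the caller returns as soon as the foreground script does).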

What I’d probably suggest is using async to kick off the Oracle install and finding some way to wait for it to return, or really, asking Oracle how to run their script in batch mode.

I’d be very interested in the answer to why it’s daemonizing like that.

Does this script normally go interactive and leave you with a GUI or something like that?

Yeah, that’s what happens. It kicks off a GUI and then exits the ‘kickoff’ script.

I’ve sorted it though, so this is not an issue anymore. There is a flag to instruct the installer to not spawn another process, which I would have seen had I bothered to rtfm in more detail…

Anyways, thanks for the clarification.

regards
/Micke