Abort the entire run on a single task failure

Hello everyone,

I have a rolling upgrade performed over an inventory of, say, 20 hosts. The playbook that runs it uses serial: 3, so only 3 hosts are upgraded at a time.

What I want to achieve is this: for any given batch of 3 hosts, if any one of them fails (any task), present a prompt to the user and allow them to abort the entire run, i.e. terminate and skip the remaining hosts in the inventory.

As far as I can tell, the "prompt:" option of the pause module is the only way for me to abort the entire run of those 20 hosts (or whatever is left at the point of the failure). It pauses at the end of each batch of 3 and waits. What I am missing is a way to detect whether any of those 3 hosts failed. I want to present the prompt only if one of the three hosts has failed; if none failed, the run should continue straight to the next batch of 3.

Since the prompt task is at the end of the playbook, it is not even executed if any task before it fails. I don’t want to put “ignore_errors: True” everywhere, since I do want the playbook to terminate rather than continue.

It is as if I need a section that runs at the end of each batch, where I can test whether anything failed within that particular batch.
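To make the requirement concrete, here is a minimal sketch of the desired control flow in plain Python, outside Ansible. It is not how Ansible implements `serial`; `rolling_upgrade`, `upgrade` and `ask_abort` are hypothetical names, with `ask_abort` standing in for the interactive prompt and deciding whether the whole run stops.

```python
from typing import Callable, List


def rolling_upgrade(hosts: List[str], serial: int,
                    upgrade: Callable[[str], bool],
                    ask_abort: Callable[[List[str]], bool]) -> List[str]:
    """Upgrade hosts in batches of `serial` and return the hosts attempted.

    upgrade(host) returns True on success. After a batch with at least one
    failure, ask_abort(failed_hosts) is called; True ends the whole run.
    """
    attempted: List[str] = []
    for start in range(0, len(hosts), serial):
        batch = hosts[start:start + serial]
        # every host in the batch is attempted, like a parallel batch
        failed = [h for h in batch if not upgrade(h)]
        attempted.extend(batch)
        if failed and ask_abort(failed):
            break  # abort: remaining batches are never started
        # no failures, or the user chose to continue: go to the next batch
    return attempted


hosts = [f"host{i}" for i in range(1, 21)]
done = rolling_upgrade(hosts, 3,
                       upgrade=lambda h: h != "host7",
                       ask_abort=lambda failed: True)
print(done[-1], len(done))  # host9 9
```

Here host7 fails in the third batch, so the prompt fires once and host10 through host20 are never touched; if no host fails, `ask_abort` is never called at all.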

I hope this makes sense; I can elaborate if needed.

Thank you,

You can set the attribute “max_fail_percentage” on a 1.3 playbook to control how many failures are tolerated within a single batch of a rolling update.

There is also “any_errors_fatal: True”, which is not specific to the “serial” keyword and fails the play as soon as exactly one host fails.
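As a rough illustration, the per-batch `max_fail_percentage` check can be written out using the comparison that appears in the traceback further down this thread. This is a simplified sketch assuming `hosts_count` is the size of the current batch and `failed` is the number of hosts that dropped out of it; the function name is hypothetical, not part of Ansible's API.

```python
def max_fail_percentage_aborts(hosts_count: int, failed: int,
                               max_fail_pct: float) -> bool:
    """True if the batch should be aborted under max_fail_percentage.

    Same comparison as the quoted Ansible 1.3 line:
    (hosts_count - len(host_list)) > int((play.max_fail_pct)/100.0 * hosts_count)
    """
    allowed = int(max_fail_pct / 100.0 * hosts_count)  # int() drops the fraction
    return failed > allowed


# serial: 3 with max_fail_percentage: 34 -> int(1.02) == 1 failure tolerated
print(max_fail_percentage_aborts(3, 1, 34), max_fail_percentage_aborts(3, 2, 34))  # False True
# serial: 2 with max_fail_percentage: 0 -> any failure aborts
print(max_fail_percentage_aborts(2, 1, 0))  # True
```

Note the strict comparison and the truncation: with serial: 3, a value of 33 still tolerates no failures, while 34 tolerates one. any_errors_fatal, by contrast, ends the play as soon as a single host in the batch has failed.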

Thank you Michael,

That would do what I am aiming at.

I just tested this and I get an exception:

With “any_errors_fatal: True”:

TASK: [fail FAIL] *************************************************************
skipping: [hostname.com]
failed: [hostname.com] => {"failed": true}
msg: Failed as requested from task
Traceback (most recent call last):
  File "/usr/local/bin/ansible-playbook", line 268, in <module>
    sys.exit(main(sys.argv[1:]))
  File "/usr/local/bin/ansible-playbook", line 208, in main
    pb.run()
  File "/Library/Python/2.7/site-packages/ansible/playbook/__init__.py", line 262, in run
    if not self._run_play(play):
  File "/Library/Python/2.7/site-packages/ansible/playbook/__init__.py", line 580, in _run_play
    if (hosts_count - len(host_list)) > int((play.max_fail_pct)/100.0 * hosts_count):
TypeError: object of type 'NoneType' has no len()

With “max_fail_percentage: 0” it works OK.

I have “serial: 2” in both cases.
Let me know if I can be of any help tracking down this issue.

Regards,
Rumen Telbizov

Hi Rumen,

That is a bug. When I added the feature (max_fail_percentage), I did not test it with “any_errors_fatal” (sorry, I had never used that before).

An easy fix would be to skip the evaluation if host_list is None. Thoughts?
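A minimal sketch of that suggestion, with the two checks lifted out of the playbook runner into a hypothetical stand-alone function `remaining_hosts`. It only illustrates skipping the percentage evaluation once `host_list` is None; it is not necessarily the change that was eventually checked in.

```python
from typing import List, Optional


def remaining_hosts(host_list: List[str], hosts_count: int,
                    any_errors_fatal: bool,
                    max_fail_pct: float) -> Optional[List[str]]:
    """Return the hosts that may continue, or None to end the play."""
    result: Optional[List[str]] = host_list

    # any_errors_fatal: a single failed host ends the play.
    if any_errors_fatal and len(host_list) < hosts_count:
        result = None

    # Only evaluate the failure threshold if the play is still alive,
    # so len() is never called on None.
    if result is not None:
        failed = hosts_count - len(result)
        if failed > int(max_fail_pct / 100.0 * hosts_count):
            result = None

    return result


print(remaining_hosts(["web1"], 2, any_errors_fatal=True, max_fail_pct=0))  # None
```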

Rumen:

Would you be willing to help test the fix when both max_fail_percentage and any_errors_fatal are used?

I just checked in a possible fix (BUT I have NOT yet tested it, as I first need to set up an environment to test it). If you already have a test environment up, can you please test it and see whether it fixes the issue?

Git repo:
https://github.com/kavink/ansible

Fix commit:

https://github.com/kavink/ansible/commit/a075ec9831ba1096af41c1d8d20eaf1b8e2909f7

Hi Kavin,

Thanks. The fix should hopefully address your issue. Here is my analysis (others can correct me):

What is happening is that when you set any_errors_fatal to true, it sets host_list to None, and the very next check then fails because it calls len() on it.


    if task.any_errors_fatal and len(host_list) < hosts_count:
        host_list = None

    # If threshold for max nodes failed is exceeded, bail out.
    # (If the check above fired, host_list is None here and len() raises TypeError.)
    if (hosts_count - len(host_list)) > int((play.max_fail_pct)/100.0 * hosts_count):
        host_list = None
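The failure can be reproduced outside Ansible with a hypothetical function that keeps the two checks in the same unguarded order as the snippet above. With any_errors_fatal set and one of two hosts failed, the second check ends up calling len(None).

```python
def unguarded_checks(host_list, hosts_count, any_errors_fatal, max_fail_pct):
    # Same order as the quoted snippet: nothing protects the second check.
    if any_errors_fatal and len(host_list) < hosts_count:
        host_list = None
    if (hosts_count - len(host_list)) > int(max_fail_pct / 100.0 * hosts_count):
        host_list = None
    return host_list


try:
    unguarded_checks(["web1"], 2, True, 0)  # one of two hosts failed
    error = None
except TypeError as exc:
    error = str(exc)

print(error)  # object of type 'NoneType' has no len()
```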

Hey Kavin, list,

I just tested your proposed fix and it is indeed working fine for me.

I’d suggest that you incorporate that fix in devel right now.

Michael, now that the 1.3 release has been delayed by a week, is this fix going to be part of it?

Thanks,
Rumen Telbizov

If we get a pull request on github, I can get this merged in before next Friday.

That would be awesome. It would be nice if this bug didn’t make it into the release.

I believe Kavin can create the pull request, but in case he doesn’t, I think you can incorporate his fix manually. Here’s the commit:

https://github.com/kavink/ansible/commit/a075ec9831ba1096af41c1d8d20eaf1b8e2909f7

It’s pretty simple (only a few lines changed).

Cheers,
Rumen Telbizov

any_errors_fatal could perhaps be reimplemented by simply setting max_fail_percentage to some epsilon, such as 0.00000000000001.

Maybe not; just throwing that out there.
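One quick way to check the epsilon idea is to compare the two conditions quoted earlier in the thread for every batch size and failure count. This sketch models only those two expressions, not where Ansible evaluates them; the function names are hypothetical.

```python
EPSILON_PCT = 0.00000000000001  # the value suggested above


def fatal_via_any_errors(hosts_count: int, failed: int) -> bool:
    # any_errors_fatal check: len(host_list) < hosts_count
    remaining = hosts_count - failed
    return remaining < hosts_count


def fatal_via_threshold(hosts_count: int, failed: int, pct: float) -> bool:
    # max_fail_percentage check: failed > int(pct/100.0 * hosts_count)
    return failed > int(pct / 100.0 * hosts_count)


# For batch sizes 1..500 and every possible failure count, the rules agree.
same = all(
    fatal_via_any_errors(n, f) == fatal_via_threshold(n, f, EPSILON_PCT)
    for n in range(1, 501)
    for f in range(0, n + 1)
)
print(same)  # True
```

Because int() truncates, max_fail_percentage: 0 yields the same zero threshold under this expression, so at least for this comparison the epsilon and 0 behave identically.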

Just sent a pull request.

https://github.com/ansible/ansible/pull/4058