Allow some %age of failures before quitting the playbook

Kavin_Kankeshwar · July 20, 2013, 4:23am

Hi,

I am trying to work out how best to quit a playbook run if some percentage of nodes have failed, rather than proceed with upgrading the remaining hosts.

Use case: upgrading a Java webapp in a rolling fashion, but if some percentage of nodes have failed, quit the playbook and do not go on to the other nodes.

Any recommendations?

Regards,

Michael_DeHaan2 · July 20, 2013, 3:43pm

Ansible is somewhat smart about a rolling update batch size already.

So if you have 200 nodes and set “serial: 10” it will quit if all 10 in the block fail.

It sounds like you want to quit the rolling update even when only part of a batch fails, say once more than 10% fail. I’m open to this idea.

You should file a feature request on GitHub to make this configurable.

Kavin_Kankeshwar · July 21, 2013, 1:45am

Thanks, yes. But I was not sure how it would behave with “serial: 10” if 5 servers fail in each batch.

So basically yes: of the reachable nodes, at most 10% may fail. If 10% of the first 50 nodes fail, stop the playbook and trigger some notification for investigation.

I created a feature request on GitHub. Just for my understanding: if this were to be implemented, would the changes be in lib/ansible/playbook/__init__.py, around line 545, or somewhere else?

Regards,

Michael_DeHaan2 · July 21, 2013, 4:22pm

Right, playbook batch handling is around there, exactly.

Currently it will go on with the update if the batch has some successes, so your RFE is accurate.
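
The behavior described above can be sketched in a few lines of Python. This is an illustration only, not Ansible's code: `batches_run_current` and its `failures_per_batch` argument (the number of failed hosts in each batch) are hypothetical names, and the rule encoded is simply the one stated in this thread, that the run stops only when every host in a batch fails.

```python
def batches_run_current(serial, failures_per_batch):
    """Return how many batches get executed under the rule described in
    the thread: stop only when every host in a batch has failed.

    serial             -- batch size (every batch is assumed to be full)
    failures_per_batch -- hypothetical input: failed-host count per batch
    """
    for batches_run, failed in enumerate(failures_per_batch, start=1):
        if failed >= serial:
            # The whole batch failed, so the run stops here.
            return batches_run
    # No batch failed completely, so every batch was executed.
    return len(failures_per_batch)


# 200 hosts with serial 10 gives 20 batches.
# Half of every batch failing never stops the run.
half_failing = batches_run_current(10, [5] * 20)
# A fully failed third batch stops the run after 3 batches.
third_batch_dead = batches_run_current(10, [0, 0, 10] + [0] * 17)
```

With these inputs `half_failing` is 20 and `third_batch_dead` is 3, which is the gap the feature request is about: 100 of 200 hosts can fail without the run stopping.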

Kavin_Kankeshwar · July 23, 2013, 11:22pm

I attempted to add this check: commit link. I did not spend much time on it and don't know the internals too well.

What this is trying to do is: if you set max_fail_pct: 10 and have 100 nodes in your inventory, with serial set to 10, then if more than 10 nodes have failed in the first 2 batches, it will bail out and not go on to the third batch.

I could not test it at a larger scale; I only tried it with max_fail_pct: 25, 4 nodes and serial: 2.

Let me know if I should send a pull request or make other changes.
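
A minimal Python sketch of the check as described in this post, where the threshold is a percentage of the total inventory and failures accumulate across batches. It is not the linked commit: `batches_run_cumulative` and `failures_per_batch` are hypothetical names, and "more than" is read as a strict comparison.

```python
def batches_run_cumulative(total_hosts, max_fail_pct, failures_per_batch):
    """Return how many batches run before the cumulative check bails out.

    total_hosts        -- number of hosts in the inventory
    max_fail_pct       -- allowed failures as a percentage of total_hosts
    failures_per_batch -- hypothetical input: failed-host count per batch
    """
    allowed = total_hosts * max_fail_pct / 100.0
    failed_so_far = 0
    for batches_run, failed in enumerate(failures_per_batch, start=1):
        failed_so_far += failed
        if failed_so_far > allowed:
            # Too many failures in total: do not start the next batch.
            return batches_run
    return len(failures_per_batch)


# 100 nodes, serial 10, max_fail_pct 10: 11 failures over the first two
# batches exceed the 10 allowed, so the third batch never starts.
large_case = batches_run_cumulative(100, 10, [6, 5] + [0] * 8)
# The small test from the post: 4 nodes, serial 2, max_fail_pct 25.
small_case = batches_run_cumulative(4, 25, [2, 0])
```

Here `large_case` is 2 and `small_case` is 1. In the small case one failed node (exactly 25%) would still let both batches run, because the comparison is strict.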

Regards,

Michael_DeHaan · July 24, 2013, 1:06am

IMHO, the percentage should be a percentage of each batch size, rather than the total.

Seems possibly easier to follow, and would not require adjusting so much?

–Michael
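
For comparison, the per-batch variant suggested here could look roughly like this in Python. Again a sketch under assumptions: `batches_run_per_batch` and `failures_per_batch` are hypothetical names, every batch is assumed to contain exactly `serial` hosts, and the run is assumed to stop when the failed share of a single batch is strictly greater than the limit.

```python
def batches_run_per_batch(serial, max_fail_pct, failures_per_batch):
    """Return how many batches run when the limit applies to each batch.

    serial             -- batch size (every batch is assumed to be full)
    max_fail_pct       -- allowed failures as a percentage of one batch
    failures_per_batch -- hypothetical input: failed-host count per batch
    """
    for batches_run, failed in enumerate(failures_per_batch, start=1):
        if failed * 100.0 / serial > max_fail_pct:
            # This batch alone exceeded the limit, so stop here.
            return batches_run
    return len(failures_per_batch)


# serial 10, max_fail_pct 10: one failure per batch (10%) is tolerated,
# two failures in the third batch (20%) stop the run.
stopped_at = batches_run_per_batch(10, 10, [1, 1, 2, 0])
```

Here `stopped_at` is 3. No running total has to be carried between batches, which may be what makes this version easier to follow.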

Kavin_Kankeshwar · July 24, 2013, 2:46am

Yes, thanks, that makes sense. But for my understanding: is it ideal for a rolling update to keep going, removing failed servers from the list until all of them fail, or is it better to fail by default (i.e. max_fail_pct: 0) unless ignore_errors: yes is given explicitly?

The current behavior (v1.2.2) is like max_fail_pct: 100.

Regards,

Michael_DeHaan2 · July 24, 2013, 3:01pm

I think the choice of this should be up to the organization.

Right now, if you are doing a rolling update on 200 hosts with a block size of 10 and a whole block fails, it will not continue. That DOES catch fatal playbook errors before going on to another block, while still allowing for one server to be down.

I think it should be up to the org.

Good thing is this proposal does that.
