Allow some percentage of failures before quitting the playbook

Hi,

I am trying to work out the best way to stop a playbook run if some percentage of nodes have failed, rather than proceeding to upgrade the remaining hosts.

Use case: upgrading a Java webapp in a rolling fashion. If some percentage of nodes have failed, the playbook should quit and not move on to the other nodes.

Any recommendations?

Regards,

Ansible is somewhat smart about a rolling update batch size already.

So if you have 200 nodes and set “serial: 10” it will quit if all 10 in the block fail.

It sounds like you want to quit the rolling update when only part of a batch fails, say 10%, rather than all of it. I’m open to this idea.

You should file a feature request on GitHub to make this configurable.

Thanks, yes. But I was not sure how it would behave with “serial: 10” if 5 servers fail in each batch.

So basically, yes: of the reachable nodes, at most 10% may fail. If 10% of the first 50 nodes fail, stop the playbook and trigger a notification for investigation.

I created a feature request on GitHub. Just for my understanding: if this were to be implemented, would the changes be in lib/ansible/playbook/__init__.py, around line 545, or somewhere else?

Regards,

Right, playbook batch handling is around there, exactly.

Currently it will go on with the update if the batch has some successes, so your RFE is accurate.

I attempted to add this check: commit link. I did not spend much time on it, and I don't know the internals too well.

What it tries to do is this: if you set max_fail_pct: 10 and have 100 nodes in your inventory with serial: 10, then if more than 10 nodes have failed within the first two batches, it will bail out and not go on to the third batch.
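To make the arithmetic concrete, here is a minimal Python sketch of a check against the total host count. It is not the code from the commit and not Ansible's implementation; the function name run_batches_total_pct and the run_on_host callback are hypothetical, and max_fail_pct is the name proposed in this thread.

```python
def run_batches_total_pct(hosts, serial, max_fail_pct, run_on_host):
    """Hypothetical sketch: abort once cumulative failures exceed
    max_fail_pct percent of ALL hosts in the inventory."""
    total = len(hosts)
    failed = []
    batches_run = 0
    for start in range(0, total, serial):
        batch = hosts[start:start + serial]
        batches_run += 1
        for host in batch:
            # run_on_host is an assumed callback: True on success, False on failure
            if not run_on_host(host):
                failed.append(host)
        # Integer comparison avoids floating point: failed/total > pct/100
        if len(failed) * 100 > max_fail_pct * total:
            return {"aborted": True, "batches_run": batches_run, "failed": failed}
    return {"aborted": False, "batches_run": batches_run, "failed": failed}


if __name__ == "__main__":
    hosts = ["node%03d" % i for i in range(100)]
    # Assume 6 hosts fail in each of the first two batches (12 in total)
    bad = set(hosts[0:6] + hosts[10:16])
    result = run_batches_total_pct(hosts, 10, 10, lambda h: h not in bad)
    print(result["aborted"], result["batches_run"], len(result["failed"]))
```

With 100 hosts, serial 10 and a 10% limit, 6 failures in the first batch stay under the limit, while 12 failures after the second batch exceed it, so the third batch never runs.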

I could not test it at a larger scale; I only tried it with max_fail_pct: 25, 4 nodes and serial: 2.

Let me know whether I should send a pull request or make other changes.

Regards,

IMHO, the percentage should be a percentage of each batch size, rather than the total.

Seems possibly easier to follow, and would not require adjusting so much?

–Michael
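The per-batch variant suggested above could be sketched roughly as follows. This is an illustration only, under the same assumptions as before: run_batches_batch_pct and run_on_host are hypothetical names, and nothing here is taken from Ansible's source.

```python
def run_batches_batch_pct(hosts, serial, max_fail_pct, run_on_host):
    """Hypothetical sketch: abort once failures within a single batch
    exceed max_fail_pct percent of that batch's size."""
    failed = []
    batches_run = 0
    for start in range(0, len(hosts), serial):
        batch = hosts[start:start + serial]
        batches_run += 1
        # run_on_host is an assumed callback: True on success, False on failure
        batch_failed = [h for h in batch if not run_on_host(h)]
        failed.extend(batch_failed)
        # Only the current batch matters; earlier batches are not counted again
        if len(batch_failed) * 100 > max_fail_pct * len(batch):
            return {"aborted": True, "batches_run": batches_run, "failed": failed}
    return {"aborted": False, "batches_run": batches_run, "failed": failed}


if __name__ == "__main__":
    hosts = ["node%03d" % i for i in range(100)]
    # Assume one failure in the first batch and two in the second
    bad = {hosts[3], hosts[14], hosts[15]}
    result = run_batches_batch_pct(hosts, 10, 10, lambda h: h not in bad)
    print(result["aborted"], result["batches_run"], len(result["failed"]))
```

Because the threshold is evaluated against each batch alone, there is no running total to carry between batches, and the limit means the same thing whether the inventory has 20 hosts or 200.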

Yes, thanks, that makes sense. For my understanding, though: in a rolling update, is it ideal to keep going and keep removing failed servers from the list until all of them fail, or is it better to fail by default (i.e. max_fail_pct: 0) unless ignore_errors: yes is given explicitly?

The current behavior (v1.2.2) is effectively max_fail_pct: 100.

Regards,

I think the choice of this should be up to the organization.

Right now, if you are doing a rolling update on 200 hosts with a block size of 10 and an entire block fails, it will not continue. That DOES catch fatal playbook errors before going on to another block, but still allows for one server to be down.
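For comparison, the behavior described here (stop only when every host in a block fails) can be sketched in the same style. This reflects only what is said in this thread, not Ansible's actual code; run_batches_all_fail and run_on_host are hypothetical names.

```python
def run_batches_all_fail(hosts, serial, run_on_host):
    """Hypothetical sketch: abort only when EVERY host in a block fails;
    partial failures are recorded and the rollout continues."""
    failed = []
    batches_run = 0
    for start in range(0, len(hosts), serial):
        batch = hosts[start:start + serial]
        batches_run += 1
        # run_on_host is an assumed callback: True on success, False on failure
        batch_failed = [h for h in batch if not run_on_host(h)]
        failed.extend(batch_failed)
        if len(batch_failed) == len(batch):
            return {"aborted": True, "batches_run": batches_run, "failed": failed}
    return {"aborted": False, "batches_run": batches_run, "failed": failed}


if __name__ == "__main__":
    hosts = ["node%03d" % i for i in range(200)]
    # Assume one server is down in the first block and the whole third block fails
    bad = {hosts[0]} | set(hosts[20:30])
    result = run_batches_all_fail(hosts, 10, lambda h: h not in bad)
    print(result["aborted"], result["batches_run"], len(result["failed"]))
```

A single down server in the first block does not stop the run, but a block in which all 10 hosts fail does.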

I think it should be up to the org.

The good thing is that this proposal does that.