Is it possible to avoid aborting a play if all hosts in a serial batch fail?

So, for example, I have a playbook:

```yaml
- hosts: some_hosts
  serial: 50
  roles:
    - some_role

# ... some other play with different hosts
```

and for some reason, all 50 hosts in a batch fail. I want to continue running the play on the rest of the hosts in the some_hosts group, and continue on to the rest of the plays in the playbook.

I’ve tried adding max_fail_percentage: 100 but I suppose it was too much to hope that that would be a way to hack around the issue.
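
In case it matters, this is roughly where I put it (same placeholder names as the sketch above):

```yaml
- hosts: some_hosts
  serial: 50
  max_fail_percentage: 100
  roles:
    - some_role
```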

Anyone know of a way to do this?

That is because serial is specifically set to kill the play if all hosts
in a batch fail; max_fail_percentage is calculated on the serial
batch, not the full list of hosts for the play.

I would ask, why are you using serial if you don't want the feature?

short answer: I’m trying to limit the number of hosts being processed at one time, to work around an issue where I run out of memory parsing output.

isn’t the feature of serial the ability to run through hosts in batches, not the behavior of killing a playbook run if some of the hosts fail?

also, the documentation specifically says that max_fail_percentage needs to be exceeded. you can’t exceed 100%, so I hoped that would work.

I suppose this behavior is nonsensical to me. If I was baking cookies and got halfway through the dough when I completely burned one batch, I wouldn’t just toss out the rest of the cookie dough and then refuse to make dinner.

If you are doing it for memory issues, I believe you just want --forks
50 instead.
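
For example (the playbook name here is just a placeholder):

```
ansible-playbook site.yml --forks 50
```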

isn’t the default forks=5? it doesn’t seem like increasing the number of parallel processes would address the issue of running out of memory while parsing text output from hundreds of hosts.

Yes, that is the default; not sure how serial 50 helps with the memory
then, as Ansible always uses the lower of the two.

serial=50 means that instead of trying to parse and hold in memory text output from over 500 hosts, you only have to do it for 50 at a time.

unless the documentation is severely confusing, forks and serial are not the same thing. forks is the number of hosts to run on in parallel, serial is how many hosts to put through the play at a time until the list of hosts is exhausted.

Yes, but if you set --forks 500 and serial 50, Ansible will only fork
on the lower number. This is sometimes used as a fork limiter
(incorrectly), which is what I assumed you were doing.

Serial will batch the hosts to run through the play, but that should not
create less/more memory consumption unless it's bringing down the
number of forks.
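
For example (sketch only; the forks value would come from the command line or ansible.cfg):

```yaml
# ansible-playbook site.yml --forks 500
- hosts: some_hosts
  serial: 50    # batches of 50, so at most 50 forks are actually used
  roles:
    - some_role
```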

so let’s pretend the task list is (sketched below):

  1. ssh to the host and execute a command that returns a large amount of text
  2. parse that text into json
  3. send the json to an API
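
in Ansible terms, something roughly like this (module choices, the command, and the API endpoint are all made up for illustration):

```yaml
- hosts: some_hosts
  serial: 50
  tasks:
    # 1. run a command that returns a large amount of text
    - name: collect raw output
      command: /usr/local/bin/big_report    # hypothetical command
      register: raw_output

    # 2. parse that text into structured data (assuming here it's JSON-shaped)
    - name: parse output
      set_fact:
        parsed: "{{ raw_output.stdout | from_json }}"

    # 3. send the result to an API
    - name: send to API
      uri:
        url: https://example.com/api/ingest    # placeholder endpoint
        method: POST
        body_format: json
        body: "{{ parsed }}"
```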

consider the difference between holding 50 objects that each take up 5 mb and possibly over 500 objects that each take up 5 mb. that’s why I’m trying to use serial - to limit the amount of memory I have to use at each step.

because I love metaphors:
I have to carry 500 bags from point A, put something in them at point B, and drop them off at point C. I’m not strong enough to carry all 500 filled bags at once, but I can string 10 on my arms, so I carry 10 at a time - I split them into batches. If at some point all 10 bags I’m carrying break, I’m not going to give up and leave the rest of the bags unfilled and untransported. that’s how I want to use serial: to limit the weight my server has to carry at any given time.

That assumes Ansible does not hold registered data until the end of the
run, but instead clears it when the host is done in the play; that is an
incorrect assumption.

that’s a good point, and it indicates that I’m probably barking up the wrong tree with serial and I just need to increase the memory on the VM.

it doesn’t answer why there’s no way to set a playbook to continue even if a serial batch fails, but I suppose that’s irrelevant to me now.

This is a good metaphor, and I’ve never understood why serial works this way. We’ve repeatedly had situations where we want to do things on a few hosts at a time, and continue even if one batch of them fails – especially if the batch size is 1 – and there’s just no way to do that.

I understand the use cases for wanting to fail if all your batches fail: To stick with your metaphor, maybe all ten bags in your batch broke, you stumbled, and sprained your ankle. At that point, you do want to stop, and not hurt yourself more.

That should be a decision that you can make on a case-by-case basis, though; saying “any failure == stop the whole play” seems obviously wrong. (And there are lots of other places where you can control how much failure == stop the whole play; this is just a weird exception to that principle.)

Oops, I misspoke here: I meant "the use cases for wanting to fail the whole play if all the hosts in one batch fail."

To switch back to the cookies metaphor: Sometimes that’s what you want (your oven caught fire); sometimes it’s not (you burned one batch).