Is it possible to avoid aborting a play if all hosts in a serial batch fail?

So, for example, I have a playbook:

```yaml
- hosts: some_hosts
  serial: 50
  roles:
    - some_role

# ... some other play with different hosts
```

and for some reason, all 50 hosts in a batch fail. I want to continue running the play on the rest of the hosts in the some_hosts group, and continue on to the rest of the plays in the playbook.

I’ve tried adding max_fail_percentage: 100 but I suppose it was too much to hope that that would be a way to hack around the issue.
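
In case it matters, this is roughly where I put it (same placeholder names as the sketch above):

```yaml
- hosts: some_hosts
  serial: 50
  max_fail_percentage: 100
  roles:
    - some_role
```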

Anyone know of a way to do this?

That is because serial is specifically set to kill the play if all hosts
in a batch fail; max_fail_percentage is calculated on the serial
batch, not the full list of hosts for the play.

I would ask, why are you using serial if you don't want the feature?

short answer: I’m trying to limit the number of hosts being processed at one time, to work around an issue where I run out of memory parsing output.

isn’t the feature of serial the ability to run through hosts in batches, not the behavior of killing a playbook run if some of the hosts fail?

also, the documentation specifically says that max_fail_percentage needs to be exceeded. you can’t exceed 100%, so I hoped that would work.

I suppose this behavior is nonsensical to me. If I was baking cookies and got halfway through the dough when I completely burned one batch, I wouldn’t just toss out the rest of the cookie dough and then refuse to make dinner.

If you are doing it for memory issues, I believe you just want --forks
50 instead.
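
For example (the playbook name here is just a placeholder):

```
ansible-playbook site.yml --forks 50
```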

isn’t the default forks=5? it doesn’t seem like increasing the number of parallel processes would address the issue of running out of memory while parsing text output from hundreds of hosts.

Yes, that is the default; not sure how serial 50 helps with the memory
then, as Ansible always uses the lower of the two.

serial=50 means that instead of trying to parse and hold in memory text output from over 500 hosts, you only have to do it for 50 at a time.

unless the documentation is severely confusing, forks and serial are not the same thing. forks is the number of hosts to run on in parallel, serial is how many hosts to put through the play at a time until the list of hosts is exhausted.

Yes, but if you set --forks 500 and serial 50, Ansible will only fork
on the lower number. This is sometimes used as a fork limiter
(incorrectly), which is what I assumed you were doing.

Serial will batch the hosts to run through the play, but that should not
create less/more memory consumption unless it's bringing down the
number of forks.
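
For example (sketch only; the forks value would come from the command line or ansible.cfg):

```yaml
# ansible-playbook site.yml --forks 500
- hosts: some_hosts
  serial: 50    # batches of 50, so at most 50 forks are actually used
  roles:
    - some_role
```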

so let’s pretend the task list is (sketched below):

  1. ssh to the host and execute a command that returns a large amount of text
  2. parse that text into json
  3. send the json to an API
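
in Ansible terms, something roughly like this (module choices, the command, and the API endpoint are all made up for illustration):

```yaml
- hosts: some_hosts
  serial: 50
  tasks:
    # 1. run a command that returns a large amount of text
    - name: collect raw output
      command: /usr/local/bin/big_report    # hypothetical command
      register: raw_output

    # 2. parse that text into structured data (assuming here it's JSON-shaped)
    - name: parse output
      set_fact:
        parsed: "{{ raw_output.stdout | from_json }}"

    # 3. send the result to an API
    - name: send to API
      uri:
        url: https://example.com/api/ingest    # placeholder endpoint
        method: POST
        body_format: json
        body: "{{ parsed }}"
```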

consider the difference between holding 50 objects that each take up 5 mb and possibly over 500 objects that each take up 5 mb. that’s why I’m trying to use serial - to limit the amount of memory I have to use at each step.

because I love metaphors:
I have to carry 500 bags from point A, put something in them at point B, and drop them off at point C. I’m not strong enough to carry all 500 filled bags at once, but I can string 10 on my arms, so I carry 10 at a time - I split them into batches. If at some point all 10 bags I’m carrying break, I’m not going to give up and leave the rest of the bags unfilled and untransported. that’s how I want to use serial: to limit the weight my server has to carry at any given time.

That assumes Ansible does not hold registered data until the end of the
run, but instead clears it when the host is done in the play; that is an
incorrect assumption.

that’s a good point, and it indicates that I’m probably barking up the wrong tree with serial and I just need to increase the memory on the VM.

it doesn’t answer why there’s no way to set a playbook to continue even if a serial batch fails, but I suppose that’s irrelevant to me now.

This is a good metaphor, and I’ve never understood why serial works this way. We’ve repeatedly had situations where we want to do things on a few hosts at a time, and continue even if one batch of them fails – especially if the batch size is 1 – and there’s just no way to do that.

I understand the use cases for wanting to fail if all your batches fail: To stick with your metaphor, maybe all ten bags in your batch broke, you stumbled, and sprained your ankle. At that point, you do want to stop, and not hurt yourself more.

That should be a decision that you can make on a case-by-case basis, though; saying “any failure == stop the whole play” seems obviously wrong. (And there are lots of other places where you can control how much failure == stop the whole play; this is just a weird exception to that principle.)

Oops, I misspoke here: I meant "the use cases for wanting to fail the whole play if all the hosts in one batch fail."

To switch back to the cookies metaphor: Sometimes that’s what you want (your oven caught fire); sometimes it’s not (you burned one batch).