and for some reason, all 50 hosts in a batch fail. I want to continue running the play on the rest of the hosts in the some_hosts group, and continue on to the rest of the plays in the playbook.
I’ve tried adding max_fail_percentage: 100 but I suppose it was too much to hope that that would be a way to hack around the issue.
That is because serial is specifically set to kill the play if all hosts
in a batch fail. max_fail_percentage is calculated on the serial
batch, not the full list of hosts for the play.
I would ask, why are you using serial if you don't want the feature?
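For anyone following along, here is a toy Python model of the behaviour described in this reply. It is only a sketch of what is being claimed in the thread, not Ansible's actual code: `simulate_play` and its parameters are made-up names, and the rule that a fully failed batch stops the play is taken from the explanation above.

```python
def simulate_play(hosts, serial, failing_hosts, max_fail_percentage=None):
    """Toy model of a play run in batches; returns the hosts that were attempted.

    Hypothetical helper for illustration only, not part of Ansible.
    """
    attempted = []
    for start in range(0, len(hosts), serial):
        batch = hosts[start:start + serial]
        attempted.extend(batch)
        failed = [h for h in batch if h in failing_hosts]

        # The percentage is taken over the current batch, not over all hosts.
        failed_pct = 100.0 * len(failed) / len(batch)

        # Assumed rule from the thread: a batch where every host fails
        # stops the play, whatever max_fail_percentage is set to.
        if len(failed) == len(batch):
            break

        # The threshold has to be exceeded, not merely reached.
        if max_fail_percentage is not None and failed_pct > max_fail_percentage:
            break
    return attempted


hosts = [f"host{i:03d}" for i in range(500)]

# Second batch of 50 fails completely: the remaining 400 hosts never run.
print(len(simulate_play(hosts, 50, set(hosts[50:100]), 100)))  # 100

# Only half of the second batch fails: 50% does not exceed 100%, so the play goes on.
print(len(simulate_play(hosts, 50, set(hosts[50:75]), 100)))   # 500
```

In this model, setting max_fail_percentage to 100 can never trigger the percentage check, but it also cannot rescue a batch in which every host failed.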
also, the documentation specifically says that max_fail_percentage needs to be exceeded. you can’t exceed 100%, so I hoped that would work.
I suppose this behavior is nonsensical to me. If I was baking cookies and got halfway through the dough when I completely burned one batch, I wouldn’t just toss out the rest of the cookie dough and then refuse to make dinner.
isn’t the default forks=5? it doesn’t seem like increasing the number of parallel processes would address the issue of running out of memory while parsing text output from hundreds of hosts
serial=50 means that instead of trying to parse and hold in memory text output from over 500 hosts, you only have to do it for 50 at a time.
unless the documentation is severely confusing, forks and serial are not the same thing. forks is the number of hosts to run in parallel, serial is how many hosts to put through the play at a time until the list of hosts is exhausted
Yes, but if you set --forks 500 and serial 50, Ansible will only fork
up to the lower number. Serial is sometimes (incorrectly) used as a fork
limiter, which is what I assumed you were doing.
Serial will batch the hosts that run through the play, but that should not
reduce or increase memory consumption unless it's bringing down the
number of forks.
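The "lower number" point can be written down in a couple of lines of Python. This is only a restatement of the claim made in this reply, with a made-up function name, not a description of Ansible internals.

```python
def concurrent_hosts(forks, serial):
    """Hosts worked on at the same time, per the claim in this thread.

    Hypothetical helper: parallelism is capped by whichever of the
    two settings is lower.
    """
    return min(forks, serial)


print(concurrent_hosts(500, 50))  # 50: serial acts as the limiter
print(concurrent_hosts(5, 50))    # 5: with the default forks=5, serial=50 limits nothing
```

So with the default of 5 forks, serial=50 changes how hosts are grouped into batches, not how many run in parallel.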
1. ssh to the host and execute a command that returns a large amount of text
2. parse that text into JSON
3. send the JSON to an API
consider the difference between holding 50 objects that each take up 5 MB and possibly over 500 objects that each take up 5 MB. that’s why I’m trying to use serial - to limit the amount of memory I have to use at each step.
because I love metaphors:
I have to carry 500 bags from point A, put something in them at point B, and drop them off at point C. I’m not strong enough to carry all 500 filled bags at once, but I can string 10 on my arms, so I carry 10 at a time - I split them into batches. If at one point, all 10 bags that I’m carrying broke, I’m not going to give up and leave the rest of the bags unfilled and untransported. that’s how I want to use serial: to limit the weight my server has to carry at any given time.
That assumes Ansible does not hold registered data until the end of the
run, but instead clears it when the host is done in the play. That is an
incorrect assumption.
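To put numbers on the two assumptions in this exchange, here is a back-of-the-envelope Python sketch. The 5 MB per host and the 500-host total are the illustrative figures from the earlier post, and `peak_memory_mb` is a made-up helper, not anything Ansible provides.

```python
OBJECT_MB = 5  # illustrative size of one host's parsed output, from the post above


def peak_memory_mb(total_hosts, serial, released_per_batch):
    """Rough peak memory for registered results under each assumption.

    released_per_batch=True  -> data is dropped once a batch finishes
    released_per_batch=False -> data is kept until the end of the run
    """
    held = min(serial, total_hosts) if released_per_batch else total_hosts
    return held * OBJECT_MB


print(peak_memory_mb(500, 50, released_per_batch=True))   # 250
print(peak_memory_mb(500, 50, released_per_batch=False))  # 2500
```

If registered data really is kept until the end of the run, as this reply says, the batch size drops out of the calculation and serial does not lower the peak.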
This is a good metaphor, and I’ve never understood why serial works this way. We’ve repeatedly had situations where we want to do things on a few hosts at a time, and continue even if one batch of them fails – especially if the batch size is 1 – and there’s just no way to do that.
I understand the use cases for wanting to fail if all your batches fail: To stick with your metaphor, maybe all ten bags in your batch broke, you stumbled, and sprained your ankle. At that point, you do want to stop, and not hurt yourself more.
That should be a decision that you can make on a case-by-case basis, though; saying “any failure == stop the whole play” seems obviously wrong. (And there are lots of other places where you can control how much failure == stop the whole play; this is just a weird exception to that principle.)