Workaround for serial: 1 failures stopping the entire playbook?

Hello,

This has probably been addressed 1000 times before, but I can’t seem to find an answer (if this is even possible). When running a play with serial: 1, I’d like a task failure that is fatal for one node to not be fatal for the remaining nodes that have not run yet. Ansible should skip the rest of the play for just that one node and move on to the next node.

I have a scenario where I want to perform OS patching on a large-ish group of servers in a Hadoop cluster with no downtime to the cluster itself. So I am using serial: 1 when performing the patching tasks for each node: put it in maintenance mode, take it out of the cluster, patch, reboot, re-join the cluster, and do some basic health checks.

However, if any one of these tasks fails in serial: 1 mode, Ansible considers the entire play failed and will not run against any remaining nodes. Since this is a large cluster (50 nodes), a failure on a single node isn’t a showstopper and shouldn’t stop the rest of the nodes from being patched.

I’d like to know if there is a way around Ansible stopping an entire play for all nodes if a single node fails when running in serial: 1. From what I’ve found on Google, there doesn’t seem to be a way to do this short of setting serial: 2 or higher, but I thought I’d ask.

There are several ways. The simplest might be putting the whole thing
in a 'block' with a 'rescue' that always succeeds, so the play will go
on to the next host.
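
For anyone reading later, a minimal sketch of that block/rescue idea might look like the playbook below. The group name hadoop_workers, the dnf-based patch step (which assumes a RHEL-family OS), and the health check script path are all made up for illustration; swap in your own maintenance-mode, patch, and re-join tasks.

```yaml
# Sketch only: the inventory group, package manager, and health check
# script are assumptions, not taken from the original poster's setup.
- name: Rolling OS patch, one node at a time
  hosts: hadoop_workers            # hypothetical inventory group
  serial: 1
  become: true
  tasks:
    - name: Patch a single node without failing the whole play
      block:
        - name: Apply OS updates (assumes a dnf-based distribution)
          ansible.builtin.dnf:
            name: "*"
            state: latest

        - name: Reboot and wait for the node to come back
          ansible.builtin.reboot:

        - name: Basic health check (placeholder command)
          ansible.builtin.command: /usr/local/bin/node_health_check   # hypothetical script
          changed_when: false

      rescue:
        # Runs only if a task in the block fails. Because this task
        # succeeds, the host is not counted as failed and the play
        # moves on to the next host in the serial batch.
        - name: Record the failure and carry on
          ansible.builtin.debug:
            msg: "Patching failed on {{ inventory_hostname }} at task '{{ ansible_failed_task.name }}'"
```

Two caveats with this sketch. Any tasks placed after the block still run for the rescued host, so either keep every per-node step inside the block or end the rescue section with a `meta: end_host` task (available from Ansible 2.8, as far as I know). Also, rescue is triggered by failed tasks only; an unreachable host does not trigger it, so a node that drops off the network can still stop a serial: 1 run.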

Brian,

Thanks for the reply on this. I will definitely test this out in my plays.

Andrew

Hey Andrew, were you able to get anywhere with this? I tried adding a block/rescue without any luck. I’ve been searching all morning for a way to make Ansible move on to the next host in a serial play even if one task on one host fails. I’m thinking it’s not possible.

Rob

It is possible.