Hello,
This has probably been addressed 1000 times before, but I can’t seem to find an answer, or even whether this is possible. When running a play within a playbook with serial: 1, I want a task failure that is fatal for one node to be fatal for that node only. Ansible should skip the rest of the play for just that one node and move on to the next node, instead of also giving up on the remaining nodes that have not run yet.
I have a scenario where I want to perform OS patching on a large-ish group of servers in a Hadoop cluster with no downtime to the cluster itself. So I am using serial: 1 when performing the patching tasks, and for each node the play does the following in order: put it in maintenance mode, take it out of the cluster, patch, reboot, re-join the cluster, and do some basic health checks.
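For reference, the play is shaped roughly like the one below. This is a simplified sketch rather than my real playbook: the group name hadoop_workers is made up, the debug tasks are placeholders for the cluster-specific steps, and the package task assumes a package manager that accepts "*" to mean all installed packages.

```yaml
# Simplified sketch of the patching play described above.
# The group name and the debug placeholders are hypothetical.
- name: Patch cluster nodes one at a time
  hosts: hadoop_workers          # hypothetical inventory group
  serial: 1                      # one node per batch, so the cluster stays up
  become: true
  tasks:
    - name: Put node in maintenance mode and take it out of the cluster (placeholder)
      ansible.builtin.debug:
        msg: "cluster-specific maintenance/decommission step goes here"

    - name: Apply OS updates (assumes the package manager accepts "*")
      ansible.builtin.package:
        name: "*"
        state: latest

    - name: Reboot the node and wait for it to come back
      ansible.builtin.reboot:

    - name: Re-join the cluster (placeholder)
      ansible.builtin.debug:
        msg: "cluster-specific re-join step goes here"

    - name: Basic health checks (placeholder)
      ansible.builtin.debug:
        msg: "cluster-specific health checks go here"
```

With serial: 1, each batch contains exactly one host, so a failure in any of these tasks means every host in the current batch has failed.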
However, if any one of these tasks fails in serial: 1 mode, Ansible considers the entire play failed and will not run against any of the remaining nodes. Since this is a large cluster (50 nodes), a failure on a single node isn’t a showstopper and shouldn’t stop the rest of the nodes from being patched.
I’d like to know if there is a way to keep Ansible from stopping the entire play for all nodes when a single node fails while running with serial: 1. From what I’ve read online, there doesn’t seem to be a way to do this short of setting serial to 2 or higher, but I thought I’d ask.