I’ve been pretty happy running Ansible for a few months now. The one major thorn in my side is failed tasks. Our fleet of VMs is not very large, but it is apparently large enough (or our playbook is long enough) that we hit at least one spurious SSH error (e.g. “SSH Error: mux_client_hello_exchange: write packet: Broken pipe”) or, more rarely, a spurious 500 from a third-party service (e.g. when adding or removing our VMs to/from load balancers via a cloud API).
What’s the best practice for dealing with these kinds of transient failures? It seems to me that something like “sleep X seconds, then retry, up to Y times” would work quite well, but it isn’t obvious to me how to make that happen.
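To make “sleep X seconds, then retry, up to Y times” concrete, here is a minimal Python sketch of the semantics I have in mind. The names `retry`, `attempts`, and `delay` are made up for illustration; this is not something Ansible provides, just a description of the behavior I’d like.

```python
import time


def retry(action, attempts=3, delay=5.0, sleep=time.sleep):
    """Call action(); if it raises, sleep `delay` seconds and try again.

    Gives up after `attempts` total tries and re-raises the last error.
    `sleep` is injectable only so the behavior can be checked without waiting.
    """
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == attempts:
                raise  # out of tries: surface the last error
            sleep(delay)
```

Here Y is `attempts` and X is `delay`; a task that fails once or twice for transient reasons would still succeed overall, while a task that fails every time would still fail the run.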
I’m aware of the wait_for module, but I don’t think it really helps in this situation, since the problem isn’t that a resource is actually missing; it’s just spurious failures.
My understanding of retry files (which could certainly be wrong) is that they merely limit the hosts that are included in the run. I don’t think that will work for me, although perhaps this indicates that my playbook is not set up well. Here is a simplified version of my site.yml:
- name: copy new files to all nodes
  hosts: all
  tasks:
    - include: tasks/deploy_files.yml

- name: configure and deploy backend type foo
  hosts: tag_foo
  roles:
    - foo

- name: configure and deploy backend type bar
  hosts: tag_bar
  roles:
    - bar

- name: configure and deploy backend type baz
  hosts: tag_baz
  roles:
    - baz

# (etc, for 7 total backend types)

- name: clean up old deployments from all nodes
  hosts: all
  tasks:
    - include: tasks/remove_old_deployments.yml
So, given this structure, pretend that the “foo” step went fine, but then some step during one of the “bar” backend deployments failed. Won’t the retry file just contain that single host (assuming we are running “serial: 1” for the play that failed)? If I reran using that file, I might get that “bar” host to deploy correctly, but I would totally miss all of the “baz” hosts and every other backend whose deployment play appears after the “bar” play.
I suppose one option might be to break up this single site.yml into 7 different playbooks, one for each backend type, and then execute them each in order, retrying each one as necessary if any errors occur. Would that be a better setup? That seems to be a bit silly, but maybe I’m wrong on that…
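For what it’s worth, here is a rough Python sketch of what I mean by that split-up approach: run each playbook in order and rerun a failing one a few times before giving up. The playbook file names are hypothetical (one file per play in the site.yml above), and the sketch assumes that ansible-playbook exits with a non-zero status when a run fails and that rerunning a whole playbook is safe for hosts that already succeeded.

```python
import subprocess
import time

# Hypothetical file names: one playbook per play in the current site.yml
# (foo, bar, baz stand in for the 7 backend types).
PLAYBOOKS = [
    "deploy_files.yml",
    "foo.yml",
    "bar.yml",
    "baz.yml",
    "remove_old_deployments.yml",
]


def run_playbook(path):
    """Run one playbook; True if ansible-playbook exited with status 0."""
    return subprocess.run(["ansible-playbook", path]).returncode == 0


def run_in_order(playbooks, attempts=3, delay=10.0,
                 run=run_playbook, sleep=time.sleep):
    """Run playbooks in sequence, rerunning a failed one up to `attempts` times.

    Returns the first playbook that never succeeded, or None if all passed.
    Later playbooks are not started once one has exhausted its attempts.
    """
    for path in playbooks:
        for attempt in range(1, attempts + 1):
            if run(path):
                break  # this playbook succeeded; move on to the next one
            if attempt < attempts:
                sleep(delay)
        else:
            return path  # every attempt failed
    return None
```

Calling `run_in_order(PLAYBOOKS)` would then stand in for a single ansible-playbook run of site.yml, at the cost of rerunning an entire playbook (not just the failed host) whenever anything in it fails.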