Retrying failed tasks

Ian_Rose · August 10, 2015, 7:29pm

Hi all -

I’ve been pretty happy running Ansible for a few months now. The one major thorn in my side is failed tasks. Our fleet of VMs is not very large, but apparently it is large enough (or our playbook is long enough) that we hit at least one spurious SSH error (e.g. “SSH Error: mux_client_hello_exchange: write packet: Broken pipe”) or, more rarely, a spurious 500 from a third-party service (e.g. when adding or removing our VMs to/from load balancers via a cloud API).

What’s the best practice for dealing with these kinds of transient failures? It seems to me that something like “sleep X seconds, then retry, up to Y times” would work quite well, but it isn’t obvious to me how to make that happen.

I’m aware of the wait_for module, but I don’t think that really helps in this situation, since the problem isn’t that a resource is actually missing; it’s just spurious failures.

Any suggestions?

Thanks!

Ian

Brian_Coca · August 10, 2015, 7:37pm

You can pass the .retry file to --limit (--limit @<playbook>.retry) to rerun the plays on just the hosts that failed.
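
For failures that come from the task itself rather than from the connection, the “sleep X seconds, then retry, up to Y times” pattern from the question can be sketched with the task-level until, retries, and delay keywords. This is only a sketch: the script path and the lb_result variable name are hypothetical, and as far as I know this loop only re-runs a task that actually executed and reported a failure, so it would probably not help with a dropped SSH connection.

```yaml
# Minimal sketch; the script path and variable name are hypothetical.
- name: add this VM to the load balancer, retrying on transient errors
  command: /usr/local/bin/register_with_lb {{ inventory_hostname }}
  register: lb_result
  # Re-run the task until the command exits with status 0 ...
  until: lb_result.rc == 0
  # ... allowing up to 5 retries, sleeping 10 seconds between attempts.
  retries: 5
  delay: 10
```

If the retries are exhausted the task still fails, so the host would end up in the .retry file as before.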

Ian_Rose · August 10, 2015, 8:11pm

My understanding of retry files (which could certainly be wrong) is that they merely limit the hosts that are included in the run. I don’t think that will work for me, although perhaps this indicates that my playbook is not set up well. Here is a simplified version of my site.yml:

    - name: copy new files to all nodes
      hosts: all
      tasks:
        - include: tasks/deploy_files.yml

    - name: configure and deploy backend type foo
      hosts: tag_foo
      roles:
        - foo

    - name: configure and deploy backend type bar
      hosts: tag_bar
      roles:
        - bar

    - name: configure and deploy backend type baz
      hosts: tag_baz
      roles:
        - baz

(etc, for 7 total backend types)

    - name: clean up old deployments from all nodes
      hosts: all
      tasks:
        - include: tasks/remove_old_deployments.yml

So, given this structure, pretend that the “foo” step went fine, but then some step during one of the “bar” backend deployments failed. Won’t the retry file just contain that single host (assuming we are running “serial: 1” for the play that failed)? If I reran using that file, I might get that “bar” host to deploy correctly, but I would totally miss all of the “baz” hosts and every other backend whose play appears after the “bar” play.

I suppose one option might be to break up this single site.yml into 7 different playbooks, one for each backend type, and then execute them each in order, retrying each one as necessary if any errors occur. Would that be a better setup? That seems to be a bit silly, but maybe I’m wrong on that…
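
The split described above might look roughly like this for one backend type. The file name bar.yml is hypothetical; the play body is copied from the simplified site.yml earlier in this post. Each such file could then be run on its own with ansible-playbook and re-run if it fails, without touching the plays for the other backends.

```yaml
# bar.yml (hypothetical file name): one backend type per playbook file
- name: configure and deploy backend type bar
  hosts: tag_bar
  roles:
    - bar
```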

Thanks,
Ian

Brian_Coca · August 10, 2015, 8:39pm

I don't know if retry works well with serial.
