I am missing something about playbook orchestration features?

Jonathan1 · September 18, 2012, 6:32pm

Hello everyone,

Our team is trying to use Ansible to help us deploy updates of our web applications. We’ve been using Ansible since v0.4 and have a lot of Ansible infrastructure in place, so using the yum module functionality to deploy our updates (our web apps are packaged into RPMs and inserted into a private repo) makes a lot of sense for us (vs. using our current custom tools).

Generally speaking we want to do what I think is a fairly ‘normal’ process:

send out the update using the new v0.7 rolling updates feature to work on machines in groups, update via the yum module, and additionally use the delegation feature to execute post deployment functional tests and message the load balancer (Michael, thank you for both of these features, very excited about these)
if there is a failure of one node abort the entire deployment and return group members to the former code version

It is the latter step here that I’m not sure how to implement. ie: how to have the playbook react conditionally to failures. Lets say one host out of 10 fails one of its playbook steps. It is my understanding right now said host will not have any more actions deployed on it, but the tasks will proceed on the other hosts in the group. I am unclear if it is possible to stop playbook execution for all hosts if there is an error on a single host. What we’d want to avoid is having code go out on all the members of a group that is, for example, so messed up it won’t startup, and have the playbook continue to run on all hosts to the point where the post-deployment validation step fails, the playbook stops executing for that host, and then proceeds to the next host, thus leaving the load balanced pool with no members.

And if we were able to stop the entire playbook run, we’d still be left with N number of VMs (where N = number of hosts that had failed tasks) that are in some sort of failed state. Ideally we’d want some conditional logic that if particular tasks fail that operations can be run to their “clean up” the failed hosts or, more optimally for our plans, stop the deployment and revert all hosts to the prior release (even if they updated successfully).

I’ve got two goals with this post:

1, make sure that I am not missing some fundamental Ansible features related to the above questions
2. if I am describing functionality that does not exist if adding that functionality would be simply authoring new module(s) or adding some more fundamental logic extensions to how playbooks respond to failed tasks?

Thanks to everyone for their time and code contributions,
Jonathan

Michael_DeHaan · September 18, 2012, 7:12pm

There's a feature that should be added here.

We need to add something like:

tasks:
   - name: blarg
     action: blarg
     notify_on_failure: 'recover me'

handlers:
- name: 'recover me'
action: glorp

So if the task fails, it does NOT record as a failure except in stats
and then schedules the given handler to be run.

Patches would be accepted.

Most config management tools don't have a way to do this, and it would
be pretty cool, however, it's a bit hard to tell what specifically
would go wrong and what the appropriate response was. As such,
figuring out how to use this would be left up to the reader, and
obviously, rollbacks are all up to your application.

---Michael

Jonathan1 · September 18, 2012, 8:07pm

Michael,

Thanks for the response.

Regarding how to tell when something has gone wrong: in an Ansible-context (but specific to our plans) I’ve always thought of the evaluation of something being wrong would have to be custom code, likely executed by the shell module.

Is there interest from other community members in functionality like this? Or are others using different tools entirely for their code distribution (ControlTier, etc…)?

We’re going to look through the code to research the level-of-effort for our group to possibly come up with a patch proposal for this.

cheers,
Jonathan

Michael_DeHaan · September 18, 2012, 8:47pm

Michael,

Thanks for the response.

Regarding how to tell when something has gone wrong: in an Ansible-context
(but specific to our plans) I've always thought of the evaluation of
something being wrong would have to be custom code, likely executed by the
shell module.

Is there interest from other community members in functionality like this?

It doesn't matter if there isn't, there's quite a lot of interest from /me/.

Or are others using different tools entirely for their code distribution

TONS of folks are using ansible for app deployment -- it's the one
thing it does REALLY REALLY well compared to other
app tools that are order based or DSLey. If anything, this is what
Ansible is going to focus on to differentiate itself more
(orchestration in general, I'll share more on thoughts around this
later).

Anyway -- The whole reason for creating ansible was because I thought
the whole seperation between config+deploy+ad-hoc was terribly
terribly
broken, more or less.

The idea that someone needs to turn to Fabric or Capistrano from
within ansible should be looked at as 'what do we need to fix so
people don't have to switch contexts'. Simple as that.

(Infrastructure is data, not code -- deferral is mission failure)

We're going to look through the code to research the level-of-effort for our
group to possibly come up with a patch proposal for this.

It's about a 15 line patch, tops. I could knock it out, but I feel
it's important to get folks to share in the effort, so more people
learn the codebase and it evolves with the needs of people to work on
it in mind. If you get stuck on this, let me know.

I am assuming when a host fails it does not run anything else in
'tasks', and then just runs the notified failure handlers (but not the
regular notify handlers). Should be pretty straightforward.

Jonathan1 · September 18, 2012, 9:20pm

Michael,

Agreed on your assessment. I absolutely want to avoid running an additional deployment tool on top of all the Ansible work we have in place. Defeats the whole purpose…

We’ll report back, likely in a week or two as we need to sort out who exactly is going to work on this internally.

onwards and upwards,
Jonathan

Michael_DeHaan · September 18, 2012, 10:19pm

Offhand, there's more than one thing I'd like to see here.

Not only:

- notify_on_failure: "handler name"

But also a global concept

failure_handlers:
- name: foo
action: foo

For if we have a module, to say, notify some remote resource or take
some action in response, it would be great if that didn't have to be
done by adding "notify_on_failure" to every single thing.

There is probably also the idea of tagging a handler for
"notify_on_failure" to "recoverable: True" such that if the task had
failed, but the handler suceeds, maybe the playbook could keep
rolling. This is a bit of a nice-to-have-though, and wouldn't need
to be in the initial cut. I've been asked about it before though.

--Michael

Topic		Replies	Views
soliciting feedback on ideas about playbook error handling in regards to code deployment Ansible Project	10	0	November 28, 2012
ansible-playbook exits if host list empty, even if later plays could still run Ansible Developer	6	2	March 11, 2016
Really can't do this in Ansible? Get Help playbook	8	86	July 23, 2025
Forcing a stop on failure Ansible Project	2	16	October 22, 2015
Defining Actions to be run in the event of Failure for a Host Ansible Project aws	3	6	November 17, 2014

I am missing something about playbook orchestration features?

Related topics