Soliciting feedback on ideas about playbook error handling with regard to code deployment

Hi Super People,

I wanted to ping the collective about two thoughts I’ve been having about error handling.

Idea One:

This is an idea that was briefly discussed with Michael before: a “notify on failure” concept, in which handler(s) would be executed on failures rather than on successes. A task could thus have two sets of handlers: one for when things go well, one for when they go badly. The way I am thinking of it, the host where the failure occurred would be taken out of the host list after the notify_on_failure handler(s) executed.

example:

- hosts: all
  tasks:
  - name: upgrade code
    action: yum pkg=foo state=newest
    notify: test and start foo
    notify_on_failure: rollback foo

My rationale for this is around code deployments. I know there are related ideas floating around about this, like: https://github.com/ansible/ansible/issues/1312, and https://github.com/ansible/ansible/issues/127

Generally curious whether there is still interest here (I think there is, but not enough for actual work yet) or whether there is mental momentum toward another approach for triggering handlers when things don’t go right.

Idea Two:

A “global” stop on a playbook task if a host, or a threshold of hosts, fails a task. This is different from what I believe the current default behavior is: stopping execution of a play on an individual host if a task fails (and ignore_errors is not set) but continuing on other hosts. The rationale here is again around code deployment. The desired outcome is for the task to stop executing on all hosts if it fails on one host, or on a threshold of hosts. The notion is that a failed code deployment (let’s say the failure is detected by a test script failing after a yum-triggered upgrade) should not have to go through the process of failing and rolling back on all the other hosts if it is obviously a bad release. This idea could also be extended to allow a percentage of hosts in the task to fail. Let’s say on a farm of 100 app servers three are broken somehow, but not removed from inventory; you wouldn’t want to stop your deployment for three failed tests, but you would want to stop it for 15 failures (out of 100).

I also see some extra complexity in implementing this, as the play wouldn’t be able to stop executing at the exact moment a task fails, but rather at the “top of the loop” for the next set of hosts in line for execution. Obviously this approach is not going to work if you are deploying code to 100 servers with an argument of -f 100, as they would all fail in parallel. The expectation is that code deployments would be executed in smaller chunks, using small -f values or the serial play argument.

- hosts: all
  stop_on_failure: True

  (or)

  stop_on_percentage_threshold: 15

  tasks:
  - name: upgrade code
    action: yum pkg=foo state=newest
    notify: test and start foo
    notify_on_failure: rollback foo
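To make the threshold idea concrete, here is a small sketch (plain Python, not Ansible internals; the function name and the idea of checking between batches are my own invention) of the arithmetic a hypothetical stop_on_percentage_threshold would perform:

```python
# Sketch of the failure-threshold check a hypothetical
# stop_on_percentage_threshold option might perform at the
# "top of the loop" between batches of hosts.
# All names here are illustrative, not real Ansible internals.

def should_abort(failed_hosts, total_hosts, threshold_pct):
    """Return True if the failure percentage meets or exceeds the threshold."""
    if total_hosts == 0:
        return False
    failed_pct = 100.0 * failed_hosts / total_hosts
    return failed_pct >= threshold_pct

# 3 broken app servers out of 100: keep deploying.
print(should_abort(3, 100, 15))   # False
# 15 failures out of 100: stop the rollout.
print(should_abort(15, 100, 15))  # True
```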

My main goal here is to see if there is interest from the community on these ideas. From an app deployment perspective they seem useful, but it would be good to know if I am delusional or missing existing functionality.

cheers
Jonathan

There's been enough sporadic interest in some kind of failure handler
that would run on error and allow one to 'recover' the play.
Unfortunately the architecture of the system is designed so that all
hosts in a "batch" (or all hosts in a play, if serial is not set) are
running the same task at the same time.

I am generally leery of the idea that a rollback ever works,
especially when data is involved.

Ultimately I worry about adding too much magic and semantics to
Ansible that is specific to a few niche use cases, or when others
would possibly want to do something differently.

Open to ideas, but I want to avoid adding additional keywords and
complexity where possible.

Right now you *sort of* can implement this using "ignore_errors" and
"only_if" in conjunction, though it's a bit gross. Maybe some
syntactic sugar around that, rather than using handlers would be
better.

--Michael

comments inline

There’s been enough sporadic interest in some kind of failure handler
that would run on error and allow one to ‘recover’ the play.
Unfortunately the architecture of the system is designed so that all
hosts in a “batch” (or all hosts in a play, if serial is not set) are
running the same task at the same time.

I am generally leery of the idea that a rollback ever works,
especially when data is involved.

well… I’d like to be optimistic that a rollback can work :wink: My primary goal is to make sure that, after playbook execution, I don’t have a service that has fallen on its face. So, following your argument, let’s say you are not technically rolling back but instead running a task with delegate_to to tell your LB to switch to a backup pool, leaving your totally horked install in place… the key thing for me is reacting, conditionally, to failures and taking different execution paths. More on that below.

Ultimately I worry about adding too much magic and semantics to
Ansible that is specific to a few niche use cases, or when others
would possibly want to do something differently.

Open to ideas, but I want to avoid adding additional keywords and
complexity where possible.

Right now you sort of can implement this using “ignore_errors” and
“only_if” in conjunction, though it’s a bit gross. Maybe some
syntactic sugar around that, rather than using handlers would be
better.

I see where you are going with this. I’ve been playing around with some examples, and for the rather simplistic deployment process I have in mind, I think we could cobble something together with ignore_errors, register, and then only_if or the new when_[string / integer / unset, etc…]. I fear the playbook is going to be a little ugly, but it could likely get the job done in terms of conditionally executing tasks when something goes wrong. So I’ll try this road first rather than working on adding a “notify_on_error”, and I’ll see how the results are.
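For concreteness, the kind of cobbled-together playbook I have in mind looks roughly like this. It is an untested sketch in the old-style syntax used in this thread; the test and rollback script paths are made up, and the exact only_if expression syntax may need adjusting:

```yaml
- hosts: appservers
  tasks:
  - name: upgrade code
    action: yum pkg=foo state=newest
    register: upgrade
    ignore_errors: True

  - name: smoke test new code
    action: command /usr/local/bin/test_foo.sh
    register: smoke
    ignore_errors: True

  - name: rollback foo
    action: command /usr/local/bin/rollback_foo.sh
    only_if: "'${smoke.rc}' != '0'"
```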

This still leaves a question for me about aborting playbook execution entirely, and if anyone sees value in that. Here is a more specific example that could likely apply to other users:

A simple web app using a relational database and an external tool for schema updates. The goal is to deploy a schema update and then deploy our application code. What I don’t want to do is continue through the playbook if the schema update fails to execute. If I know I don’t have the required schema in place, it makes no sense to continue to the next play in the playbook… Technically it would be possible to use an only_if or when_string, etc… to check the state of the schema update result, but that will get extra ugly if I am already using only_if, when_string, etc… for error handling on the deployment (which I’ve illustrated in the example below).

Advice on either handling this, or input on possible playbook language extensions, would be most appreciated.

(example playbook for above scenario follows):
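Roughly, the scenario looks like this (a hand-written sketch; the schema tool and package names are made up, and carrying the registered result from the database play into the app play is exactly the awkward part under discussion):

```yaml
- hosts: dbservers
  tasks:
  - name: apply schema update
    action: command /usr/local/bin/schema_tool --apply
    register: schema
    ignore_errors: True

- hosts: appservers
  tasks:
  - name: deploy app code
    action: yum pkg=myapp state=newest
    only_if: "'${schema.rc}' == '0'"
```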

I think I'd accept something like any_errors_fatal: True/False
(default False) to extend present behavior.

Right now a whole set of hosts has to fail all the way through, or a
batch size if using serial: N.
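If such a keyword existed, usage might look like the following (hypothetical sketch; any_errors_fatal is only being proposed here, not an implemented option at the time of writing):

```yaml
- hosts: appservers
  serial: 10
  any_errors_fatal: True
  tasks:
  - name: upgrade code
    action: yum pkg=foo state=newest
```

With serial: 10, a single failure in a batch would abort the remaining batches instead of letting every batch fail in turn.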

I don’t know if this makes sense, but extending only_if outside of each task to group tasks might go a long way toward providing more branch control:

only_if: $result.rc

- name: ladi
  action: whatever

- name: da
  action: blah

- include: moreactions.yml

Yeah, this is not going to happen.

When ansible was created it was designed to NOT be a programming language.

In most cases, if you want this kind of behavior, you are doing it to
check on certain properties, which is when "group_by" starts to kick
major butt.

group_by your distro type, and then make a play that does things to
just your BSD hosts, or just your hosts with not a lot of RAM, etc,
and completely skip needing to use only_if.

I maintain that only_if is LARGELY a sign that something is modelled a
bit wrong.

There are, of course, exceptions, and I know people disagree with me
and make heavy use of it. Personally, I think it looks a bit ugly,
but it's a necessary evil.

In any event, the nested grouping stuff isn't going to happen, as I
made a conscious decision in the beginning to minimize the amount of
YAML- or programming-ness that entered into Ansible, and I think
keeping that complexity out of the language has done us well.

In any event, another plug for group_by. See the module docs if this
is new to you!
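For anyone who hasn't seen it, a minimal group_by usage looks something like this (a sketch from memory of the module docs; the group name and follow-on task are my own invention):

```yaml
- hosts: all
  tasks:
  - action: group_by key="os-${ansible_distribution}"

- hosts: os-FreeBSD
  tasks:
  - name: BSD-only tuning
    action: command /usr/local/bin/bsd_tuning.sh
```

The second play then runs only on the hosts that landed in the dynamically created group, with no only_if needed.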

--Michael

I think it looks a bit ugly, but it's a necessary evil.

I couldn't agree more. I remember reading about group_by, but completely
forgot about it, which makes my suggestion irrelevant.

I have a case where group_by could be made more useful (but cannot be used as-is). The software is called OSSEC, and for installation the agent requires a key that is (pre-)created on the server.

Since creating the key (and doing the other enterprisy stuff we do) takes a long time, we only really want to call that script if the key has not already been created. Since this is (possibly) during provisioning, the server may not be up and running (yet).

So based on the output of a raw command (which is a grep of the hostname in the line-based keys file) we want to determine which systems still need a key generated. So what we really want to do is something like:

////////
- name: test-key
   hosts: linux
   user: some_user
   gather_facts: no
   tasks:
   - action: raw grep "${network_fqdn}" ~/ossec/local.client.keys 2>/dev/null
     delegate_to: ${ossecservers[0]}
     register: ossec_key

   - action: group_by key="need-ossec-key"
     only_if: not is_set('${ossec_key}')

- name: create-key
   hosts: need-ossec-key
   user: some_user
   serial: 1
   gather_facts: no
   tasks:
   - action: raw /some/path/to/some/script -c '${network_fqdn}' -some -option
     delegate_to: ${ossecservers[0]}
     register: script

- name: test-key-again
   hosts: need-ossec-key
   user: some_user
   gather_facts: no
   tasks:
   - action: raw grep "${network_fqdn}" ~/ossec/local.client.keys 2>/dev/null
     delegate_to: ${ossecservers[0]}
     register: ossec_key

   - action: group_by key="lack-ossec-key"
     only_if: not is_set('${ossec_key}')

- name: escalate
   hosts: lack-ossec-key
   gather_facts: no
   tasks:
   - local_action: |
       mail to="someone" subject="failed"

   - action: fail msg="failed"
////////

So basically, group_by only works if you have a clear variable that can be used to group by. In this case we do not have such a variable; we only have an evaluation, and only_if cannot be used together with group_by.

So what I would need in this case is a way to evaluate which servers should be grouped, e.g.:

   - action: group_by key="need-ossec-key" when="not is_set('${ossec_key}')"

Would such a change be acceptable?

So I'm having some issues with that hypothetical syntax but I grasp the idea.

The idea is that key is what you are grouping around; it is the name of
a fact or variable. The key has to change based on the name of the
group.

What you are passing in to the new hypothetical "when=" seems more
like an "eval=" to me. However, that seems like it needs more
simplification too.

It seems (shudder) that what this really needs is an eval "lookup
plugin" to make this work, one that evaluates a Python string so that
the module can stay dumb.

group_by: key="need-osc-key-$EVAL(2+2)"

Basically it would assume what was inside the EVAL was a python string.

Obviously this is much easier in Jinja2.

Ultimately I'm torn on this; the syntax makes my eyes bleed, and this
feels unlike Ansible, especially when you throw the "is_set" and so on
in there. I think you are better off using only_if for your specific
case.

--Michael

Ok, so Daniel made this patch which makes it all much easier.

     https://gist.github.com/4161687

This makes group_by work with conditionals as most people will expect.

So I can now simply do:

////////
- name: test-key
   hosts: linux
   user: some_user
   gather_facts: no

   tasks:
   - action: raw grep "${network_fqdn}" ossec/local.client.keys 2>/dev/null
     delegate_to: ${ossecservers[0]}
     register: ossec_key
     when_unset: $ossec_key

   - action: group_by key="need-ossec-key"
     when_bool: ${ossec_key.stdout}

- name: create-key
   hosts: need-ossec-key
   user: some_user
   gather_facts: no
   serial: 1

   tasks:
   - action: raw /some/path/to/some/script -c ${network_fqdn} -more -options
     delegate_to: ${ossecservers[0]}
     register: script

- name: test-key-again
   hosts: need-ossec-key
   user: some_user
   gather_facts: no

   tasks:
   - action: raw grep "${network_fqdn}" ossec/local.client.keys 2>/dev/null
     delegate_to: ${ossecservers[0]}
     register: ossec_key

   - action: group_by key="lack-ossec-key"
     when_bool: ${ossec_key.stdout}

- name: escalate
   hosts: lack-ossec-key
   gather_facts: no

   tasks:
   - action: mail to="ossec-team" subject="script fails with ${script.stdout}"
   - action: fail msg="Problem generating ossec key"
////////

This works well.

Now I need to find out why:

  - the raw module stalls for some ssh connections randomly (for no
    particular reason systems that work stall forever the next run, usually
    between 1-3 in a run of 25) using ssh transport

  - the raw module returns successful with no output when there is an
    authentication error (and when delegated) using ssh transport

How I love Solaris :slight_smile:

Yeah, I like this. This is clean, ship it!