Hi Super People,
I wanted to ping the collective about two thoughts I’ve been having about error handling.
Idea One:
This is something I briefly discussed with Michael before: a “notify on failure” concept, in which handler(s) would be executed on failures rather than on successes. A task could then have two sets of handlers: one for when things go well, one for when they go badly. The way I am thinking of it, the host where the failure occurred would be taken out of the host list after the notify_on_failure handler(s) have run.
Example:

- hosts: all
  tasks:
    - name: upgrade code
      action: yum pkg=foo state=newest
      notify: test and start foo
      notify_on_failure: rollback foo
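To make the mechanics concrete, here is a minimal Python sketch of the dispatch logic I have in mind. This is purely illustrative, not actual Ansible internals; the task/handler structures and the run_task function are made up for the example:

```python
# Hypothetical sketch (not Ansible internals): run a task across hosts,
# fire success handlers or failure handlers per host, and drop hosts
# that failed from the host list used by subsequent tasks.

def run_task(task, hosts, handlers):
    surviving = []
    for host in hosts:
        ok = task["run"](host)              # execute the task on this host
        if ok:
            for name in task.get("notify", []):
                handlers[name](host)        # normal handlers on success
            surviving.append(host)
        else:
            for name in task.get("notify_on_failure", []):
                handlers[name](host)        # rollback handlers on failure
            # the failed host is NOT kept for subsequent tasks
    return surviving
```

The key point is the last branch: the rollback handler gets a chance to run on the broken host, and only then is the host removed from the play.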
My rationale for this is around code deployments. I know there are related ideas floating around, like https://github.com/ansible/ansible/issues/1312 and https://github.com/ansible/ansible/issues/127.
I am generally curious whether there is still interest in this here (I think there is, but not enough for actual work yet), or whether there is mental momentum toward another approach to triggering handlers when things don’t go right.
Idea Two:
A “global” stop on a playbook task if a host, or a threshold of hosts, fails the task. This is different from what I believe the current default behavior is: stop executing the play on an individual host if a task fails (and ignore_errors is not set), but continue on the other hosts.

The rationale here is again code deployment. The desired outcome is for the task to stop executing on all hosts if it fails on one host, or on a threshold of hosts. The notion is that a failed code deployment (let’s say the failure is detected by a test script failing after a yum-triggered upgrade) should not have to go through the process of failing and rolling back on all the other hosts if it is obviously a bad release.

This idea could also be extended to allow a percentage of the hosts in the task to fail. Let’s say that on a farm of 100 app servers three are broken somehow, but not removed from inventory: you wouldn’t want to stop your deployment over three failed tests, but you would want to stop it for 15 failures (out of 100).
Also, I do see some extra complexity in implementing this: the play wouldn’t be able to stop executing at the exact moment a task fails, only at the “top of the loop” before the next set of hosts in line for execution. Obviously this approach is not going to work if you are deploying code to 100 servers with -f 100, as they would all fail in parallel. The expectation is that code deployments would be executed in smaller chunks, using small -f values or the serial play argument.
- hosts: all
  stop_on_failure: True
  # or
  stop_on_percentage_threshold: 15
  tasks:
    - name: upgrade code
      action: yum pkg=foo state=newest
      notify: test and start foo
      notify_on_failure: rollback foo
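For the threshold variant, here is a rough Python sketch of the “top of the loop” check described above. Again, all names (run_play, fork_count, percentage_threshold) are made up for illustration and are not real Ansible options. Hosts are processed in batches of fork_count; before each batch starts, the cumulative failure percentage is checked and the whole play is aborted if it has crossed the threshold:

```python
# Hypothetical sketch of a "stop on failure threshold" play loop.
# Hosts run in batches of fork_count; the abort check happens at the
# top of the loop, i.e. before each new batch is dispatched.

def run_play(hosts, run_on_host, fork_count=5, percentage_threshold=15):
    total = len(hosts)
    failed = 0
    for start in range(0, total, fork_count):
        # "top of the loop" check: bail out before starting the next batch
        if 100.0 * failed / total > percentage_threshold:
            raise RuntimeError(
                "aborting play: %d of %d hosts failed" % (failed, total))
        batch = hosts[start:start + fork_count]
        for host in batch:          # in reality this batch runs in parallel
            if not run_on_host(host):
                failed += 1
    return failed
```

This also shows the limitation mentioned above: a batch that is already in flight runs to completion, so with fork_count equal to the total host count the check never gets a chance to fire.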
My main goal here is to see if there is interest from the community in these ideas. From an app-deployment perspective they seem useful, but it would be good to know if I am delusional or just missing existing functionality.
cheers
Jonathan