On replaying notifications after failures

Julio_Monteiro · March 15, 2014, 5:12pm

Hello all,

I am curious if/how Ansible plans on solving the “replay notifications” issue. I am having the exact same issue as reported on this StackOverflow question (the author does a great job describing the issue): http://stackoverflow.com/questions/21538516/ansible-how-to-replay-notifications

I find that it is easy to prevent that by, instead of using notifications, using tasks with “when:” statements. But I really find that this is a workaround and notifications are great and easy features – it should be a default way to replay them, or at least have a list of queued-but-not-executed notifications whenever a task fails.

Thanks,
jmonteiro

Michael_DeHaan1 · March 17, 2014, 1:51pm

There was a post about this last week about adding a --force-handlers statement.

This can be done though we’re currently chasing some other items presently.

Pull requests would be welcome.

Ernest0x · March 18, 2014, 12:22pm

Usually the task that notifies a handler has made some changes that affect the behavior of systems involved by the actions being taken in the handler. If the notifying task fails, but its handler is forced to run, then the behavior of the involved systems could be unpredictable or unwanted. For example, if a task that changes a server’s configuration fails, forcing the execution of a handler that reloads/restarts the server could lead to the server failing to operate properly or be able to serve at all. So, I think that a ‘–force-handlers’ option is quite risky and could lead to unpredictable behavior. It would be better to let users control the (selective) replaying of the handlers only after the failure occurs.

Michael_DeHaan1 · March 18, 2014, 12:44pm

"For example, if a task that changes a server’s configuration fails, forcing the execution of a handler that reloads/restarts the server could lead to the server failing to operate properly or be able to serve at all. "

This is why it’s a command line option to be used only when desired.

Ernest0x · March 18, 2014, 2:18pm

Being it a command line option does not help, because you do not know before running the command that any of your tasks is going to fail and how. I think that the correct analysis of the problem is like this: You have designed a sequence of deployment tasks that should be run in a specific order. Your task specification language (ansible playbook) does not know what is the best thing to do if task execution is abnormally interrupted. Only you, the ansible user, will be able to know what should be done. Normally you expect everything to run fine, but, if something goes wrong (and things can still go wrong on production deployments, even after passed tests), you want your action to stop as early as possible. At the same time you want to have the tools that will help you to mitigate the problems caused by the abnormal interruption. And you want full control over these tools. You do not want the tools to decide for you on what should be done. You are the one to decide, and you can’t do so before you actually see what and how it failed. So, you need a tool to run already notified handlers (or part of them), but you will use it only if it is good for the health of your system. You cannot decide about that until you actually see what happened.

Michael_DeHaan1 · March 18, 2014, 2:20pm

“Being it a command line option does not help, because you do not know before running the command that any of your tasks is going to fail and how.”

Incorrect, because you would use this when running the retry command only.

Ernest0x · March 18, 2014, 2:40pm

Well, it did not occur to me that you could actually use that option after the failure. However, I think that a more controllable tool to run handlers selectively could be more powerful for recovering from unexpected deployment failures. For example, have ansible generate a file with a list of notified but not executed handlers, which you can edit as you want and then pass it to a ‘–handlers-file’ option, in a similar fashion to how limit files work for limiting hosts.

Michael_DeHaan1 · March 18, 2014, 2:42pm

I wouldn’t want the system to generate two seperate files, but it could generate a new system for retries.

We can think about this.

Ernest0x · March 19, 2014, 10:29am

I like the idea of having a powerful “–retry @retryfile” option with sensible defaults. The retry file could be as simple as a yaml file like the following: The “hosts” list would be auto-generated with the hosts that have failed, but it will be possible to remove/add some hosts, as well as use a host selection pattern instead of a list. The “start_at” would be auto-set to the task that has failed, since usually you don’t want to retry from the beginning. But could be removed to retry from the beginning or changed to another task before or after the task that failed. The last option (to retry starting after the failed task) could become useful in case you think that the failure is not that important and you don’t want to spend time fixing it at the time of the occurrence, but want a quick workaround by bypassing it at first and fixing it later. The “notify” directive would force the notification of the handlers in the list. This list would initially be auto-generated with handlers that had already been notified before the failure. The ansible user will have the option to manipulate the list according to what he thinks is best for recovering from the failure. He could remove some items from the list or remove the whole list. He could even add any extra handlers he thinks that are necessary. The “tags” directive would be auto-set to “all” to retry tasks whatever their tags may be, but could also be restricted by passing a list with specific tags. To make it even more powerful, the retry file could even support “pre_tasks” and “post_tasks” lists of one-time, ad-hoc tasks that the ansible user could quickly write to quickly work around unpredicted problems caused from an unexpected failure, before making a proper fix in his playbooks. What do you think?

Michael_DeHaan1 · March 19, 2014, 12:26pm

I don’t think a start-at is reasonable because different tasks may fail at different points on different hosts.

It should only contain (for now) the list of tasks and handlers to force, and should rely on the playbook to do the right thing when re-run.

Michael_DeHaan1 · March 19, 2014, 12:27pm

“list of tasks and handlers” => “list of failed hosts and handlers”.

Ernest0x · March 19, 2014, 1:11pm

This scenario could be easily handled by producing a separate retry file for each failed task. Each file would have a ‘start_at’ directive preset to the failed task and a ‘hosts’ list with the hosts that failed on this task. Also the ‘–retry’ would support passing multiple retry files.

Michael_DeHaan1 · March 19, 2014, 1:56pm

"This scenario could be easily handled by producing a separate retry file for each failed task. "

I’m not a fan of this, it would generate too many retry files.

Ernest0x · March 19, 2014, 2:28pm

Then another solution would be to have a single file with a separate yaml section for each failed task, like it is with multiple plays in one playbook.

Michael_DeHaan1 · March 19, 2014, 2:55pm

We’re not going to do that.

But I am open to a new retry file that lists handlers that didn’t succeed.

We need to attack other things int he queue before researching this though.

Topic		Replies	Views
feature idea: remember the notified handlers in .retry files on fail/error Ansible Developer	6	7	November 1, 2013
Handlers and failures Ansible Project	5	5	February 9, 2014
How to replay notifications Ansible Project	0	5	February 5, 2014
Feature idea: remember notifications when errors occured Ansible Project	2	3	June 13, 2013
Retrying notify handlers on subsequent runs? Ansible Project ansible-project	2	6	December 18, 2012

On replaying notifications after failures

Related topics