On replaying notifications after failures

Hello all,

I am curious if/how Ansible plans to solve the “replay notifications” issue. I am having exactly the same issue as reported in this StackOverflow question (the author does a great job of describing it): http://stackoverflow.com/questions/21538516/ansible-how-to-replay-notifications

I find it is easy to prevent that by using tasks with “when:” statements instead of notifications. But that is really a workaround; notifications are a great, easy-to-use feature, and there should be a default way to replay them, or at least a list of queued-but-not-executed notifications whenever a task fails.
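
Here is roughly what I mean, as a minimal sketch of one way to do it (the module arguments, paths and names are just illustrative):

    - name: update nginx config
      template: src=nginx.conf.j2 dest=/etc/nginx/nginx.conf
      register: nginx_conf

    # record that a restart is still owed; the marker file survives
    # a failed run, so the restart still happens on the next run
    - name: mark that a restart is pending
      file: path=/var/run/nginx_restart_pending state=touch
      when: nginx_conf.changed

    - name: check for a pending restart
      stat: path=/var/run/nginx_restart_pending
      register: restart_pending

    - name: restart nginx
      service: name=nginx state=restarted
      when: restart_pending.stat.exists

    - name: remove the pending-restart marker
      file: path=/var/run/nginx_restart_pending state=absent
      when: restart_pending.stat.exists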

Thanks,
jmonteiro

There was a post about this last week proposing the addition of a --force-handlers option.

This can be done, though we’re currently chasing some other items.

Pull requests would be welcome.

Usually the task that notifies a handler has made changes that affect the behavior of the systems the handler acts on. If the notifying task fails but its handler is forced to run anyway, the behavior of those systems could be unpredictable or unwanted. For example, if a task that changes a server’s configuration fails, forcing the execution of a handler that reloads/restarts the server could leave the server unable to operate properly, or to serve at all. So I think that a ‘--force-handlers’ option is quite risky and could lead to unpredictable behavior. It would be better to let users control the (selective) replaying of the handlers only after the failure occurs.
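
To make the scenario concrete, consider a typical play like this (a minimal sketch, with illustrative names):

    - hosts: webservers
      tasks:
        - name: deploy nginx config
          template: src=nginx.conf.j2 dest=/etc/nginx/nginx.conf
          notify: restart nginx

      handlers:
        - name: restart nginx
          service: name=nginx state=restarted

If a later task in the play fails after the configuration has already been changed, blindly forcing ‘restart nginx’ would restart the server against a possibly half-applied configuration.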

"For example, if a task that changes a server’s configuration fails, forcing the execution of a handler that reloads/restarts the server could lead to the server failing to operate properly or be able to serve at all. "

This is why it’s a command line option to be used only when desired.

Its being a command-line option does not help, because you do not know before running the command that any of your tasks is going to fail, or how. I think the correct analysis of the problem is this: you have designed a sequence of deployment tasks that should run in a specific order. Your task specification language (the Ansible playbook) cannot know the best thing to do if task execution is abnormally interrupted. Only you, the Ansible user, can know what should be done.

Normally you expect everything to run fine, but if something does go wrong (and things can still go wrong in production deployments, even after passing tests), you want the run to stop as early as possible. At the same time, you want tools that help you mitigate the problems caused by the abnormal interruption, and you want full control over those tools. You do not want the tools to decide for you what should be done. You are the one to decide, and you cannot decide before you actually see what failed and how. So you need a tool to run the already-notified handlers (or some of them), but you will use it only if it is good for the health of your system, and you cannot judge that until you actually see what happened.

“Its being a command-line option does not help, because you do not know before running the command that any of your tasks is going to fail, or how.”

Incorrect, because you would use this when running the retry command only.

Well, it did not occur to me that you could actually use that option after the failure. However, I think that a more controllable tool for running handlers selectively could be more powerful for recovering from unexpected deployment failures. For example, have Ansible generate a file with a list of notified-but-not-executed handlers, which you can edit as you want and then pass to a ‘--handlers-file’ option, in a similar fashion to how limit files work for limiting hosts.
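
Concretely, something like this (entirely hypothetical; neither the file nor the option exists today):

    # deploy.handlers: auto-generated list of notified-but-not-executed
    # handlers, one per line; edit freely before retrying
    restart nginx
    reload haproxy

and then:

    $ ansible-playbook deploy.yml --handlers-file @deploy.handlers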

I wouldn’t want the system to generate two separate files, but it could generate a new kind of retry file instead.

We can think about this.

I like the idea of having a powerful “--retry @retryfile” option with sensible defaults. The retry file could be as simple as a yaml file like the one sketched at the end of this message.

The “hosts” list would be auto-generated with the hosts that failed, but it would be possible to remove/add hosts, or to use a host selection pattern instead of a list.

The “start_at” would be auto-set to the task that failed, since usually you don’t want to retry from the beginning. But it could be removed to retry from the beginning, or changed to another task before or after the one that failed. The last option (retrying from just after the failed task) could be useful when you think the failure is not that important and you don’t want to spend time fixing it right away, but want a quick workaround: bypass it first, fix it properly later.

The “notify” directive would force the notification of the handlers in the list. The list would initially be auto-generated with the handlers that had already been notified before the failure. The user would then have the option to manipulate it according to what he thinks is best for recovering from the failure: remove some items, remove the whole list, or even add any extra handlers he thinks are necessary.

The “tags” directive would be auto-set to “all”, to retry tasks whatever their tags may be, but it could be restricted by passing a list of specific tags.

To make it even more powerful, the retry file could also support “pre_tasks” and “post_tasks” lists of one-time, ad-hoc tasks that the user could write to quickly work around unpredicted problems caused by the failure, before making a proper fix in the playbooks.

What do you think?
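
Something along these lines, for example (a rough sketch; every directive is just my proposal, and the host and task names are illustrative):

    ---
    # auto-generated after a failed run; edit before retrying
    hosts:                          # the hosts that failed; can be
      - web1.example.com            # edited, or replaced by a host
      - web2.example.com            # selection pattern
    start_at: deploy nginx config   # the task that failed; remove to
                                    # retry from the beginning
    notify:                         # handlers already notified before
      - restart nginx               # the failure; edit freely
    tags: all                       # or a list of specific tags
    pre_tasks: []                   # optional one-time, ad-hoc tasks
    post_tasks: []                  # to run around the retry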

I don’t think a “start_at” is reasonable, because different tasks may fail at different points on different hosts.

It should only contain (for now) the list of tasks and handlers to force, and should rely on the playbook to do the right thing when re-run.

“list of tasks and handlers” => “list of failed hosts and handlers”.

This scenario could be easily handled by producing a separate retry file for each failed task. Each file would have a ‘start_at’ directive preset to the failed task and a ‘hosts’ list with the hosts that failed on this task. Also, the ‘--retry’ option would support passing multiple retry files.
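
For example (hypothetical file and task names):

    # retry-001.yml
    start_at: deploy nginx config
    hosts:
      - web1.example.com

    # retry-002.yml
    start_at: restart workers
    hosts:
      - web2.example.com

    $ ansible-playbook site.yml --retry @retry-001.yml --retry @retry-002.yml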

"This scenario could be easily handled by producing a separate retry file for each failed task. "

I’m not a fan of this, it would generate too many retry files.

Then another solution would be to have a single file with a separate yaml section for each failed task, just as with multiple plays in one playbook.
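
For example, with the same hypothetical directives as before:

    - start_at: deploy nginx config
      hosts: [web1.example.com]

    - start_at: restart workers
      hosts: [web2.example.com]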

We’re not going to do that.

But I am open to a new retry file that lists handlers that were notified but didn’t get to run.

We need to attack other things in the queue before researching this, though.