I’ve thought about this a bunch, but it’s really hard. Many of our tasks require data from previous tasks, and often those previous tasks will “succeed” in one run only to have a later dependent task fail. Figuring out which ones to re-run there is nearly impossible.
Since Ansible is designed to support idempotence, we make sure that we can re-run any of our playbooks at will. Tasks which have already completed will finish fast as ‘unchanged’, and only the tasks that haven’t run yet will make new changes. Trying to bake something into Ansible to re-run only failed tasks is probably going to cause too many gotchas to be really useful.
For your own setup, you could make use of --start-at-task if you really know where you can skip ahead to.
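To make the idempotence point above concrete, here is a rough sketch in plain Python rather than Ansible. The helper `ensure_line` and the file name `app.conf` are made up for illustration and are not part of Ansible; the idea is only that a second run over an already-satisfied state does no work and reports ‘unchanged’.

```python
import os
import tempfile


def ensure_line(path, line):
    """Hypothetical stand-in for an idempotent task.

    Appends `line` to the file at `path` only if it is missing,
    and reports whether anything was changed.
    """
    content = ""
    if os.path.exists(path):
        with open(path) as fh:
            content = fh.read()
    if line in content.splitlines():
        return "unchanged"  # desired state already holds, do nothing
    with open(path, "a") as fh:
        # Keep the new line separate if the file lacks a trailing newline.
        if content and not content.endswith("\n"):
            fh.write("\n")
        fh.write(line + "\n")
    return "changed"


def demo():
    """Run the same 'task' twice against a scratch file."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "app.conf")  # illustrative file name
        first = ensure_line(path, "max_connections=100")
        second = ensure_line(path, "max_connections=100")
        return first, second
```

Calling `demo()` runs the task twice: the first run writes the line, the second finds it already there and leaves the file alone, which is why a full re-run of a well-written playbook tends to be cheap.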
Then would it be possible to write the names of the failed tasks to a file and, with a flag such as --failed-tasks-only, parse that file in the same place where we handle start_at now?
It’s more for development. It’s like unit testing: you write tests, expect some to fail, and re-run only the failed tests until everything is green.
The same thing applies to Ansible tasks. I might have 100 tasks and only 1 or 2 of them failed. Now I have to re-run all 100 tasks just to check whether I have fixed the 2 that failed. It would be awesome if I could run just the 1 or 2 failed tasks to quickly verify the fix during development. We actually spend a lot of time developing these playbook tasks to get them right.
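The proposal above could be sketched roughly as follows. This is plain Python, not Ansible code: `save_failed`, `select_tasks`, the JSON file format and the task names are all assumptions for illustration, and --failed-tasks-only is only the flag suggested in this thread, not an existing option.

```python
import json
import os
import tempfile


def save_failed(results, path):
    """Write the names of failed tasks to `path` as a JSON list.

    `results` is a list of (task_name, status) pairs, where status is
    one of 'ok', 'changed' or 'failed' (a made-up format for this sketch).
    """
    failed = [name for name, status in results if status == "failed"]
    with open(path, "w") as fh:
        json.dump(failed, fh)
    return failed


def select_tasks(all_tasks, path):
    """Return only the tasks recorded as failed, keeping playbook order.

    Falls back to the full task list when no record file exists.
    """
    if not os.path.exists(path):
        return list(all_tasks)
    with open(path) as fh:
        failed = set(json.load(fh))
    return [task for task in all_tasks if task in failed]


def demo():
    """Simulate one run with two failures, then pick tasks for the next run."""
    tasks = ["install nginx", "write config", "restart nginx"]
    results = [
        ("install nginx", "ok"),
        ("write config", "failed"),
        ("restart nginx", "failed"),
    ]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "failed_tasks.json")
        save_failed(results, path)
        return select_tasks(tasks, path)
```

Here `demo()` selects only "write config" and "restart nginx" for the second run. The sketch also shows where the gotcha mentioned earlier comes in: if a selected task needs data produced by a task that was skipped, the selective re-run breaks.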
Well, the problem is that if you just re-run the failed parts, you won’t be validating that the previous steps can run cleanly a second time on top of the existing state, right? In that case running them again makes sense, as it will just go over the server policy and check that everything is up to date.
I understand what you are saying about targeting specific parts of the config, and I do like tagged roles pretty well for that kind of thing.
Some people like --start-at-task, which sounds like it will do what you want: start at that particular point. I don’t use it myself, though.
Right now the retry file doesn’t record which tasks failed and just produces a “--limit @filename.yml” type file. If it did, it might be more straightforward to make this an option, but we’d need something like a --retry-file flag.