Dynamically generated inventory (add_host) and playbook failures

I’m building a playbook for patching our servers, but I keep getting the following errors:

2016-07-14 09:30:29,186 p=55840 u=dimon | PLAY [report] ******************************************************************
2016-07-14 09:30:29,211 p=55840 u=dimon | ERROR! invalid host (somerandomhost1.stanford.edu) specified for playbook iteration
2016-07-14 09:33:06,935 p=65806 u=dimon | [WARNING]: provided hosts list is empty, only localhost is available

Here’s a Gist of the playbook https://gist.github.com/droopy4096/72cabebf90ba76b009d128d7dd82eef2

In a nutshell: I’m using our server inventory DB (Pakiti) to extract the list of registered hosts (I wrote my own module for that). I then walk through those hosts and put the ones that are “alive” according to Pakiti into a “pakiti_hosts” group. Facter facts on each machine identify its patching priority, so I do a round of Facter fact gathering (which fails for some machines), and where it fails I set the patch priority to 0. Finally, I’d like to execute a certain set of commands across all of pakiti_hosts (at present this is merely a template being generated for a report), but the playbook intermittently fails because some hosts are not responding or for other reasons, and I have to re-launch the entire playbook from the start.
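For readers who do not want to open the Gist, the flow described above can be sketched roughly as follows. This is a simplified illustration, not the actual playbook: the module name `pakiti_query`, its return field `host_list`, the per-host keys `name` and `alive`, and the Facter fact name `patch_priority` are all assumptions made up for the example.

```yaml
# Play 1: build the in-memory group from the inventory DB.
- hosts: localhost
  gather_facts: false
  tasks:
    - name: Query Pakiti for registered hosts (hypothetical custom module)
      pakiti_query:
      register: pakiti

    - name: Add hosts that Pakiti reports as alive to pakiti_hosts
      add_host:
        name: "{{ item.name }}"
        groups: pakiti_hosts
      with_items: "{{ pakiti.host_list }}"
      when: item.alive

# Play 2: read the patch priority, defaulting to 0 where Facter fails.
- hosts: pakiti_hosts
  gather_facts: false
  tasks:
    - name: Read the patch priority from Facter (fact name is assumed)
      command: facter patch_priority
      register: facter_out
      changed_when: false
      # Covers task failures only; unreachable hosts are not affected by this.
      ignore_errors: true

    - name: Record the priority, falling back to 0 where Facter failed
      set_fact:
        patch_priority: "{{ facter_out.stdout if (facter_out.rc | default(1)) == 0 else 0 }}"
```

Note that `ignore_errors` on the Facter task only applies to a failing command; a host that cannot be reached at all is still dropped from the rest of the play.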

It seems that Ansible is OK with connection failures and skips over those hosts; however, for some reason some other errors lead to the message above. I’ve tried to work around this by introducing a “blacklist” group in the playbook, but when working with 500+ machines there are always one or two that fail. I’d like to complete the execution and revisit those boxes later. I’ve tried “ignore_errors”, but I’d rather not add it to every block. What are my options?

I realize that I could collect the hosts via dynamic inventory as well, but this way seemed more natural to me, since it leverages Ansible’s own facilities.

In other words: how can I make my playbook more resilient?

Maybe setting a max failure percentage for the play would help?

http://docs.ansible.com/ansible/playbooks_delegation.html#maximum-failure-percentage

I’ve not used maximum-failure-percentage myself so I don’t know how easy it would be to identify the failed hosts to revisit though.
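For what it is worth, a minimal sketch of how that setting might be applied to the reporting play. The play name and the pakiti_hosts group come from the original post; the 10% threshold and the placeholder task are assumptions for illustration only.

```yaml
- name: report
  hosts: pakiti_hosts
  gather_facts: false
  # Abort the play only once more than 10% of the hosts have failed
  # (the threshold has to be exceeded, not merely reached).
  max_fail_percentage: 10
  tasks:
    - name: Placeholder for the real per-host work (hypothetical task)
      command: uptime
      changed_when: false
```

As for finding the failed hosts afterwards: the PLAY RECAP at the end of a run shows failed and unreachable counts per host, and Ansible 2.x can write a .retry file naming the hosts that failed (depending on the retry_files_enabled setting), which can be fed back in with --limit @<file>.retry. Whether --limit works cleanly with hosts created by add_host rather than read from an inventory is something to verify before relying on it.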

Jon

Thanks for the link Jon.

Interestingly enough, the documentation states: “By default, Ansible will continue executing actions as long as there are hosts in the group that have not yet failed.” However, that does not seem to be so in my case, as the playbook aborts on a single failure, which is what is confusing me.