Occasionally, I get SSH timeouts or other network errors in the middle of a play. These cause the task to fail on that host, but not the whole playbook, which carries on without the node that failed. The output looks something like this:
…
2014-11-24 23:54:08,717 p=21193 u=ubuntu | TASK: [common | Install latest pip from PyPi] ****************************
2014-11-24 23:54:12,034 p=21193 u=ubuntu | changed: [box0]
2014-11-24 23:54:09,443 p=21193 u=ubuntu | fatal: [box1] => SSH encountered an unknown error during the connection. We recommend you re-run the command using -vvvv, which will enable SSH debugging output to help diagnose the issue
2014-11-24 23:54:11,001 p=21193 u=ubuntu | changed: [box2]
2014-11-24 23:54:11,278 p=21193 u=ubuntu | changed: [box3]
2014-11-24 23:54:12,305 p=21193 u=ubuntu | changed: [box4]
…
You eventually get this at the bottom of the output:
box0: ok=29 changed=23 unreachable=0 failed=0
box1: ok=18 changed=15 unreachable=1 failed=0
box2: ok=77 changed=61 unreachable=0 failed=0
box3: ok=77 changed=62 unreachable=0 failed=0
box4: ok=77 changed=62 unreachable=0 failed=0
This means that box1 misses everything that should have happened after the failed task, because it was dropped from the play - and from all groups - at that point. As a result, I occasionally end up with one cluster node that doesn't work properly, because it was only partially provisioned.
Is there a way to make this SSH error fatal and stop the whole playbook/Ansible run at that point, the same way an error in a task itself would?
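For comparison, the closest play-level controls I know of are any_errors_fatal and max_fail_percentage, but as far as I can tell both apply to task failures, not to a host becoming unreachable. A minimal sketch of what I mean (cluster is a hypothetical group name standing in for my inventory):

- hosts: cluster          # hypothetical group name
  any_errors_fatal: true  # aborts the run when any host fails a task
  roles:
    - common

What I'm after is something that would abort the whole run the moment box1 drops out with unreachable=1, the same way this aborts on failed=1.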