Determining unreachable vs error when using the raw module

In our environment, we try to be tolerant of unreachable systems in our inventory; with an inventory of our size, we constantly have new ones going MIA. In our deployments, what we really care about is errors on the hosts we CAN reach. To that end, we put in a PR to get ansible(-playbook) to exit with different return codes when hosts had errors versus when hosts were merely unreachable.
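
To make the idea concrete, here is a rough sketch of that kind of exit-code selection. The numeric values, the `exit_code` function, and the shape of the `stats` dict are all assumptions for illustration, not what the PR or ansible-playbook necessarily uses.

```python
# Hypothetical exit codes; the values in the actual PR may differ.
EXIT_OK = 0
EXIT_FAILED = 2
EXIT_UNREACHABLE_ONLY = 3


def exit_code(stats):
    """Pick a process exit code from per-host results.

    `stats` is assumed to map host name -> {"failures": int, "unreachable": int}.
    """
    any_failed = any(s.get("failures", 0) > 0 for s in stats.values())
    any_unreachable = any(s.get("unreachable", 0) > 0 for s in stats.values())
    if any_failed:
        # Real errors on reachable hosts take priority.
        return EXIT_FAILED
    if any_unreachable:
        # Only connectivity problems: let the caller decide whether to continue.
        return EXIT_UNREACHABLE_ONLY
    return EXIT_OK
```

A build flow could then treat the unreachable-only code as a warning and stop only on the failure code.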

Things were great until today, when another engineer put in a change that uses the raw module (the change assumes python is NOT present on the remote system, so raw is used to install it). The change worked fine in a smaller environment where the entire inventory was reachable, but when we hit a larger set with some unreachable hosts, we started seeing failed=1 returns rather than unreachable=1. This made ansible exit with an error, which stopped further jobs in the build flow.

Short term, we've backed out the use of raw. Long term, I wanted to bring this up here to see if there is value in having raw distinguish unreachable hosts from failed ones, or if this is a known issue with no workable solution.

Thoughts?

-jlk

So raw is basically just doing SSH calls without pushing a module.

That said, if you can discern that no connectivity over SSH is achieved, it should be fine to return something different from the exec_command call.

At least right now that function is:

https://github.com/ansible/ansible/blob/devel/lib/ansible/runner/connection_plugins/ssh.py#L237

It seems the easiest solution would be to make UnreachableError a subclass of AnsibleError, so the calling code could catch it specifically and tell an unreachable host apart from any other error.
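
As a minimal sketch of that subclass idea: both classes below are stand-ins defined locally so the example runs on its own, and `classify` and `raiser` are hypothetical helpers, not Ansible functions.

```python
class AnsibleError(Exception):
    """Stand-in for Ansible's base error class, redefined here for the sketch."""


class UnreachableError(AnsibleError):
    """Hypothetical: raised when no SSH connection could be established."""


def classify(run):
    """Call `run` and report 'ok', 'unreachable', or 'failed'."""
    try:
        run()
    except UnreachableError:
        # The more specific handler must come first, since
        # UnreachableError is also an AnsibleError.
        return "unreachable"
    except AnsibleError:
        return "failed"
    return "ok"


def raiser(exc):
    """Hypothetical helper: build a callable that raises `exc`."""
    def run():
        raise exc
    return run
```

Because the new class still derives from AnsibleError, any existing `except AnsibleError` handler keeps working; only callers that care about reachability need the extra clause.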

Stepping through, I can get to line 347. Here we have a return code of 255, but in_data is not True, so we don't raise the AnsibleError. Instead we fall through and return the tuple of return code and stdout/stderr.

My naive thinking is that we'd want to extend this if statement with a way to determine whether we're using the raw module, and treat a return code of 255 from the raw module as a connection error.
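
Roughly, the check I have in mind looks like the sketch below. It is not the real ssh.py code: `check_ssh_result` and the `is_raw` flag are made up for illustration, and `AnsibleError` is a local stand-in. One caveat is that ssh uses 255 for its own errors, but a remote command can also exit 255, so this could misclassify such a command as a connection failure.

```python
SSH_ERROR_RC = 255  # ssh exits with 255 when ssh itself hits an error


class AnsibleError(Exception):
    """Stand-in for Ansible's base error class, redefined here for the sketch."""


def check_ssh_result(returncode, stdout, stderr, in_data=None, is_raw=False):
    """Raise on a likely connection failure, otherwise pass the result through.

    `is_raw` is a hypothetical flag saying the raw module is in use.
    """
    if returncode == SSH_ERROR_RC and (in_data or is_raw):
        # Previously only the in_data case raised; raw fell through
        # and was later counted as failed instead of unreachable.
        raise AnsibleError("SSH connection error (rc=255): %s" % stderr)
    return returncode, stdout, stderr
```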

Does that make sense?

-jlk

I sent https://github.com/ansible/ansible/pull/6711. It may not be quite right, but it seemed to work here in simple tests.

-jlk