SSH connection retry not working on `UNREACHABLE` status?

I have enabled

[ssh_connection]
retries=10

in ansible.cfg

However, the retry does not seem to be applied to the following connection-related error:

PLAY [all] *********************************************************************
TASK [Gathering Facts] *********************************************************
fatal: [<hostname>]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: kex_exchange_identification: Connection closed by remote host\r\nConnection closed by <ip address> port <port>", "unreachable": true}
PLAY RECAP *********************************************************************
<hostname> : ok=0    changed=0    unreachable=1    failed=0    skipped=0    rescued=0    ignored=0   

According to ansible/lib/ansible/plugins/connection/ssh.py (at commit c75624fbdc72a12c85ead6f390015fffd7d58d78) in ansible/ansible on GitHub:

Ansible retries connections only if it gets an SSH error with a return code of 255.

So the error I actually encountered is not an SSH error with a return code of 255?

If that is the case, what can I do to make sure this scenario is also retried?

255 is how ssh itself returns ‘this is a connection error’; any other error will be about execution or authorization, not really the connection.
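To make the "retry only on 255" idea concrete, here is a minimal shell sketch. It is not Ansible's actual implementation, just an illustration of the same rule; `retry_on_255` and `RETRY_DELAY` are hypothetical names made up for this example. It reruns a command only while its exit status is 255 and gives up immediately on any other non-zero status.

```shell
# Hypothetical helper (not part of Ansible or OpenSSH).
# Usage: retry_on_255 MAX_TRIES CMD [ARGS...]
retry_on_255() {
  local max="$1" attempt=1 rc=0
  shift
  while :; do
    rc=0
    "$@" || rc=$?
    # Stop on success, on any status other than 255,
    # or once the allowed number of attempts is used up.
    if [ "$rc" -ne 255 ] || [ "$attempt" -ge "$max" ]; then
      return "$rc"
    fi
    attempt=$((attempt + 1))
    # RETRY_DELAY is an assumed knob for this sketch; defaults to 1 second.
    sleep "${RETRY_DELAY:-1}"
  done
}
```

Something like `retry_on_255 10 ssh <yourHost> true` would then make up to 10 attempts, but only as long as ssh keeps exiting with 255.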

Hi,

so this actual error encountered is not an SSH error with return code of 255?

As @bcoca said, your ssh client returns other error codes for some failures. IIRC, you can see them in verbose mode (-v) with enough ‘v’s; I think you need three (-vvv) to print connection info. The error code should be in there.

Assuming you’re using bash and have an ssh client config close enough to the one Ansible uses, you could also get it from a command like: ssh <yourHost> <yourParameters> 2>/dev/null; echo $?

what can I do to make sure this scenario also got SSH retries?

Now, I’m not sure, or I don’t remember, how to deal with unreachable hosts; one sure thing is that you can’t use block/rescue or failed_when. Perhaps a callback plugin could issue n retries on failed connections that return codes other than 255.

Also, have a look here: Error handling in playbooks (Ansible Documentation)

You can use meta: clear_host_errors to reactivate all hosts, so subsequent tasks can try to reach them again.

So perhaps a looping ansible.builtin.ping task or whatever, with a subsequent meta: clear_host_errors until it works, though again, I don’t see a way to register unreachable hosts.

Or a script you’d run locally as a pre-task that tries to reach your inventory hosts through ssh and outputs return codes you could parse to identify unreachable hosts before Ansible does, and deal with them the way you want?
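A rough sketch of that local pre-check could look like the following. The names `classify_ssh_rc`, `probe_host` and `SSH_CMD` are hypothetical (invented here, and `SSH_CMD` only exists so the ssh binary can be swapped out); `BatchMode` and `ConnectTimeout` are standard OpenSSH client options. It assumes your local ssh config is close enough to what Ansible uses, so treat the output as a hint rather than a guarantee.

```shell
# Hypothetical helper: label an ssh exit status.
# Per ssh(1), ssh exits with the remote command's status,
# or with 255 if an error occurred.
classify_ssh_rc() {
  case "$1" in
    0)   echo "ok" ;;
    255) echo "ssh-error" ;;
    *)   echo "remote-command-exit" ;;
  esac
}

# Hypothetical helper: run a trivial remote command and report the status.
# SSH_CMD is an assumption of this sketch; it defaults to plain "ssh".
probe_host() {
  local host="$1" rc=0
  "${SSH_CMD:-ssh}" -o BatchMode=yes -o ConnectTimeout=5 "$host" true \
    >/dev/null 2>&1 || rc=$?
  printf '%s rc=%s (%s)\n' "$host" "$rc" "$(classify_ssh_rc "$rc")"
}
```

You could then loop over your hosts, for example `for h in <host1> <host2>; do probe_host "$h"; done`, and parse the `rc=` field to decide which hosts to skip or retry before running the playbook.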

Just some thoughts; there might be an easier solution.

That’s not “unreachable” in the ssh sense. Ssh reached, and the remote rejected your key. It’s unreachable in the ansible sense, in that ansible shouldn’t expect any better results in the next 9 attempts.

As I mentioned above, authentication errors are not considered for retries. In most setups retrying an account with bad auth will end up locking it.