Ansible started giving "SSH Error: data could not be sent to remote host" in several playbooks and tasks

I have more than 42 Ansible playbooks, all of which used to run fine, and nothing has changed. SSH connectivity to all the hosts is also good.

Since last week we have been getting the error 'SSH Error: data could not be sent to remote host "remotehost2". Make sure this host can be reached over ssh' in random playbooks at random tasks.

Here is a sample debug output. The task runs in a loop: one iteration fails with the above error, and the next iteration in the same loop works for the same host.

Debug Output:

TASK [replace] *****************************************************************
task path: /web/playbooks/automation/misc/passupdate/replaceallpswd.yml:2
Using module file /usr/lib/python2.7/site-packages/ansible/modules/files/replace.py
ESTABLISH SSH CONNECTION FOR USER: wladmin
SSH: EXEC ssh -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=wladmin -o ConnectTimeout=900 -o StrictHostKeyChecking=no -o ConnectionAttempts=5 remotehost2 '/bin/sh -c '"'"'/usr/bin/python && sleep 0'"'"''
failed: [remotehost2] (item=user_2021_2376) => {
    "item": "user_2021_2376",
    "msg": "SSH Error: data could not be sent to remote host \"remotehost2\". Make sure this host can be reached over ssh",
    "unreachable": true
}
Using module file /usr/lib/python2.7/site-packages/ansible/modules/files/replace.py
ESTABLISH SSH CONNECTION FOR USER: wladmin
SSH: EXEC ssh -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=wladmin -o ConnectTimeout=900 -o StrictHostKeyChecking=no -o ConnectionAttempts=5 remotehost2 '/bin/sh -c '"'"'/usr/bin/python && sleep 0'"'"''
(0, '\n{"msg": "1 replacements made", "invocation": {"module_args": {"directory_mode": null, "force": null, "encoding": "utf-8", "replace": "user_2021_2416", "path": "/web/bea_apps/applications/configurations/application-prod.yml", "owner": null, "follow": false, "before": null, "group": null, "unsafe_writes": null, "remote_src": null, "setype": null, "content": null, "serole": null, "selevel": null, "after": null, "regexp": "DbVKtZ_Feb20", "validate": null, "src": null, "seuser": null, "delimiter": null, "mode": null, "attributes": null, "backup": true}}, "changed": true, "backup_file": "/web/bea_apps/applications/configurations/application-prod.yml.1161.2021-02-12@00:52:12~"}\n', '\nThis system is for the use by authorized users only. All data contained\non all systems is owned by the company and may be monitored, intercepted,\nrecorded, read, copied, or captured in any manner and disclosed in any\nmanner, by authorized company personnel. Users (authorized or unauthorized)\nhave no explicit or implicit expectation of privacy. Unauthorized or improper\nuse of this system may result in administrative, disciplinary action, civil\nand criminal penalties. Use of this system by any user, authorized or\nunauthorized, constitutes express consent to this monitoring, interception,\nrecording, reading, copying, or capturing and disclosure.\n\nIF YOU DO NOT CONSENT, LOG OFF NOW.\n\n##################################################################\n# *** This Server is using Centrify *** #\n# *** Remember to use your Active Directory account *** #\n# *** password when logging in *** #\n##################################################################\n\n')
changed: [remotehost2] => (item=user_2021_2416) => {
    "backup_file": "/web/bea_apps/applications/configurations/application-prod.yml.1161.2021-02-12@00:52:12~",
    "changed": true,
    "invocation": {
        "module_args": {
            "after": null,
            "attributes": null,
            "backup": true,
            "before": null,
            "content": null,
            "delimiter": null,
            "directory_mode": null,
            "encoding": "utf-8",
            "follow": false,
            "force": null,
            "group": null,
            "mode": null,
            "owner": null,
            "path": "/web/bea_apps/applications/configurations/application-prod.yml",
            "regexp": "DbVKtZ",
            "remote_src": null,
            "replace": "user_2021_2416",
            "selevel": null,
            "serole": null,
            "setype": null,
            "seuser": null,
            "src": null,
            "unsafe_writes": null,
            "validate": null
        }
    },
    "item": "user_2021_2416",
    "msg": "1 replacements made"
}
Using module file /usr/lib/python2.7/site-packages/ansible/modules/files/replace.py

Below is part of the playbook code:

---
- replace:
    path: "{{ outer_item }}"
    regexp: "{{ vars[item] }}"
    replace: "{{ item }}"
    backup: yes
  retries: 3
  async: 1000
  poll: 10
  with_items: "{{ new_pass.split() }}"

I have tried the following but no luck.

  1. $ export ANSIBLE_SCP_IF_SSH=y
  2. retries: 3
  3. async: 1000 and poll: 10

Ansible Host system details:

$ free -m
              total        used        free      shared  buff/cache   available
Mem:           3780        2339         537         112         903        1038
Swap:          2047        1004        1043

$ top
top - 03:21:38 up 26 days, 15:04,  4 users,  load average: 0.27, 0.39, 0.39
Tasks: 302 total,   1 running, 301 sleeping,   0 stopped,   0 zombie
%Cpu(s):  7.2 us,  1.5 sy,  0.0 ni, 91.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  3871084 total,   548764 free,  2396476 used,   925844 buff/cache
KiB Swap:  2097148 total,  1068068 free,  1029080 used.  1062532 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
23130 ansblusr  20   0  304720  37380   1976 S   5.0  1.0 297:44.22 ansible-playboo
…. ….

I feel the problem is not with the code but with something else.

I need a solution that treats the code as correct and addresses issues beyond the code, such as SSH settings or system limitations.

What does the auth log on the failed hosts say?
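If it helps, here is a minimal sketch for pulling the relevant sshd entries out of the auth log on the failing host. The function name sshd_lines_for_user is made up for illustration, and the log path is an assumption: it is typically /var/log/secure on RHEL-like systems and /var/log/auth.log on Debian-like ones.

```shell
#!/bin/sh
# Print the most recent sshd log lines that mention a given login user.
# Both arguments are supplied by the caller; nothing here is specific
# to the poster's environment.
sshd_lines_for_user() {
    log_file=$1   # e.g. /var/log/secure (assumed location)
    user=$2       # e.g. wladmin
    # Keep only sshd lines, then only those naming the user, newest 50.
    grep 'sshd' "$log_file" | grep -- "$user" | tail -n 50
}
```

For example, as root on remotehost2: sshd_lines_for_user /var/log/secure wladmin. Compare the timestamps with the time of the failed task to see whether sshd logged a rejected or dropped connection at that moment.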

It seems you still use python2.7, which is EOL:

/usr/lib/python2.7/site-packages/ansible/modules/files/replace.py

Upgrading Python might make the problem go away, but no guarantees, of course.

Not sure if you have found the answer, but you might try a couple of things to help investigate. It seems to me that something is not "awake" during the first transaction but then wakes up and begins processing.

Does your company have leased lines that go dormant when not in use?
Does it always fail with user_2021_2376 or is it random?
Is it the first item in the loop every time, or can it happen during later iterations?
Does it only occur once, or does it happen more often?
Do the servers have multiple network interfaces?
Can you run tcpdump or similar on both boxes to verify that network traffic is reaching them and being replied to on the correct interface and route?
Looking at the ssh command, does it have all the correct options?
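For the packet capture question, a rough sketch, assuming tcpdump is installed and you have root on both boxes. The helper names ssh_capture_filter and capture_ssh are hypothetical, and port 22 is an assumption about where sshd listens.

```shell
#!/bin/sh
# Build a capture filter for SSH traffic to or from one peer.
# The port defaults to 22; pass a second argument if sshd listens elsewhere.
ssh_capture_filter() {
    printf 'host %s and tcp port %s' "$1" "${2:-22}"
}

# Capture on all interfaces (-i any, Linux only), skip address and port
# name conversion (-nn), and write raw packets to a file for later review.
capture_ssh() {
    peer=$1
    out_file=$2
    tcpdump -i any -nn -w "$out_file" "$(ssh_capture_filter "$peer")"
}
```

Run capture_ssh on the Ansible host with the remote host as the peer, and on the remote host with the Ansible host as the peer, then reproduce the failure and stop both captures with Ctrl-C. Reading the files back with tcpdump -nn -r shows whether the packets arrived and whether replies left at all.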

Have you tried sending the same ssh command manually from the cli?

ssh \
  -o KbdInteractiveAuthentication=no \
  -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey \
  -o PasswordAuthentication=no \
  -o User=wladmin \
  -o ConnectTimeout=900 \
  -o StrictHostKeyChecking=no \
  -o ConnectionAttempts=5 \
  remotehost2 \
  '/bin/sh -c '"'"'/usr/bin/python && sleep 0'"'"''

You can replace this:
'/bin/sh -c '"'"'/usr/bin/python && sleep 0'"'"''
with something simpler, like hostname, then run that manually and see if the failure occurs again. Also add -v (up to -vvvv) to ssh for debugging.

ssh -vvvv -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=wladmin -o ConnectTimeout=900 -o StrictHostKeyChecking=no -o ConnectionAttempts=5 remotehost2 hostname
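Since the failure is intermittent, a single manual run may well succeed. A small sketch that repeats a command and counts the failures could help here; count_failures is a hypothetical helper name, not part of ssh or Ansible.

```shell
#!/bin/sh
# Run a command a given number of times and print how many runs failed.
# Each failure is reported on stderr with a timestamp and the exit status
# (ssh itself exits with 255 when the connection fails).
count_failures() {
    runs=$1
    shift
    failed=0
    i=1
    while [ "$i" -le "$runs" ]; do
        "$@" >/dev/null 2>&1
        rc=$?
        if [ "$rc" -ne 0 ]; then
            failed=$((failed + 1))
            echo "attempt $i failed with status $rc at $(date '+%H:%M:%S')" >&2
        fi
        i=$((i + 1))
    done
    echo "$failed"
}
```

For example: count_failures 50 ssh -o PasswordAuthentication=no -o User=wladmin -o ConnectTimeout=900 remotehost2 hostname. If some attempts fail outside Ansible as well, that points at the network or at sshd on the remote side rather than at the playbooks.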

Having worked in network ops for 20+ years, I find that two things cause most problems:

  1. patching/upgrades or config changes.
  2. hardware failures

You may have a network hardware issue, but even though you may not have changed anything yourself, I would review any change controls from the morning before you noticed these failures.
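If there is no formal change log to consult, one rough way to spot config changes is to list files that were modified recently under a configuration directory. This is only a sketch: recently_changed is a hypothetical helper, and the directory and time window are whatever you choose to pass in.

```shell
#!/bin/sh
# List regular files under a directory that were modified within the
# last N days (find's -mtime -N means "less than N*24 hours ago").
recently_changed() {
    dir=$1
    days=$2
    find "$dir" -type f -mtime "-$days" -print
}
# On RPM-based systems (an assumption about these hosts),
# "rpm -qa --last | head" lists the most recently installed or
# updated packages, which covers the patching side.
```

For example, recently_changed /etc/ssh 14 on both the Ansible host and remotehost2 would show any SSH client or server config touched in the two weeks around the first failures.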

Hope some of this helps. Good luck.