Inconsistent WinRM deployments

Hi. I am getting inconsistent results when running playbooks against Windows servers. I can run a playbook on a server and it succeeds. When I run the same playbook on the same server again 10-15 minutes later, it fails. This happens on all my Windows servers and with all playbooks. If I come back later, the playbook runs successfully again. Sometimes a reboot helps, sometimes it doesn't.


Error
task path: /runner/project/win_reboot.yaml:6

fatal: [srv2]: UNREACHABLE! => {"changed": false, "elapsed": 0, "msg": "plaintext: HTTPConnectionPool(host='srv2', port=5985): Max retries exceeded with url: /wsman (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x77b5990103a0>: Failed to establish a new connection: [Errno -2] Name or service not known'))", "rebooted": false, "unreachable": true}


  • I am using local administrator credentials and WinRM over HTTP. All servers are on the same network/subnet.

The "Name or service not known" error suggests you have intermittent DNS resolution issues. Are you specifying the hosts' IP addresses in the inventory, or hostnames?
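If you want to test the intermittent DNS theory directly, here is a minimal Python sketch, assuming Python 3 on the control node. It resolves a name repeatedly through `socket.getaddrinfo` (the same call that raises "[Errno -2] Name or service not known") and tallies successes and failures. `count_resolution_failures` is a hypothetical helper for illustration, not part of Ansible or pywinrm.

```python
import socket
import time
from typing import Dict, List, Union


def count_resolution_failures(name: str, attempts: int = 20,
                              interval: float = 1.0) -> Dict[str, Union[int, List[str]]]:
    """Resolve `name` repeatedly and count how often resolution fails.

    A DNS problem that comes and goes shows up as a mix of "ok" and "failed".
    """
    results: Dict[str, Union[int, List[str]]] = {"ok": 0, "failed": 0, "errors": []}
    for i in range(attempts):
        try:
            # Port 5985 is only used to build the lookup; no connection is made.
            socket.getaddrinfo(name, 5985, proto=socket.IPPROTO_TCP)
            results["ok"] += 1
        except socket.gaierror as exc:
            # gaierror is what produces "Name or service not known" (errno -2).
            results["failed"] += 1
            results["errors"].append(str(exc))
        if i < attempts - 1:
            time.sleep(interval)
    return results


# Example (hostname taken from the error above):
# print(count_resolution_failures("srv2", attempts=10, interval=2.0))
```

Running it a few times across the window where your playbooks start failing should show whether name resolution itself is flapping.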

Thank you for the response. I am using hostnames, not IP addresses.

I just tried 2 different servers by their IP addresses and got this error:

fatal: [192.168.1.177]: UNREACHABLE! => {"changed": false, "elapsed": 0, "msg": "ssl: HTTPSConnectionPool(host='192.168.1.177', port=5986): Max retries exceeded with url: /wsman (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7039b0d2f0a0>: Failed to establish a new connection: [Errno 111] Connection refused'))", "rebooted": false, "unreachable": true}
PLAY RECAP *********************************************************************
192.168.1.177 : ok=0 changed=0 unreachable=1 failed=0 skipped=0 rescued=0 ignored=0

Well, now you have a different error message: Connection refused. Either those two servers do not have WinRM configured, or a firewall is blocking you.

Yep, those are the 2 error messages I get. They alternate between the two errors, with an occasional successful run. I have all firewalls turned off. I am not quite sure how to troubleshoot this. Any ideas?

Both are network-related. The first concerns connectivity to the DNS resolvers, the second connectivity to the hosts themselves. Could your control node be experiencing network connectivity issues?

I'd suggest checking the logs on your control node for any evidence of a failing network interface. Setting up some monitoring from the control node to the hosts, such as ping and TCP port 5986 checks, can give you more insight into the problem. Intermittent connectivity issues can, for example, be caused by MTU limitations. I'm shooting in the dark here, though.
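As a starting point for that kind of monitoring, here is a minimal Python sketch, assuming Python 3 on the control node. It attempts a plain TCP connection to the WinRM ports (5985 for HTTP, 5986 for HTTPS) on each host at a fixed interval and prints one timestamped line per probe, so failures can be lined up against playbook runs. `tcp_port_open` and `monitor` are hypothetical helpers, and the interval and round count are arbitrary defaults.

```python
import socket
import time
from datetime import datetime
from typing import Iterable, Tuple


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> Tuple[bool, str]:
    """Try a plain TCP connect and return (success, reason)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except socket.gaierror as exc:
        # Name resolution failed before any TCP packet was sent.
        return False, f"DNS failure: {exc}"
    except ConnectionRefusedError:
        # The host answered, but nothing accepted the connection on that port.
        return False, "connection refused"
    except socket.timeout:
        # No answer at all within `timeout` seconds.
        return False, "timed out"
    except OSError as exc:
        return False, f"other network error: {exc}"


def monitor(hosts: Iterable[str], ports: Tuple[int, ...] = (5985, 5986),
            rounds: int = 10, interval: float = 30.0) -> None:
    """Probe every host/port pair each round and print a timestamped result."""
    hosts = list(hosts)
    for i in range(rounds):
        for host in hosts:
            for port in ports:
                ok, reason = tcp_port_open(host, port)
                stamp = datetime.now().isoformat(timespec="seconds")
                status = "OK  " if ok else "FAIL"
                print(f"{stamp} {host}:{port} {status} {reason}")
        if i < rounds - 1:
            time.sleep(interval)


# Example: monitor(["srv2", "192.168.1.177"], rounds=120, interval=30.0)
```

Probing both the short hostname and the IP address of the same server helps separate DNS failures from TCP-level failures.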

You also appear to have inconsistent connection settings on the Ansible side of things.

In your first error, "Name or service not known" does imply DNS issues, but I also noticed that you are using WinRM over HTTP (port 5985) with a hostname. In your second error, you're getting connection refused for WinRM over HTTPS (port 5986) with an IPv4 address.

The first issue may be intermittent DNS problems, but the second could be related to WinRM security or TLS. Are your hosts even configured for WinRM over HTTPS? That requires at least self-signed certificates to set up, and then you would also need to make sure Ansible is configured to ignore untrusted certificates. If your hosts use certificates signed by a root CA, you could install the CA and point your inventory connection settings to trust that CA for WinRM. Then, for Ansible to validate the certificates, you would typically need to connect using the host's FQDN rather than its IP address.

Which leads me back to the first error. Is srv2 the actual hostname? It isn't an FQDN. Windows can use NetBIOS to resolve neighboring hosts on the local network by short hostname, but Linux doesn't really do NetBIOS. For Linux to resolve short hostnames through DNS, the hosts would need to be found in a search domain configured on the control node.
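To check whether the short name is the problem, you could resolve the short name and the FQDN side by side from the control node. This minimal Python sketch assumes Python 3; `resolve` and `compare_name_forms` are hypothetical helpers, and the domain in the example comment is a placeholder, since the thread does not mention the real one.

```python
import socket
from typing import Dict, List, Optional


def resolve(name: str) -> Optional[List[str]]:
    """Return the sorted unique addresses `name` resolves to, or None on failure."""
    try:
        infos = socket.getaddrinfo(name, None, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return None
    # info[4] is the sockaddr tuple; its first element is the address string.
    return sorted({info[4][0] for info in infos})


def compare_name_forms(short_name: str, fqdn: str) -> Dict[str, Optional[List[str]]]:
    """Resolve a short hostname and its FQDN as seen from this machine.

    If only the FQDN resolves, the short name is not being completed by a
    search domain on the control node.
    """
    return {short_name: resolve(short_name), fqdn: resolve(fqdn)}


# Example (placeholder domain):
# print(compare_name_forms("srv2", "srv2.corp.example.com"))
```

If the FQDN resolves consistently and the short name does not, switching the inventory to FQDNs (or adding the domain to the control node's resolver search list) is the obvious next step.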

Edit: I forgot to mention that on the Windows side, WinRM listeners for HTTP and HTTPS are configured separately. So even if you set up both HTTP and HTTPS, the listeners may not both be configured to accept connections from your Ansible control node.


Good observation, although I doubt the second case is TLS-related. A TLS session can only be negotiated over an already established TCP connection. "Connection refused" means the attempt to establish the TCP connection failed, so a TLS handshake was never attempted.
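The ordering can be made visible with a minimal Python sketch, assuming Python 3 and the standard `ssl` module. It reports the first stage at which a connection to port 5986 fails: name resolution, then TCP connect, then TLS handshake. Certificate verification is disabled only so that an untrusted or self-signed certificate does not hide whether the handshake itself works. `probe_stages` is a hypothetical helper, not something Ansible or pywinrm provides.

```python
import socket
import ssl


def probe_stages(host: str, port: int = 5986, timeout: float = 5.0) -> str:
    """Return the first stage at which a WinRM-over-HTTPS connection fails.

    Stages happen strictly in order: DNS, TCP connect, TLS handshake.
    A "tcp_refused" result means the TLS stage was never reached.
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except socket.gaierror:
        return "dns"
    except ConnectionRefusedError:
        return "tcp_refused"
    except socket.timeout:
        return "tcp_timeout"
    except OSError:
        return "tcp_other"

    # TCP is up; only now can a TLS handshake even start.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False          # must be disabled before CERT_NONE
    ctx.verify_mode = ssl.CERT_NONE     # ignore self-signed/untrusted certs
    try:
        with ctx.wrap_socket(sock, server_hostname=host):
            return "tls_ok"
    except OSError:
        # ssl.SSLError and handshake timeouts are both OSError subclasses.
        return "tls_failed"
    finally:
        sock.close()  # no-op if wrap_socket already took over the socket
```

Against the host from the second error, a result of `tcp_refused` would confirm that nothing accepted the TCP connection on 5986, independent of any certificate setup.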

On the other hand, @Jason_Longley says that each host sometimes works and sometimes doesn't, so one can only assume intermittent network connectivity issues.