Windows playbooks cause cryptographic and Windows Remote Management services to crash

Since upgrading to Ansible 2.0, my Windows playbooks have been failing with the following error. The error has been seen when running setup, win_template, and script tasks. The easiest way to reproduce it is to have multiple simultaneous runs of Ansible affecting the same host. If we re-run the exact same playbook after a failure, it almost always succeeds.

Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/ansible/plugins/connection/winrm.py", line 240, in exec_command
    result = self._winrm_exec(cmd_parts[0], cmd_parts[1:], from_exec=True)
  File "/usr/lib/python2.6/site-packages/ansible/plugins/connection/winrm.py", line 208, in _winrm_exec
    self.protocol.cleanup_command(self.shell_id, command_id)
  File "/usr/lib/python2.6/site-packages/awx/lib/site-packages/winrm/protocol.py", line 290, in cleanup_command
    rs = self.send_message(xmltodict.unparse(rq))
  File "/usr/lib/python2.6/site-packages/awx/lib/site-packages/winrm/protocol.py", line 193, in send_message
    return self.transport.send_message(message)
  File "/usr/lib/python2.6/site-packages/awx/lib/site-packages/winrm/transport.py", line 136, in send_message
    raise WinRMTransportError('http', error_message)
WinRMTransportError: 500 WinRMTransport. Bad HTTP response returned from server. Code 500
fatal: [hostname]: FAILED! => {"failed": true, "msg": "failed to exec cmd PowerShell -NoProfile -NonInteractive -ExecutionPolicy Unrestricted -EncodedCommand reallylongencodedcommand=="}

If we capture the TCP traffic on the Windows side, we see the SYN packets arriving, so the packets are reaching the Windows box and the issue isn’t at the network level. If we run netstat while the playbook is running, we see a bunch of connections, then suddenly no listeners for a bit, and then the listeners come back. In the Windows event log, the times when netstat shows no listeners match up perfectly with crashes of cryptographic services, DNS client services, workstation service, network location service, and Windows Remote Management. After the services crash, Windows restarts them automatically and the Ansible playbooks start working again.

We’ve been having this issue on Windows Server 2012 boxes with 8 GB RAM and 4 CPUs. We’ve been able to reproduce it on a completely vanilla Server 2012 box (no antivirus or other third-party software installed). I’m at a complete loss on how to fix this.

Has anyone else seen this behavior? I haven’t found anything similar in the issue tracker or in google searches.

Forgot to mention: we’ve also experimented with increasing the winrm MaxConcurrentUsers, MaxProcessesPerShell, and MaxShellsPerUser settings but haven’t seen any difference in behavior.

winrm get winrm/config/winrs

Winrs
    AllowRemoteShellAccess = true
    IdleTimeout = 7200000
    MaxConcurrentUsers = 30
    MaxShellRunTime = 2147483647
    MaxProcessesPerShell = 25
    MaxMemoryPerShellMB = 1024
    MaxShellsPerUser = 30
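When experimenting with these quotas, it can help to record the settings before and after each change so you know exactly what was different between runs. Below is a minimal sketch that parses the text output shown above into a dictionary and reports which values differ. The function names `parse_winrs_config` and `changed_settings` are made up for illustration, and the sample text is simply the output pasted above; the sketch assumes the output keeps the `Name = value` layout.

```python
def parse_winrs_config(text):
    """Parse 'Name = value' lines from winrm config output into a dict."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if not sep:
            continue  # skip blank lines and headers such as "Winrs"
        key, value = key.strip(), value.strip()
        if value.lower() in ("true", "false"):
            settings[key] = value.lower() == "true"
        elif value.lstrip("-").isdigit():
            settings[key] = int(value)
        else:
            settings[key] = value
    return settings


def changed_settings(before, after):
    """Return {name: (old, new)} for every setting whose value differs."""
    return {
        name: (before.get(name), after.get(name))
        for name in sorted(set(before) | set(after))
        if before.get(name) != after.get(name)
    }


# Sample input: the output quoted in this post.
SAMPLE = """\
Winrs
    AllowRemoteShellAccess = true
    IdleTimeout = 7200000
    MaxConcurrentUsers = 30
    MaxShellRunTime = 2147483647
    MaxProcessesPerShell = 25
    MaxMemoryPerShellMB = 1024
    MaxShellsPerUser = 30
"""

config = parse_winrs_config(SAMPLE)

# Hypothetical follow-up experiment: pretend one quota was raised.
experiment = dict(config, MaxShellsPerUser=60)
diff = changed_settings(config, experiment)
print(diff)
```

Saving the parsed dictionary alongside each test run makes it easier to rule a setting in or out when the failure is intermittent.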

I’ve not seen this myself, and I have been running 2.0.0.2 against our herd of Windows Server 2012 boxes for months.

Did you upgrade pywinrm to 0.2.0 by any chance?

Also, I spotted this bug report, which sounds similar to your case - https://github.com/ansible/ansible/issues/16873 - although the stack trace is not failing at the same point, so it could be something different.

Jon

No, I haven’t upgraded pywinrm. Running 0.1.1.

pip show pywinrm

DEPRECATION: Python 2.6 is no longer supported by the Python core team, please upgrade your Python. A future version of pip will drop support for Python 2.6

It might be worth trying pywinrm 0.2.0, even if it’s just because it’s much quicker than 0.1.1.

However, I don’t think that by itself will fix your problem.

Looking again, the machines I’m running are actually S2012 R2, not S2012, and are mostly 2 CPU, 4 GB virtual machines.

If yours are S2012, not S2012 R2, it’s worth checking the PowerShell and WMF versions. WMF 3.0 had a bug that meant it would fail to run anything but the most trivial winrm commands. If that is what you have, upgrading to WMF 4.0 / PowerShell 4.0 is thoroughly recommended.

Jon

Thanks I’ll give 0.2.0 a try.

We are running S2012R2 and powershell version 4.

PS C:\Windows\system32> $PSVersionTable.PSVersion

Major  Minor  Build  Revision
-----  -----  -----  --------
4      0      -1     -1

We are looking at powershell 5 for other reasons. Have you tried it out with ansible?

Michael

I have run the ansible integration tests against a Server 2016 Tech Preview 5 build, which runs Powershell 5 and WMF 5.0.

The only issue I have encountered so far is with uninstalling Windows features. It seems there’s a new version of the cmdlet that uninstalls features, and it seems to fail without an interactive user (not tested thoroughly yet, so I could be wrong about the interactive user).

Jon

I’ve heard one other report of this happening a few weeks ago (but it came via Ansible support and I didn’t know who the customer was; maybe it was also you?).

The services in question share the winrm host process, so not surprising that they’re the ones going down.

pywinrm 0.2.0 could definitely help some with this, as the HTTP(S) connections are reused for the various winrm calls within a task, whereas 0.1.1 and lower open a new connection for every winrm operation.
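To make the difference concrete, here is a rough sketch using only the Python standard library. It is not pywinrm's code, just an illustration of the two connection patterns: a throwaway local HTTP server counts how many TCP connections it accepts while a client performs five requests, once on a single kept-alive connection and once with a fresh connection per request. The names `CountingHandler` and `count_connections`, the `/wsman` path, and the request body are placeholders chosen for the example.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class CountingHandler(BaseHTTPRequestHandler):
    """Replies 'ok' to every POST and counts accepted TCP connections."""

    protocol_version = "HTTP/1.1"  # lets the server keep connections alive
    connections = 0
    lock = threading.Lock()

    def setup(self):
        # setup() runs once per accepted connection, not once per request.
        super().setup()
        with CountingHandler.lock:
            CountingHandler.connections += 1

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def count_connections(reuse, operations=5):
    """Send `operations` requests and return how many connections were opened."""
    CountingHandler.connections = 0
    server = ThreadingHTTPServer(("127.0.0.1", 0), CountingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    conn = None
    try:
        for _ in range(operations):
            if conn is None:
                conn = http.client.HTTPConnection(host, port, timeout=5)
            conn.request("POST", "/wsman", body=b"<op/>")
            conn.getresponse().read()
            if not reuse:
                conn.close()  # old pattern: one connection per operation
                conn = None
        if conn is not None:
            conn.close()
    finally:
        server.shutdown()
        server.server_close()
    return CountingHandler.connections


reused = count_connections(reuse=True)
fresh = count_connections(reuse=False)
print("connections with reuse:", reused)
print("connections without reuse:", fresh)
```

With several playbooks hitting the same host at once, the per-operation pattern multiplies the number of connections the winrm service has to accept and tear down, which is why reuse might reduce the pressure on it.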

Let us know if it keeps up. It’d definitely be a Microsoft issue (the winrm service shouldn’t crash. Ever.), but we might be able to short-circuit the official support loop and get you in touch with the right folks directly.

-Matt

Reading this again I realise that running multiple playbooks against my windows hosts simultaneously is something I do not do very often, so my experience may not apply.

I hope pywinrm 0.2.0 turns out to fix this for you.

Jon

Since upgrading to 0.2.0, it hasn’t occurred, but this issue has been fairly hard to reproduce consistently outside of our production environment.

Michael