WinRM issue with domain user - some machines work and some don't

Hi.

We have a few Windows Server 2008 R2 machines that we would like to manage with the WinRM module.
Some of these similar machines work and some don't. I compared the OS build, the PowerShell build and even the local security policy, and the result is still the same.
We use Kerberos and winbind on the controller machine, and since the WinRM module works with the domain username for Windows 2012 and for some of the 2008 R2 machines, I am guessing the issue is not on the controller.

I thought it was because it uses the ticket of the LDAP user I logged into the controller machine with, but I am a member of the Administrators group on the target machine and it still doesn't work.
If I create a local user and put it in the Administrators group, WinRM works.

Here is a machine that works:

WINRM RESULT <Response code 0, out “C:\Users\deploy_rn\A”, err “”>
PUT /tmp/tmpe8SQvn TO C:\Users\deploy_rn\AppData\Local\Temp\ansible-tmp-1441020926.8-178247757458762\win_ping
WINRM PUT /tmp/tmpe8SQvn to C:\Users\deploy_rn\AppData\Local\Temp\ansible-tmp-1441020926.8-178247757458762\win_ping.ps1 (offset=0 size=2035)
WINRM PUT /tmp/tmpe8SQvn to C:\Users\deploy_rn\AppData\Local\Temp\ansible-tmp-1441020926.8-178247757458762\win_ping.ps1 (offset=2035 size=2035)
WINRM PUT /tmp/tmpe8SQvn to C:\Users\deploy_rn\AppData\Local\Temp\ansible-tmp-1441020926.8-178247757458762\win_ping.ps1 (offset=4070 size=2035)
WINRM PUT /tmp/tmpe8SQvn to C:\Users\deploy_rn\AppData\Local\Temp\ansible-tmp-1441020926.8-178247757458762\win_ping.ps1 (offset=6105 size=602)
PUT /tmp/tmpsiY4YG TO C:\Users\deploy_rn\AppData\Local\Temp\ansible-tmp-1441020926.8-178247757458762\arguments
WINRM PUT /tmp/tmpsiY4YG to C:\Users\deploy_rn\AppData\Local\Temp\ansible-tmp-1441020926.8-178247757458762\arguments (offset=0 size=2)
EXEC PowerShell -NoProfile -NonInteractive -ExecutionPolicy Unrestricted -File C:\Users\deploy_rn\AppData\Local\Temp\ansible-tmp-1441020926.8-178247757458762\win_ping.ps1 C:\Users\deploy_rn\AppData\Local\Temp\ansible-tmp-1441020926.8-178247757458762\arguments; Remove-Item "C:\Users\deploy_rn\AppData\Local\Temp\ansible-tmp-1441020926.8-178247757458762" -Force -Recurse;
WINRM EXEC ‘PowerShell’ [‘-NoProfile’, ‘-NonInteractive’, ‘-EncodedCommand’, ‘UABvAHcAZQByAFMAaABlAGwAbAAgAC0ATgBvAFAAcgBvAGYAaQBsAGUAIAAtAE4AbwBuAEkAbgB0AGUAcgBhAGMAdABpAHYAZQAgAC0ARQB4AGUAYwB1AHQAaQBvAG4AUABvAGwAaQBjAHkAIABVAG4AcgBlAHMAdAByAGkAYwB0AGUAZAAgAC0ARgBpAGwAZQAgAEMAOgBcAFUAcwBlAHIAcwBcAGQAZQBwAGwAbwB5AF8AcgBuAFwAQQBwAHAARABhAHQAYQBcAEwAbwBjAGEAbABcAFQAZQBtAHAAXABhAG4AcwBpAGIAbABlAC0AdABtAHAALQAxADQANAAxADAAMgAwADkAMgA2AC4AOAAtADEANwA4ADIANAA3ADcANQA3ADQANQA4ADcANgAyAFwAXAB3AGkAbgBfAHAAaQBuAGcALgBwAHMAMQAgAEMAOgBcAFUAcwBlAHIAcwBcAGQAZQBwAGwAbwB5AF8AcgBuAFwAQQBwAHAARABhAHQAYQBcAEwAbwBjAGEAbABcAFQAZQBtAHAAXABhAG4AcwBpAGIAbABlAC0AdABtAHAALQAxADQANAAxADAAMgAwADkAMgA2AC4AOAAtADEANwA4ADIANAA3ADcANQA3ADQANQA4ADcANgAyAFwAXABhAHIAZwB1AG0AZQBuAHQAcwA7ACAAUgBlAG0AbwB2AGUALQBJAHQAZQBtACAAIgBDADoAXABVAHMAZQByAHMAXABkAGUAcABsAG8AeQBfAHIAbgBcAEEAcABwAEQAYQB0AGEAXABMAG8AYwBhAGwAXABUAGUAbQBwAFwAYQBuAHMAaQBiAGwAZQAtAHQAbQBwAC0AMQA0ADQAMQAwADIAMAA5ADIANgAuADgALQAxADcAOAAyADQANwA3ADUANwA0ADUAOAA3ADYAMgBcACIAIAAtAEYAbwByAGMAZQAgAC0AUgBlAGMAdQByAHMAZQA7AA==’]
WINRM RESULT <Response code 0, out “{ “changed”: f”, err “”>
rnpl-qa1-bes01 | success >> {
    "changed": false,
    "ping": "pong"
}
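
For anyone who wants to see what the long -EncodedCommand string in the WINRM EXEC line contains: PowerShell expects that argument to be Base64 over UTF-16LE text, so it can be decoded offline. This is only a minimal sketch; the function names are made up for illustration and are not part of Ansible or pywinrm.

```python
import base64


def decode_encoded_command(encoded: str) -> str:
    """Decode a PowerShell -EncodedCommand value (Base64 over UTF-16LE text)."""
    return base64.b64decode(encoded).decode("utf-16-le")


def encode_command(command: str) -> str:
    """Inverse helper: build an -EncodedCommand value from a command string."""
    return base64.b64encode(command.encode("utf-16-le")).decode("ascii")


# Round trip on a short, harmless command (hypothetical example input).
sample = encode_command("Write-Host hello")
print(decode_encoded_command(sample))
```

Pasting the full string from the log into decode_encoded_command should give back the same PowerShell command line that the plain EXEC line shows just above it.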

Here is one that doesn't work:

ESTABLISH WINRM CONNECTION FOR USER: on PORT 5986 TO rnpl-qa1-sts01
ESTABLISH WINRM CONNECTION FOR USER: on PORT 5986 TO rnpl-qa1-sts02
WINRM CONNECT: transport=kerberos endpoint=https://rnpl-qa1-sts01:5986/wsman
WINRM CONNECT: transport=kerberos endpoint=https://rnpl-qa1-sts02:5986/wsman
rnpl-qa1-sts01 | FAILED => the username/password specified for this server was incorrect
rnpl-qa1-sts02 | FAILED => the username/password specified for this server was incorrect

As soon as I remove the @DOMAIN from the hosts file and use a local username, WinRM works.
I am probably missing something silly but I can't find it.
Thanks

More information:
I get this on both the server that doesn't work and the one that does.

winrm get winrm/config/client/auth
Auth
Basic = true
Digest = true
Kerberos = true
Negotiate = true
Certificate = true
CredSSP = false

This is in the Event Viewer:

User authentication using Basic authentication scheme failed.

Additional Data
Unexpected error received from LogonUser 1326: %%1326.

Event ID 10111.

And you’re sure the hosts are included in the Ansible hosts file?

Hi
The servers are of course in the hosts file.

OK, some updates on this, but first some information:

Domain controller : 172.16.10.6
Ansible controller - 172.16.19.1
server that works (STS03) - 172.16.19.41

server that DOESN'T work (STS01) - 172.16.1.114

Now if I try with a domain username to access STS03 (that works) from Ansible, it is all good.
If I try with a domain username to access STS01 (doesn't work) from Ansible, I get "server not found in kerberos database" and "username is incorrect".

Now if I take the server that doesn't work and move it to the same network (172.16.19.42), next to the server that works, everything works on both servers.

As soon as it is in another VLAN, the domain username doesn't work any longer (a local username on the machine works anywhere).

So I suspected it might be something on the DC (in the firewall I have ANY to ANY on all 4 servers: DC, Ansible, STS01 & STS03).

I ran Wireshark on the DC and ran Ansible against both servers:

When Ansible runs against the server INSIDE the network (STS03) I see this:
172.16.10.6 172.16.19.41 TCP 66 kerberos > 55200 [SYN, ACK] Seq=0 Ack=1 Win=8192 Len=0 MSS=1460 WS=256 SACK_PERM=1

172.16.10.6 172.16.19.41 TCP 54 kerberos > 55200 [RST, ACK] Seq=1441 Ack=1419 Win=0 Len=0

So it seems that the DC is working directly against the destination server.

BUT if I run the same WinRM connection against the server in another VLAN I see this:
172.16.10.6 172.16.12.71 KRB5 176 KRB Error: KRB5KDC_ERR_S_PRINCIPAL_UNKNOWN

172.16.10.6 172.16.12.71 TCP 54 kerberos > 60772 [RST, ACK] Seq=111 Ack=1441 Win=0 Len=0

It seems that when the destination server is in another VLAN, Kerberos is checked against the controller machine and not the destination server.

Could I be on to something?

Eyal, just a thought: could you try replacing the IP addresses in your hosts file with the actual server FQDNs (sts03.domain.com) and see if that helps?

Trond - thanks for the tip.
It actually helped, because using tcpdump we saw that we had more than one PTR record for the server.
As soon as we fixed that, WinRM worked.
Again, it was weird since it did work in the same network but not across VLANs.

Good. Kerberos relies on service principal names (which in turn rely on name resolution), so you need a working DNS infrastructure for Kerberos to work correctly.
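
Since the root cause here was duplicate PTR records, a quick consistency check of forward and reverse lookups from the controller can save some packet capturing. The sketch below is only an illustration: check_name, default_forward and default_reverse are hypothetical helper names, and whether socket.gethostbyaddr exposes every PTR record as an alias depends on the resolver, so treat a clean result as a hint rather than proof.

```python
import socket
from typing import Callable, List


def default_forward(name: str) -> List[str]:
    """Return the IPv4 addresses the system resolver gives for name."""
    return socket.gethostbyname_ex(name)[2]


def default_reverse(ip: str) -> List[str]:
    """Return the primary name plus aliases from a reverse lookup of ip."""
    primary, aliases, _addresses = socket.gethostbyaddr(ip)
    return [primary] + list(aliases)


def check_name(
    name: str,
    forward: Callable[[str], List[str]] = default_forward,
    reverse: Callable[[str], List[str]] = default_reverse,
) -> List[str]:
    """Return a list of problems found when resolving name forward and back."""
    problems = []
    wanted = name.lower().rstrip(".")
    for ip in forward(name):
        names = [n.lower().rstrip(".") for n in reverse(ip)]
        if len(names) > 1:
            # More than one name came back, e.g. from duplicate PTR records.
            problems.append(f"{ip} reverse-resolves to several names: {names}")
        if wanted not in names:
            problems.append(f"{ip} does not reverse-resolve to {wanted}: {names}")
    return problems


# Example call (needs working DNS; the host name is a placeholder):
# print(check_name("sts01.example.com"))
```

The resolver functions are passed in as parameters so the check can be tried with canned answers before pointing it at real DNS.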

Hi again.
Well, after removing all the extra PTR records the servers were good to go.
I started deploying Ansible on production servers and here I have exactly the same issue, but this time DNS and name resolution are correct.
A local user on the machine works perfectly.
A domain user produces the "FAILED => the username/password specified for this server was incorrect" error message.

Are there any logs or additional errors I can check?
Thanks

Could you test regular ps remoting from another domain-joined windows node against the problematic servers to see if that works?

Yes, that works.
I had no problem with a PS remote session from one machine to another.
Even with the domain username and password I was able to connect.
This happens on all the Windows machines in the domain.
Besides running the PowerShell script that prepares the machine for Ansible, I tried adding the security permission for the user, but it still doesn't work.

WinRM is ready, since I am able to connect with a local username that is in the Administrators group.

I already got a few Windows machines to work with the domain username, so I am probably just missing something.
This is from a machine that works:

PS C:\Users\TEMP.JAJAH> winrm get winrm/config/service
Service
RootSDDL = O:NSG:BAD:P(A;;GA;;;BA)(A;;GWGR;;;S-1-5-21-1738876665-1027346198-3318579073-26131)(A;;GR;;;IU)S:P(AU;FA;GA;;;WD)(AU;SA;GXGW;;;WD)
MaxConcurrentOperations = 4294967295
MaxConcurrentOperationsPerUser = 1500
EnumerationTimeoutms = 240000
MaxConnections = 300
MaxPacketRetrievalTimeSeconds = 120
AllowUnencrypted = true
Auth
Basic = true
Kerberos = true
Negotiate = true
Certificate = false
CredSSP = false
CbtHardeningLevel = Relaxed
DefaultPorts
HTTP = 5985
HTTPS = 5986
IPv4Filter = *
IPv6Filter = *
EnableCompatibilityHttpListener = false
EnableCompatibilityHttpsListener = false
CertificateThumbprint
AllowRemoteAccess = true

This is from a machine that doesn't work:

Service
RootSDDL = O:NSG:BAD:P(A;;GA;;;BA)(A;;GR;;;IU)S:P(AU;FA;GA;;;WD)(AU;SA;GXGW;;;WD)
MaxConcurrentOperations = 4294967295
MaxConcurrentOperationsPerUser = 1500
EnumerationTimeoutms = 240000
MaxConnections = 300
MaxPacketRetrievalTimeSeconds = 120
AllowUnencrypted = false
Auth
Basic = true
Kerberos = true
Negotiate = true
Certificate = false
CredSSP = false
CbtHardeningLevel = Relaxed
DefaultPorts
HTTP = 5985
HTTPS = 5986
IPv4Filter = *
IPv6Filter = *
EnableCompatibilityHttpListener = false
EnableCompatibilityHttpsListener = false
CertificateThumbprint = eb 9b 2d f2 a5 89 03 f2 e2 ca 0e 8a 35 32 39 08 c5 a8 42 d7
AllowRemoteAccess = true

This line differs:
AllowUnencrypted = false

That setting basically dictates whether you're allowed to use basic auth over non-encrypted comms.
Again, from another Windows node, could you check that you're able to connect to the problematic server using basic auth? Try both with and without the -UseSSL parameter and compare to a working node. I suspect you will find some differences.
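
If comparing the `winrm get winrm/config/service` output of two nodes by eye gets tedious, a small script can do it. This is a rough sketch only: it assumes the flat "Key = value" layout as pasted in this thread (indentation lost, so section names such as Auth are treated as plain keys), and parse_winrm_config and diff_settings are made-up helper names. The two sample strings are shortened excerpts of the outputs above.

```python
from typing import Dict, Optional, Tuple


def parse_winrm_config(text: str) -> Dict[str, str]:
    """Parse flat 'Key = value' lines; lines without '=' get an empty value."""
    settings: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        key, sep, value = line.partition("=")
        settings[key.strip()] = value.strip() if sep else ""
    return settings


def diff_settings(
    a: Dict[str, str], b: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {key: (value_in_a, value_in_b)} for every key that differs."""
    diffs = {}
    for key in sorted(set(a) | set(b)):
        left, right = a.get(key), b.get(key)
        if left != right:
            diffs[key] = (left, right)
    return diffs


# Shortened excerpts of the two outputs posted above.
WORKING = """Service
AllowUnencrypted = true
Auth
Basic = true
Kerberos = true
"""
FAILING = """Service
AllowUnencrypted = false
Auth
Basic = true
Kerberos = true
"""

print(diff_settings(parse_winrm_config(WORKING), parse_winrm_config(FAILING)))
```

Run against the complete outputs, it would also list RootSDDL and CertificateThumbprint, since those lines are not identical on the two machines either.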

In general we advise using 5986 (SSL) with Ansible.