Weird Kerberos Issues with WinRM and a new host spun up from vmware_guest

Hi Ansible Community. I’ve been struggling with an issue I’ve actually posted about here before. It’s more of an annoyance than anything but I’d really love to get past it, as I’m trying to demo Infrastructure-as-code to my org.

I have a playbook that spins up a new VM using vmware_guest and then adds the new host to a staging group. The playbook’s machine credential is a user that is a member of the Domain Admins group.

The weird part: the first time the playbook runs, the first task after the vmware_guest module that actually connects to the new host fails with a Kerberos error:

fatal: [webserver.internal.domain.com]: UNREACHABLE! => {"changed": false, "msg": "kerberos: the specified credentials were rejected by the server", "unreachable": true}

The even weirder part: If I go to run the playbook again, it will actually perform a few of the plays it got the above error on, but then it will again fail a few plays in. By the 3rd or 4th run, eventually I can run the playbook in its entirety without error.

I’ve done a LOT of troubleshooting on this and I can’t seem to figure out why it’s not working on the first play.

Here are some things I’ve checked:

  • I am able to RDP into the server with the same domain admin credentials the playbook is using, right around the time we get the Kerberos failure in Ansible
  • I am able to Enter-PSSession at the same time the playbook gets the Kerberos failure
  • I’ve confirmed that the SPN for WSMan is on the computer object in AD and replicated at the time the Kerberos issue happens
  • I’ve checked the RootSDDL and the plugin SDDLs on the Win2016 template I’m using (with winrm e winrm/config/plugin -format:pretty and winrm get winrm/config); the BUILTIN\Administrators group seems to have full access in the RootSDDL and the PowerShell plugins
  • I’ve confirmed that both a forward and a reverse DNS entry exist in the internal.domain.com DNS zone for Kerberos
  • I’ve checked that krb5.conf on the Tower machine has rdns set to false
  • I’ve confirmed that time is in sync between the new host, the Tower host, and the domain controllers
  • I’ve reviewed the GPOs affecting the new host and ruled out any settings that may interfere with Kerberos/WinRM

I did notice that krb5.conf on my Tower box is configured for IDM.internal.domain.com, whereas my domain is actually just internal.domain.com. That is because my Linux team is in the process of getting centralized auth going with IdM. I’m not sure if that has something to do with it, but auth does seem to work once the ‘weird’ issues above go away.

If anyone has any other ideas, they would be greatly appreciated.

Did you configure WinRM for CredSSP?

runonce:

[win]
SERVER_IP

[win:vars]
ansible_user=".\Administrator"
ansible_password=
ansible_connection=winrm
ansible_winrm_transport=credssp
ansible_winrm_server_cert_validation=ignore

Just to be clear, are you joining the host to the domain as part of the vmware_guest call?

I have playbooks that do something similar to what you describe but with some differences.
I like to drive everything from inventory, so I add the host details to (static) inventory and then run the playbook with the vmware_guest task delegated to localhost. This means I don’t have to use add_host and can clone multiple VMs in parallel (if I am feeling lucky/patient).
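For anyone following along, an inventory-driven clone along those lines might look roughly like the sketch below. The variable names (vcenter_hostname, vm_template and so on) and the use of the staging group as the play target are placeholders for illustration, not taken from anyone’s actual playbook, and only the most basic vmware_guest parameters are shown.

```yaml
# Sketch only: every "{{ ... }}" variable here is a hypothetical placeholder.
- name: Clone the VMs listed in static inventory
  hosts: staging
  gather_facts: false          # the targets do not exist yet, so do not connect to them
  tasks:
    - name: Clone the VM from a template (runs on the control node, not the target)
      vmware_guest:
        hostname: "{{ vcenter_hostname }}"
        username: "{{ vcenter_username }}"
        password: "{{ vcenter_password }}"
        datacenter: "{{ vcenter_datacenter }}"
        folder: "{{ vm_folder }}"
        name: "{{ inventory_hostname_short }}"
        template: "{{ vm_template }}"
        state: poweredon
        wait_for_ip_address: true
      delegate_to: localhost
```

Because the play targets the inventory hosts and only this one task is delegated, the task runs once per host, which is what lets several clones proceed in parallel.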

After vmware_guest has completed I put in a fairly huge wait (600 seconds, IIRC), then I do a wait_for_connection, again with a long timeout, I think around 600 seconds.

The domain join seems to take a long time, and I think there is a reboot of the target involved. I have definitely seen the WinRM service ‘jitter’, i.e. start and then become unavailable for a while before becoming available again as other services come up on startup. Hence the big long wait and then polling with wait_for_connection before attempting to run the main playbook content.

It’s not infallible: sometimes the host fails to respond before wait_for_connection times out, depending on what else is going on in vSphere. But I think you might be experiencing the WinRM ‘jitter’, so adding a wait and then polling until WinRM becomes available might get you to the point where you can at least set it running and let the playbook run through.
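The wait-then-poll approach described above can be sketched roughly as follows. The numbers are the approximate ones quoted from memory in this post, not tuned values, and the sketch assumes the tasks run in a play with gather_facts: false so nothing touches the new host before the polling task does.

```yaml
# Sketch only: timings are the rough figures from the post; adjust to taste.
- name: Give guest customization, the domain join and the reboot time to finish
  pause:
    seconds: 600

- name: Poll until the WinRM connection to the new host actually works
  wait_for_connection:
    delay: 0              # start polling immediately after the pause
    sleep: 10             # seconds between attempts
    connect_timeout: 10   # per-attempt connection timeout
    timeout: 600          # give up after this many seconds in total
```

Unlike a plain port check, wait_for_connection goes through the configured connection plugin, so it only succeeds once the host accepts a real WinRM session with the play’s credentials.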

Hope this helps,

Jon

Thanks David - I’ve been trying to use Kerberos, and it should be enabled. I’m only connecting as a domain admin so Kerberos should work (or so I gather):

Auth
    Basic = false
    Kerberos = true
    Negotiate = true
    Certificate = false
    CredSSP = false
    CbtHardeningLevel = Relaxed

Yep - I am joining the domain as part of the customization in vmware_guest. I do that locally from the Tower box. After vmware_guest I have a wait_for on port 5985 with a 360-second timeout. I was trying to avoid the 600-second sleep, but if it works, it works.
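One thing worth noting about that setup: wait_for on a port only confirms that something is listening on TCP 5985, not that WinRM will accept a Kerberos-authenticated session yet. A minimal sketch of keeping the port check and following it with a connection-level check is below. The timeout and sleep values on the second task are assumptions, and whether this closes the gap in this particular environment is untested.

```yaml
# Sketch only: the first task mirrors the existing port check described above.
- name: Wait for the WinRM HTTP port to open (TCP check only, run from the control node)
  wait_for:
    host: "{{ inventory_hostname }}"
    port: 5985
    timeout: 360
  delegate_to: localhost

- name: Then wait until an authenticated WinRM session to the host succeeds
  wait_for_connection:
    sleep: 15       # assumed polling interval
    timeout: 600    # assumed upper bound
```

If the second task regularly succeeds within a minute or two of the port opening, the long fixed sleep may not be needed at all.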

Thanks for the insight, glad to know someone else is seeing something similar. :)