Windows winrm/kerberos fails intermittently in same playbook/relaunches (RESOLVED)

A basic Ansible Automation Platform install on an Azure VM runs Kerberos-authenticated WinRM calls using win_shell, and they fail intermittently with ‘server not found in kerberos database’.

---
- name: Test Powershell Executions
  hosts: all
  ignore_unreachable: true
  gather_facts: false
  tasks:
  - name: WHOAMI
    ansible.windows.win_shell: "whoami"
    ignore_errors: true

dd-04 failed in the above image. A relaunch of the same template a moment later worked fine, and another run a minute later failed again. I can’t find a pattern.

Our network guy has done packet captures and isn’t seeing any errors. I can run the same test 100 times and it will be completely random if there is a failure, which host fails, and which task fails. The same host on different tasks in the same playbook will pass/fail/pass/pass/fail.

We’ve tried flushing caches, checking SPNs, and rebuilding machines. Things work great for a bit, then the intermittent failures start again with: kerberos: authGSSClientStep() failed: (('Unspecified GSS failure. Minor code may provide more information', 851968), ('Server not found in kerberos database', -1765328377))", "unreachable": true. Any help in how to further troubleshoot this error would be appreciated. I don’t know where to look.

99% of the time rerunning will work on the failed step, but then fail on another. :sob:

RESOLUTION
On a whim we were testing krb5.conf settings and set ticket_lifetime = 30m and renew_lifetime = 1h. Ran kdestroy and now every run works. Something was funky with the tickets, where sometimes a run would pass and sometimes it would fail. Won’t pretend to understand why, but leaving this here in case anyone else sees the same symptoms.

The error here is returned by the domain controller used during Kerberos authentication and means that the requested target principal does not exist. During Kerberos auth, Ansible builds the Service Principal Name (SPN) from the requested hostname and the default service name of http. In this case, when it requests the service ticket for authentication, it asks for http/uat-coavd-dd-04.foo.com. If no principal in the Kerberos database matches that SPN, the domain controller returns the error you see here.
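
As a rough illustration of the SPN construction described above, here is a minimal Python sketch. The function name build_spn is hypothetical and is not part of Ansible, pywinrm, or pypsrp; the real clients may also normalise case or append a realm, which this sketch does not attempt.

```python
def build_spn(hostname: str, service: str = "http") -> str:
    """Join a service class and a hostname into the 'service/host' SPN form.

    Hypothetical helper for illustration only; real Kerberos clients may
    apply extra normalisation that is not modelled here.
    """
    return f"{service}/{hostname}"


if __name__ == "__main__":
    # The hostname is the one from the failing run in this thread.
    print(build_spn("uat-coavd-dd-04.foo.com"))
    # A client configured for a different service class would ask for this instead.
    print(build_spn("uat-coavd-dd-04.foo.com", service="WSMAN"))
```

If the domain controller that receives the request has no principal matching the resulting string, it answers with 'Server not found in kerberos database'.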

Some common reasons why I’ve seen this error pop up:

  • You are connecting with an IP address and hosts by default don’t have an SPN registered for IP addresses
  • You are connecting to a newly domain joined host and the domain controller used during authentication has not had that host replicated to it

The latter is typically the cause here as the host used for the domain join may be different than the one selected by the Kerberos client. The KRB5_TRACE=/dev/stdout env var can be used to print out trace logs from the Kerberos client library and includes things like the DC it tried to contact.
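
One way to capture that trace outside of Ansible is to run a Kerberos command with KRB5_TRACE set. The sketch below is an assumption-laden example rather than a recommended procedure: the helper names are hypothetical, it assumes MIT Kerberos tools are installed and a TGT already exists (for example from kinit), and it uses kvno only as a convenient way to request a service ticket.

```python
import os
import shutil
import subprocess
import sys


def with_krb5_trace(env=None, target="/dev/stdout"):
    """Return a copy of the environment with KRB5_TRACE pointing at target."""
    traced = dict(os.environ if env is None else env)
    traced["KRB5_TRACE"] = target
    return traced


def run_traced(command):
    """Run command with KRB5_TRACE set and capture its output as text."""
    return subprocess.run(
        command,
        env=with_krb5_trace(),
        capture_output=True,
        text=True,
        check=False,
    )


if __name__ == "__main__":
    # Assumption: kvno (MIT Kerberos) is on PATH and a TGT is already cached.
    if shutil.which("kvno") is None:
        print("kvno not found; install the MIT Kerberos client tools", file=sys.stderr)
    else:
        result = run_traced(["kvno", "http/uat-coavd-dd-04.foo.com"])
        # The trace lines include which KDC was contacted for the request.
        print(result.stdout)
        print(result.stderr, file=sys.stderr)
```

Comparing the KDC named in the trace across passing and failing runs can show whether the failures line up with one particular domain controller.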

Ended up being something with the kerberos tickets. Changed ticket/renew lifetime and things started getting much better, more info in main message. If things go south again we’ll update but this is the first time it’s worked repeatedly/reliably.

Thank you for the recommended steps.

This put me on the right track. I have two hosts in my lab, a domain controller and a domain member. Connecting to the DC gave the ‘Server not found in Kerberos database’ error, while the member server worked fine.
When comparing SPNs for both machines I saw that the DC didn’t have a WSMAN/… SPN, whereas the member server did. After adding a WSMAN SPN for the DC everything works fine.

BTW: both machines have no HTTP/… SPN

Just as an FYI: to try and align the behaviour with the native client, the default in pypsrp will be changing from WSMAN to HTTP with jborean93/pypsrp Pull Request #213, “Change default Negotiate service to HTTP”. This should improve compatibility with the defaults on Windows and avoid having to manually set SPNs or change the service used in the client.

Does this imply that machines that have a WSMAN SPN but no HTTP SPN will fail in the near future? Or will WSMAN still be tried when the HTTP SPN isn’t there?

Yes, it will technically fail if there is no HTTP SPN registered, and it won’t fall back to WSMAN. But keep in mind that while WSMAN/* is an SPN that is explicitly registered to a principal, HTTP is one that is implicitly there for “HOST” registrations. We can see that the servicePrincipalName for a computer principal contains an explicit entry for WSMAN but nothing for HTTP.

(Get-ADComputer TESTHOST -Property servicePrincipalName).servicePrincipalName

WSMAN/testhost
WSMAN/testhost.domain.test
TERMSRV/TESTHOST
TERMSRV/testhost.domain.test
RestrictedKrbHost/TESTHOST
HOST/TESTHOST
RestrictedKrbHost/testhost.domain.test
HOST/testhost.domain.test

But for any HOST/* entries, the domain will also automatically apply the following mappings to the principal:

$dse = Get-ADRootDSE
$mappings = Get-ADObject -Identity "CN=Directory Service,CN=Windows NT,CN=Services,$($dse.ConfigurationNamingContext)" -Properties sPNMappings
$mappings.sPNMappings

# host=alerter,appmgmt,cisvc,clipsrv,browser,dhcp,dnscache,replicator,eventlog,eventsystem,policyagent,oakley,dmserver,dns,mcsvc,fax,msiserver,ias,messenger,netlogon,netman,netdde,netddedsm,nmagent,plugplay,protectedstorage,rasman,rpclocator,rpc,rpcss,remoteaccess,rsvp,samss,scardsvr,scesrv,seclogon,scm,dcom,cifs,spooler,snmp,schedule,tapisrv,trksvr,trkwks,ups,time,wins,www,http,w3svc,iisadmin,msdtc

We can see that http is part of this list, so it will automatically be part of a host principal, whereas WSMAN needs to be registered specifically by the service. This is the cause of your problem: some hosts had WSMAN, but other hosts, like a domain controller, did not.
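
To make the alias lookup concrete, here is a small Python sketch that parses an sPNMappings value and checks whether a service class is implicitly covered by HOST. The function names are hypothetical, the sample string is a shortened copy of the list above, and the sketch only models the mapping lookup, not how a domain controller actually resolves principals.

```python
def parse_spn_mappings(values):
    """Parse 'target=alias1,alias2,...' strings into {target: {aliases}}."""
    mappings = {}
    for value in values:
        target, _, aliases = value.partition("=")
        mappings[target.strip().lower()] = {
            alias.strip().lower() for alias in aliases.split(",") if alias.strip()
        }
    return mappings


def is_implicit_alias(service, mappings, target="host"):
    """Return True if service is listed as an alias of target (HOST by default)."""
    return service.lower() in mappings.get(target.lower(), set())


if __name__ == "__main__":
    # Shortened sample of the sPNMappings value shown earlier in this reply.
    sample = ["host=alerter,appmgmt,cifs,spooler,time,wins,www,http,w3svc,iisadmin,msdtc"]
    mappings = parse_spn_mappings(sample)
    for service in ("HTTP", "WSMAN"):
        print(service, is_implicit_alias(service, mappings))
```

With this sample, HTTP is found among the HOST aliases and WSMAN is not, which matches the behaviour described above: WSMAN only works when it has been registered explicitly.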

The pywinrm library used by the winrm connection plugin uses HTTP, which is why I changed the default to that in pypsrp.

I might revisit this before the next release because after tracing the normal WinRM client in PowerShell I’ve found that Kerberos actually uses the HOST SPN and not HTTP

# .\Trace-Process.ps1 -ProcessId $pwshClientProcessId -Metadata @{Secur32='Initialize*'} -OutputFormat Yaml

- Function: Secur32.dll!InitializeSecurityContextW
  Time: 2026-01-13T04:58:12.0801232+10:00
  ThreadId: 2584
  Arguments:
    Credential: 0x262E51DC088 - PCredHandle
    Context: 0x00000000 - PCtxtHandle
    TargetName:
      Raw: 0x262E51DC0AC - Pointer
      Value: HOST/test.domain.test  # <<<<----- Shows PowerShell uses HOST
    ContextReq: 0x00000017 - ISC_REQ_DELEGATE, ISC_REQ_MUTUAL_AUTH, ISC_REQ_REPLAY_DETECT, ISC_REQ_CONFIDENTIALITY
    Reserved1: 0
    TargetDataRep: 0x00000010 - SECURITY_NATIVE_DREP
    Input:
      Raw: 0x00000000 - PSecBufferDesc
    Reserved2: 0
    NewContext: 0x262E51DC098 - PCtxtHandle
    Output:
      Raw: 0x347BE0E910 - PSecBufferDesc
      Version: 0
      Count: 1
      BufferPtr: 0x347BE0E8E8 - PSecBuffer
      Buffers:
      - Type: 0x00000002 - SECBUFFER_TOKEN
        Flags: 0x00000000 - SECBUFFER_NONE
        Size: 48256
        Raw: 0x262E5B32070 - Pointer
        Data: ""
    ContextAttr: 0x347BE0E8C0 - Pointer
    Expiry: 0x347BE0E900 - PTimeStamp

For consistency with pywinrm, using HTTP makes sense, but using HOST seems to be more correct and is what the native client does. It also ensures that hosts where the HTTP SPN was registered to another principal wouldn’t have WinRM fail with Kerberos auth, as the HOST principal should always be the host.

:light_bulb: That explains why Enter-PSSession <ComputerName> -Authentication Kerberos just works without the WSMAN and/or HTTP SPNs.

The HTTP one is auto mapped to the HOST/* entry so that’ll always be there but WSMAN is the one that needs the explicit entry and the reason why I’m trying to move away from it. It should technically be there but I’ve seen principals without it (domain controllers).