Windows winrm/kerberos fails intermittently in same playbook/relaunches (RESOLVED)

A basic Ansible Automation Platform install on an Azure VM, running Kerberos-authenticated WinRM calls via win_shell, fails intermittently with ‘server not found in kerberos database’. Test playbook:

---
- name: Test Powershell Executions
  hosts: all
  ignore_unreachable: true
  gather_facts: false
  tasks:
  - name: WHOAMI
    ansible.windows.win_shell: "whoami"
    ignore_errors: true

dd-04 failed in the above image. A relaunch of the same template a moment later worked fine, and another run a minute later failed again. I can’t find a pattern.

Our network guy has done packet captures and isn’t seeing any errors. I can run the same test 100 times and it will be completely random if there is a failure, which host fails, and which task fails. The same host on different tasks in the same playbook will pass/fail/pass/pass/fail.

We’ve tried flushing caches, checking SPNs, and rebuilding machines. Things work great for a bit, then the intermittent failures start again:

kerberos: authGSSClientStep() failed: (('Unspecified GSS failure. Minor code may provide more information', 851968), ('Server not found in kerberos database', -1765328377))", "unreachable": true

Any help on how to troubleshoot this error further would be appreciated. I don’t know where to look.

99% of the time rerunning will work on the failed step, but then fail on another. :sob:

RESOLUTION
On a whim we were testing krb5.conf settings and set ticket_lifetime = 30m and renew_lifetime = 1h. We ran kdestroy, and now every run works. Something was funky with the tickets, so that sometimes a run would pass and sometimes it would fail. I won’t pretend to understand why, but I’m leaving this here in case anyone else sees the same symptoms.

This error is returned by the domain controller used during Kerberos authentication, and it means that the requested target does not exist. In Kerberos auth, Ansible builds the Service Principal Name (SPN) from the requested hostname with the default service name of http. In this case, when it requests the service ticket for authentication, it requests it for http/uat-coavd-dd-04.foo.com. If no principal in the Kerberos database matches that SPN, the domain controller returns the error you see here.
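The SPN construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code Ansible or its WinRM libraries actually run: the helper names build_spn and is_ip_address are hypothetical, and the service name default of http is taken from the explanation above.

```python
import ipaddress


def is_ip_address(host: str) -> bool:
    """Return True if the inventory host is a literal IPv4/IPv6 address."""
    try:
        ipaddress.ip_address(host)
    except ValueError:
        return False
    return True


def build_spn(host: str, service: str = "http") -> str:
    """Hypothetical helper: join the service name and host into an SPN string."""
    return f"{service}/{host}"


if __name__ == "__main__":
    # Hostname from this thread, plus a made-up IP address for comparison.
    for host in ["uat-coavd-dd-04.foo.com", "10.0.0.4"]:
        note = ""
        if is_ip_address(host):
            # Hosts normally have no SPN registered for their IP address,
            # so a lookup for this principal is likely to fail.
            note = "  (IP address: SPN probably not registered)"
        print(build_spn(host) + note)
```

Running a check like this over the inventory host names is a quick way to spot entries whose SPN the domain controller is unlikely to know about.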

Some common reasons why I’ve seen this error pop up:

  • You are connecting with an IP address, and hosts by default don’t have an SPN registered for IP addresses
  • You are connecting to a newly domain-joined host, and the domain controller used during authentication has not yet had that host replicated to it

The latter is typically the cause here, as the domain controller used for the domain join may be different from the one selected by the Kerberos client. The KRB5_TRACE=/dev/stdout env var can be used to print trace logs from the Kerberos client library, which include things like the DC it tried to contact.
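One way to try the KRB5_TRACE suggestion outside of Ansible is to request the service ticket directly with tracing enabled. The sketch below assumes the MIT Kerberos kvno tool is installed on the controller and that a TGT has already been obtained with kinit; the helper names trace_env and request_service_ticket are hypothetical.

```python
import os
import subprocess


def trace_env(base_env=None, trace_target="/dev/stdout"):
    """Return a copy of the environment with KRB5_TRACE pointing at trace_target."""
    env = dict(os.environ if base_env is None else base_env)
    env["KRB5_TRACE"] = trace_target
    return env


def request_service_ticket(spn):
    """Run kvno for the SPN with tracing on and return the completed process.

    Assumes kvno (MIT Kerberos) is on PATH and a valid TGT is in the cache.
    The trace output is written to stdout, so it is captured with the
    command's own output.
    """
    return subprocess.run(
        ["kvno", spn],
        env=trace_env(),
        capture_output=True,
        text=True,
        check=False,  # a failed lookup is the interesting case, so do not raise
    )


if __name__ == "__main__":
    result = request_service_ticket("http/uat-coavd-dd-04.foo.com")
    print(result.stdout)
    print(result.stderr)
```

Repeating the request several times and comparing which DC each trace mentions may help show whether the failures line up with one particular domain controller.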

It ended up being something with the Kerberos tickets. We changed the ticket and renew lifetimes and things got much better; more info is in the main message. If things go south again we’ll update, but this is the first time it’s worked repeatedly and reliably.

Thank you for the recommended steps.