Windows winrm/kerberos fails intermittently in same playbook/relaunches (RESOLVED)

idlebyte · January 16, 2025, 9:46pm

A basic Ansible Automation Platform install on an Azure VM, running Kerberos-authenticated WinRM calls via win_shell, fails intermittently with ‘server not found in kerberos database’.

---
- name: Test Powershell Executions
  hosts: all
  ignore_unreachable: true
  gather_facts: false
  tasks:
  - name: WHOAMI
    ansible.windows.win_shell: "whoami"
    ignore_errors: true

In one run dd-04 failed; a relaunch of the same template a moment later worked fine, and another run a minute later failed again. I can’t find a pattern.

Our network guy has done packet captures and isn’t seeing any errors. I can run the same test 100 times and it will be completely random if there is a failure, which host fails, and which task fails. The same host on different tasks in the same playbook will pass/fail/pass/pass/fail.

We’ve tried flushing caches, checking SPNs, and rebuilding machines. Things work great for a bit, then the intermittent failures start again: kerberos: authGSSClientStep() failed: (('Unspecified GSS failure. Minor code may provide more information', 851968), ('Server not found in kerberos database', -1765328377))", "unreachable": true. Any pointers on how to troubleshoot this error further would be appreciated. I don’t know where to look.

99% of the time rerunning will work on the failed step, but then fail on another.

RESOLUTION
On a whim we were testing krb5.conf settings and set ticket_lifetime = 30m and renew_lifetime = 1h. We ran kdestroy and now every run works. Something was off with the tickets, so that sometimes a run would pass and sometimes it would fail. I won’t pretend to understand why, but I’m leaving this here in case anyone else sees the same symptoms.

jborean · January 16, 2025, 10:11pm

This error is returned by the domain controller during Kerberos authentication when the requested target does not exist. During Kerberos auth, Ansible builds the Service Principal Name (SPN) from the requested hostname with the default service name of http. In this case, when it requests the service ticket for authentication, it requests it for http/uat-coavd-dd-04.foo.com. If no principal in the Kerberos database matches that name, the domain controller returns the error you see here.
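
To make the SPN construction above concrete, here is a minimal Python sketch. It is only an illustration, not Ansible's actual code: the helper names `expected_spn` and `is_ip_literal` are hypothetical, and `10.0.0.4` is a made-up address. It builds the principal a client would ask for and flags IP literals, which normally have no SPN registered.

```python
import ipaddress


def is_ip_literal(host: str) -> bool:
    """Return True if host is an IPv4/IPv6 literal rather than a DNS name."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def expected_spn(host: str, service: str = "http") -> str:
    """Build the service/host principal a Kerberos client would request.

    Hypothetical helper for illustration; not part of Ansible or pywinrm.
    """
    return f"{service}/{host}"


# The second entry is an invented example address.
for host in ["uat-coavd-dd-04.foo.com", "10.0.0.4"]:
    note = " (IP literal: no SPN registered by default)" if is_ip_literal(host) else ""
    print(expected_spn(host) + note)
```

Comparing the printed principal with what is actually registered for the host in the directory is a quick way to rule out a naming mismatch.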

Some common reasons why I’ve seen this error pop up:

You are connecting with an IP address and hosts by default don’t have an SPN registered for IP addresses
You are connecting to a newly domain joined host and the domain controller used during authentication has not had that host replicated to it

The latter is typically the cause here, as the domain controller used for the domain join may differ from the one selected by the Kerberos client. The KRB5_TRACE=/dev/stdout env var can be used to print trace logs from the Kerberos client library, which include things like the DC it tried to contact.
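
If it is easier to drive this from a script than from a shell, the following Python sketch shows one way to launch a command with KRB5_TRACE set. The helper names `with_krb5_trace` and `run_traced` are hypothetical, and the child process here is only a stand-in that echoes the variable; it does not perform any Kerberos exchange.

```python
import os
import subprocess
import sys


def with_krb5_trace(env=None, target="/dev/stdout"):
    """Return a copy of env (default: current environment) with KRB5_TRACE set."""
    traced = dict(os.environ if env is None else env)
    traced["KRB5_TRACE"] = target
    return traced


def run_traced(cmd, target="/dev/stdout"):
    """Run cmd with KRB5_TRACE set; trace output goes wherever target points."""
    return subprocess.run(cmd, env=with_krb5_trace(target=target), check=False)


if __name__ == "__main__":
    # Stand-in child that just prints the variable it inherited.
    run_traced([sys.executable, "-c", "import os; print(os.environ['KRB5_TRACE'])"])
```

In practice the stand-in would be replaced by whatever triggers the Kerberos exchange, for example an `ansible-playbook` run against the failing host.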

idlebyte · January 16, 2025, 10:35pm

Ended up being something with the kerberos tickets. Changed ticket/renew lifetime and things started getting much better, more info in main message. If things go south again we’ll update but this is the first time it’s worked repeatedly/reliably.

Thank you for the recommended steps.
