A basic Ansible Automation Platform install on an Azure VM is making Kerberos-authenticated WinRM calls with win_shell, and they fail intermittently with ‘server not found in kerberos database’.
---
- name: Test Powershell Executions
  hosts: all
  ignore_unreachable: true
  gather_facts: false
  tasks:
    - name: WHOAMI
      ansible.windows.win_shell: "whoami"
      ignore_errors: true
dd-04 failed in the above image. A relaunch of the same template a moment later worked fine, and another run a minute after that failed again. I can’t find a pattern.
Our network guy has done packet captures and isn’t seeing any errors. I can run the same test 100 times and it is completely random whether there is a failure, which host fails, and which task fails. The same host on different tasks in the same playbook will go pass/fail/pass/pass/fail.
We’ve tried flushing caches, checking SPNs, and rebuilding machines. Things work great for a bit, then the intermittent failures start again. The task output reports "unreachable": true with this message: kerberos: authGSSClientStep() failed: (('Unspecified GSS failure. Minor code may provide more information', 851968), ('Server not found in kerberos database', -1765328377)). Any help with how to troubleshoot this error further would be appreciated. I don’t know where to look.
99% of the time, rerunning succeeds on the step that failed, but then fails on a different one.
RESOLUTION
On a whim we were testing krb5.conf settings and set ticket_lifetime = 30m and renew_lifetime = 1h. We ran kdestroy, and now every run works. Something was funky with the tickets, so that sometimes a run would pass and sometimes it would fail. I won’t pretend to understand why, but I’m leaving this here in case anyone else sees the same symptoms.
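For anyone who wants to try the same change, this is roughly what the relevant part of krb5.conf would look like. Treat it as a sketch: only the two values come from what we set. Placing them under [libdefaults] is where MIT Kerberos normally reads these options, and the rest of your file (realms, domain mappings, and so on) is assumed to stay as it already is.

```ini
# Sketch of the change only; keep your existing settings alongside these.
[libdefaults]
    # Values from the resolution above
    ticket_lifetime = 30m
    renew_lifetime = 1h
```

After editing the file, run kdestroy so the cached tickets are discarded and new ones are requested with the updated lifetimes.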