Hi everyone!
I am creating roles for automatic patching of machines. As part of the patching, we also clean up old kernels after a reboot. However, the command that I run to do that doesn’t work, and consistently results in an “unreachable” error.
Here is the role's playbook:
---
- name: Reboot machine after updates have taken place
  ansible.builtin.reboot:
- name: Determine active kernel
  ansible.builtin.command:
    cmd: "uname -r"
  changed_when: false
  register: cleanup_kernels_active_kernel
- name: Determine newest available kernel
  ansible.builtin.shell:
    cmd: "set -o pipefail && dnf search --showduplicates kernel-core | tail -n 1"
  changed_when: false
  register: cleanup_kernels_newest_kernel
- name: Check if the current kernel is the newest kernel
  ansible.builtin.fail:
  when: cleanup_kernels_active_kernel.stdout not in cleanup_kernels_newest_kernel.stdout
- name: Remove old kernel and module files
  ansible.builtin.command:
    cmd: dnf remove --oldinstallonly --setopt installonly_limit=2 kernel
  changed_when: false
All tasks run normally, except the final one:
TASK [company.update_roles.cleanup_kernels : Reboot machine after updates have taken place] ***
changed: [devweb1001]
changed: [devapp1001]
TASK [company.update_roles.cleanup_kernels : Determine active kernel] ***********
ok: [devweb1001]
ok: [devapp1001]
TASK [company.update_roles.cleanup_kernels : Determine newest available kernel] ***
ok: [devapp1001]
ok: [devweb1001]
TASK [company.update_roles.cleanup_kernels : Check if the current kernel is the newest kernel] ***
skipping: [devweb1001]
skipping: [devapp1001]
TASK [company.update_roles.cleanup_kernels : Remove old kernel and module files] ***
fatal: [devapp1001]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Shared connection to 10.0.0.19 closed.", "unreachable": true}
fatal: [devweb1001]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Shared connection to 10.0.0.18 closed.", "unreachable": true}
I have investigated the issue and have some observations:
- The other tasks succeed and the firewall shows no blocked traffic between the controller and the host, ruling out connection issues.
- Manually running the command works fine and rerunning the playbook works up until the final step, ruling out the possibility of a lockout.
- Replacing the command with a simple echo command works, so the command itself is tied to the issue.
- Running the command using the shell module results in the same behaviour, so the command module specifically is not at fault either.
- Running this against other machines also results in the same behaviour. All machines are RHEL9 with most of the CIS-recommended hardening implemented.
- Running with the shell module results in a Python script generated by Ansible being left behind on the host in /home/user/.ansible/tmp/ansible-tmp-<long-number>/AnsiballZ_command.py. Running this script manually hangs the shell. I suspect that this may be (part of) the cause, but I might also be executing it incorrectly.
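One possibility that fits the hanging script is that dnf is waiting at its confirmation prompt, which nobody answers when the command is run by Ansible and whose text is captured rather than shown. This is only a guess, not a confirmed cause. A minimal sketch of a variant that would rule it out, assuming the `-y` (assume yes) flag is acceptable here; the register name `cleanup_kernels_remove_result` and the second task are hypothetical additions for illustration:

```yaml
- name: Remove old kernel and module files (non-interactive variant)
  ansible.builtin.command:
    # -y answers dnf's confirmation prompt automatically (assumption: the
    # prompt is what blocks; this has not been verified on these hosts)
    cmd: dnf remove -y --oldinstallonly --setopt installonly_limit=2 kernel
  # Kept as in the original task so that -y is the only behavioural change
  changed_when: false
  # Hypothetical variable name, used only to inspect the result below
  register: cleanup_kernels_remove_result

- name: Show what dnf reported
  ansible.builtin.debug:
    var: cleanup_kernels_remove_result.stdout_lines
```

If the task still ends in UNREACHABLE with `-y` in place, the prompt can be crossed off the list and the cause lies elsewhere.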
As for our environment: we run Ansible in a container based on alpine/ansible:2.20.0.
Does anyone have an inkling as to what may be causing these issues? I am happy to provide more information, but at this point I just don’t know what else to look for.