AWX Job error after executing reboot module

I’m installing a new kernel with the dnf module in an Ansible role and triggering a reboot task as a handler.
The job finishes in an error state.

Here are the steps:

  1. Install the new kernel on a RHEL 8.6 bare-metal server using the dnf module
  2. Reboot the server (this takes more than 10 minutes)
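
Roughly, the role looks like this (a simplified sketch; file paths and task names are illustrative, not my exact code):

```yaml
# roles/kernel_update/tasks/main.yml (illustrative)
- name: Install new kernel
  ansible.builtin.dnf:
    name: kernel
    state: latest
  notify: Reboot server

# roles/kernel_update/handlers/main.yml (illustrative)
- name: Reboot server
  ansible.builtin.reboot:
    reboot_timeout: 1800  # the reboot takes longer than the module's 600-second default
```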

FYI, AWX Job error after executing reboot module · Issue #14725 · ansible/awx · GitHub

Any help here?
I have seen the related issue #12297 and have now set RECEPTOR_KUBE_SUPPORT_RECONNECT to off as suggested in #13380 (comment), but it does not seem to help.
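
For reference, this is roughly how I applied that setting (a sketch assuming the AWX Operator’s ee_extra_env mechanism, following #13380; the exact spec layout may differ in your deployment):

```yaml
# AWX custom resource spec (sketch) — per the receptor docs, the valid
# values are reportedly enabled / disabled / auto
spec:
  ee_extra_env: |
    - name: RECEPTOR_KUBE_SUPPORT_RECONNECT
      value: disabled
```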

Hello @mogamal1 welcome to the Ansible Community Forum!

Could you kindly provide a bit more information about your issue? This would greatly assist us in pinpointing the root cause of the problem.

  • What error do you get on AWX?
  • Are you using the default control plane EE on AWX, or did you just build your own with ansible-builder? What ansible-core version does it run?
  • Knowing the AWX version may also help
  • Also, if you could share your role’s implementation, or at least part of it, that would be helpful

I’m guessing you set up the credentials and user elevation properly on the AWX Job Template, since I understand from your post that the kernel update task did finish successfully; is that right?

For debugging purposes, I recommend running the role or playbook from the command line interface (CLI). This way, you can check whether the issue is connected to your AWX setup; if not, it could be something else affecting the role’s execution. I’m sharing a little test I performed using ansible-navigator, in case it helps you on your debugging quest (feel free to use my EE if you need to, it’s publicly available on quay.io):

Playbook:

---
- name: Update Kernel and reboot RHEL 8.6
  hosts: all
  gather_facts: false
  remote_user: root
  tasks:

    - name: Install new RHEL Kernel
      ansible.builtin.dnf:
        enablerepo:
          - rhel-8-for-x86_64-baseos-rpms
        name:
          - kernel.x86_64
        update_only: true
      notify: Reboot RHEL Host

  handlers:

    - name: Reboot RHEL Host
      ansible.builtin.reboot:
        pre_reboot_delay: 10
        msg: "System will be rebooting in 10 seconds..."
        reboot_command: reboot
...

CLI commands:

$ ssh-add ~/.ssh/id_ed25519_kvm.beri.cat
$ ssh-copy-id -i id_ed25519_kvm.beri.cat.pub root@192.168.30.29
$ ansible-navigator run mogamal_3062.yml \
  --eei quay.io/jordi_bericat/ansible-ee:2.15-latest \
  --inventory 192.168.30.29, \
  --private-key id_ed25519_kvm.beri.cat

Reboot handler results:

Play name: Update Kernel and reboot RHEL 8.6:1
Task name: Reboot RHEL Host
CHANGED: 192.168.30.29                                                                                                                                                                  
 0│---
 1│duration: 19.349736
 2│end: '2023-12-27T23:44:56.981841'
 3│event_loop: null
 4│host: 192.168.30.29
 5│play: Update Kernel and reboot RHEL 8.6
 6│play_pattern: all
 7│playbook: /home/beri/MyStuff/dev/repos/ansible/ansible-forum/mogamal_3062.yml
 8│remote_addr: 192.168.30.29
 9│res:
10│  _ansible_no_log: null
11│  changed: true
12│  elapsed: 17
13│  rebooted: true
14│resolved_action: ansible.builtin.reboot
15│start: '2023-12-27T23:44:37.632105'
16│task: Reboot RHEL Host
17│task_action: ansible.builtin.reboot
18│task_args: ''
19│task_path: /home/beri/MyStuff/dev/repos/ansible/ansible-forum/mogamal_3062.yml:20

PS: Just a heads up, I’m planning to run the same playbook from my AWX host in the meantime. However, it might take a bit longer because my lab server is currently undergoing maintenance, so I’ll get to it as soon as I can.

Cheers


This issue seems to happen with any task that takes more than 10 minutes to execute, even a long-running dnf install ... (one that intentionally takes more than 10 minutes to install).
AWX version: 23.0.0
Ansible-core version: 2.14.9

AWX task POD output:

2023-12-13 16:48:30,374 INFO     [08f4137fcb424101ae2f2037d88b5c57] awx.analytics.job_lifecycle job-7790 waiting
2023-12-13 16:48:31,288 INFO     [08f4137fcb424101ae2f2037d88b5c57] awx.analytics.job_lifecycle job-7790 pre run
2023-12-13 16:48:31,452 INFO     [08f4137fcb424101ae2f2037d88b5c57] awx.analytics.job_lifecycle job-7790 notifications sent
2023-12-13 16:54:59,484 INFO     [08f4137fcb424101ae2f2037d88b5c57] awx.analytics.job_lifecycle job-7790 preparing playbook
2023-12-13 16:55:02,456 INFO     [08f4137fcb424101ae2f2037d88b5c57] awx.analytics.job_lifecycle job-7790 running playbook
2023-12-13 16:55:11,892 INFO     [08f4137fcb424101ae2f2037d88b5c57] awx.analytics.job_lifecycle job-7790 work unit id received
2023-12-13 16:55:11,993 INFO     [08f4137fcb424101ae2f2037d88b5c57] awx.analytics.job_lifecycle job-7790 work unit id assigned
2023-12-13 17:04:07,233 INFO     [08f4137fcb424101ae2f2037d88b5c57] awx.main.commands.run_callback_receiver Starting EOF event processing for Job 7790
2023-12-13 17:04:07,276 INFO     [08f4137fcb424101ae2f2037d88b5c57] awx.analytics.job_lifecycle job-7790 post run
2023-12-13 17:04:12,877 INFO     [08f4137fcb424101ae2f2037d88b5c57] awx.analytics.job_lifecycle job-7790 finalize run
2023-12-13 17:04:13,173 INFO     [08f4137fcb424101ae2f2037d88b5c57] awx.analytics.job_lifecycle job-7790 notifications sent
2023-12-13 17:04:13,849 WARNING  [08f4137fcb424101ae2f2037d88b5c57] awx.main.dispatch job 7790 (error) encountered an error (rc=None), please see task stdout for details.


@mogamal1 @yhzs8

What k8s environment are you using (AKS, k3s, etc.)?

Some things to try:

  1. Enable RECEPTOR_KUBE_SUPPORT_RECONNECT in the AWX spec file

  2. Make sure your container log size is increased

  3. Enable the keep-alive setting in the AWX settings

See Job terminated in error after 4 hours · Issue #14457 · ansible/awx · GitHub
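
As a hedged illustration of point 2, container log rotation is usually controlled at the kubelet level; how you apply this depends on your distribution (node pool config on AKS, server flags on k3s, and so on):

```yaml
# KubeletConfiguration sketch — illustrative values, tune to your cluster
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
containerLogMaxSize: 100Mi
containerLogMaxFiles: 5
```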


@fosterseth we are using AKS.

Thanks for the proposal; setting K8S Ansible Runner Keep-Alive Message Interval to 180 seems to have solved this issue for any task that takes more than 10 minutes to execute.
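
In case it helps others: if I’m not mistaken, that UI setting is backed by AWX_RUNNER_KEEPALIVE_SECONDS, so it should also be possible to pin it in the AWX Operator spec via extra_settings (a sketch; adjust to your deployment):

```yaml
# AWX custom resource spec (sketch) — assumes the operator's
# extra_settings mechanism maps straight to AWX settings keys
spec:
  extra_settings:
    - setting: AWX_RUNNER_KEEPALIVE_SECONDS
      value: "180"
```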

We do have another error sample where the exact same rows were printed in the task pod log, but after only 20 seconds rather than nearly 9 minutes:

2024-01-05 20:10:45,652 INFO     [a5d9366112384c909ab58a74d02f6ac6] awx.analytics.job_lifecycle job-7972 waiting
2024-01-05 20:10:46,063 INFO     [a5d9366112384c909ab58a74d02f6ac6] awx.analytics.job_lifecycle job-7972 pre run
2024-01-05 20:10:46,213 INFO     [a5d9366112384c909ab58a74d02f6ac6] awx.analytics.job_lifecycle job-7972 notifications sent
2024-01-05 20:16:56,678 INFO     [a5d9366112384c909ab58a74d02f6ac6] awx.analytics.job_lifecycle job-7972 preparing playbook
2024-01-05 20:16:58,622 INFO     [a5d9366112384c909ab58a74d02f6ac6] awx.analytics.job_lifecycle job-7972 running playbook
2024-01-05 20:17:25,924 INFO     [a5d9366112384c909ab58a74d02f6ac6] awx.analytics.job_lifecycle job-7972 work unit id received
2024-01-05 20:17:26,009 INFO     [a5d9366112384c909ab58a74d02f6ac6] awx.analytics.job_lifecycle job-7972 work unit id assigned
2024-01-05 20:17:45,313 INFO     [a5d9366112384c909ab58a74d02f6ac6] awx.main.commands.run_callback_receiver Starting EOF event processing for Job 7972
2024-01-05 20:17:45,424 INFO     [a5d9366112384c909ab58a74d02f6ac6] awx.analytics.job_lifecycle job-7972 post run
2024-01-05 20:17:50,787 INFO     [a5d9366112384c909ab58a74d02f6ac6] awx.analytics.job_lifecycle job-7972 finalize run
2024-01-05 20:17:50,945 INFO     [a5d9366112384c909ab58a74d02f6ac6] awx.analytics.job_lifecycle job-7972 notifications sent
2024-01-05 20:17:52,009 WARNING  [a5d9366112384c909ab58a74d02f6ac6] awx.main.dispatch job 7972 (error) encountered an error (rc=None), please see task stdout for details.

Do you know of any other case where

awx.main.commands.run_callback_receiver Starting EOF event processing for Job 7972

is printed that is not due to a timeout?

Hey @yhzs8, were you able to resolve the initial issue that you ran into? If this new error is separate from the original one, can you create a new thread so that we don’t get our wires crossed?