AWX job terminated unexpectedly

Hello,
I’m having an issue with a job template on AWX whose job terminates unexpectedly.
AKS version: 1.23.12
AWX version: 21.6.0

It’s a long automation, but after about 31 minutes and roughly 7,000 executed tasks it fails every time, as you can see from the screenshot.

No errors are displayed in the stdout view in the AWX UI.

Running:
kubectl logs awx-c7647dbf5-2mt88 -n awx -c awx-task | grep error

I see this error:
2022-10-20 11:24:53,613 WARNING [a5fc6bd17e2146a883b741b0bd8a3bbd] awx.main.dispatch job 142359 (error) encountered an error (rc=None), please see task stdout for details.

I’ve tried to get logs directly from automation pod:
kubectl logs automation-job-142359-nxxpb -n awx -f

The automation stops without any error displayed, and the automation pod is destroyed on AKS.
This is the last line before the crash:
{"uuid": "5fce5d17-9f63-4705-bf05-adf9b4b2a73b", "counter": 26487, "stdout": "", "start_line": 24454, "end_line": 24454, "runner_ident": "142359", "event": "runner_on_start", "job_id": 142359, "pid": 21, "created": "2022-10-20T11:19:39.658349", "parent_uuid": "02901256-661c-2a55-e472-000000001bc8", "event_data": {"playbook": "log_query_for_all_cids_from_appservers.yml", "playbook_uuid": "ef3e2675-daf8-4ac4-84db-9b797d3ffc6e", "play": "log query on tse db for all customers", "play_uuid": "02901256-661c-2a55-e472-000000000066", "play_pattern": "res_alyante_srv_grp_prod", "task": "set facts", "task_uuid": "02901256-661c-2a55-e472-000000001bc8", "task_action": "set_fact", "task_args": "", "task_path": "/runner/project/roles/common/exec_query_with_logs_on_win_vm/tasks/log_query.yml:2", "role": "exec_query_with_logs_on_win_vm", "host": "aly110-srv", "uuid": "5fce5d17-9f63-4705-bf05-adf9b4b2a73b"}}

What could the problem be? Why is the automation-job pod destroyed unexpectedly? Could it be something like a timeout after about 31 minutes?
Who runs the command that deletes the automation-job pod?

Thank you for your help.

I’ve done another test with a simple role like this:

- name: loop date for 60 min
  # stdout is never "LOOP", so this retries 60 times with a 60-second delay (~60 minutes)
  shell: |
    date
  register: result
  until: result.stdout == "LOOP"
  retries: 60
  delay: 60

So it’s not an issue due to a timeout… it worked fine for 60 minutes.
So what is the problem? The number of tasks?

Thank you for your help.

Hi,

Your job seems to produce huge logs, so I guess your issue might be caused by https://github.com/ansible/awx/issues/11338

You’re right!!! I don’t know how much time I wasted looking for a solution :frowning:

I’ve increased the container log size limit from 50 MB (the default in my AKS node pool) to 100 MB, and the job template completed without crashing.

To verify the AKS node pool configuration:
kubectl proxy
curl -sSL "http://localhost:8001/api/v1/nodes/NODE-NAME/proxy/configz" | python3 -m json.tool
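
In the output, the relevant kubelet fields are containerLogMaxSize and containerLogMaxFiles. For anyone else who needs to raise the limit on AKS, here is a rough sketch using the az CLI custom node configuration flow (this is my understanding of how it can be done; the resource group, cluster, and pool names are placeholders, and as far as I know the kubelet config can only be set when a node pool is created):

kubeletconfig.json:
{
  "containerLogMaxSizeMB": 100,
  "containerLogMaxFiles": 5
}

az aks nodepool add --resource-group MY-RG --cluster-name MY-CLUSTER --name awxjobs --kubelet-config ./kubeletconfig.json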

The question now is… what is the right value for log rotation? 100, 200, 500 MB?

Is there any chance this will be fixed in a future AWX release? This is only a workaround… it works, but it’s not a stable solution.

Thank you for your help!

Congrats :slight_smile:

The question now is… what is the right value for log rotation? 100, 200, 500 MB?

I think there is no single right value. That configuration affects not only the logs of the EE pods but the logs of all pods on the cluster, so it totally depends on the size of the logs your jobs produce and the capacity of the node.

Is there any chance this will be fixed in a future AWX release? This is only a workaround… it works, but it’s not a stable solution.

The root cause is not in the AWX implementation, but in the behavior of the Kubernetes API.
Refer to https://github.com/kubernetes/kubernetes/issues/59902
I agree with you that this issue should be fixed, but sadly it seems it will take a long time, since the issue has been open for five years…

Regards,

I have an AKS cluster with 3 nodes dedicated to AWX, so there shouldn’t be capacity problems on the nodes. Maybe I will try 200 MB or 500 MB and hope that will be enough.

The very strange thing is that I’m migrating my automation from a very old AWX version (3.0.1.0), and the same templates worked there without errors. Is AWX 3.0.1.0 not affected by this AKS cluster issue?

Regards,
Elia

For a specific template with about 100,000 log lines I have increased the log rotation limit to 500 MB, and now it’s working.
How is it possible that the older AWX 3.0.1.0 doesn’t have any issue with log rotation on AKS?

Thank you for your help.

Hi,

How is it possible that the older AWX 3.0.1.0 doesn’t have any issue with log rotation on AKS?

The root cause of the issue is in the Kubernetes API, and recent AWX (18+) is designed to use that API to gather logs from the EE pod.
In my understanding, in the older AWX 3.x no Kubernetes API is required to run a job: there is no EE, and your playbook runs directly under a virtualenv in the AWX container.

I think that’s why this issue affects recent AWX only.
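
As a rough illustration of what I mean (this is my understanding of the mechanism, not an exact description of the internals): the receptor worker follows the automation pod’s stdout through the Kubernetes log API, essentially the same stream you get with

kubectl logs -f automation-job-142359-nxxpb -n awx

When kubelet rotates that log file mid-job, the follow breaks and the job ends up in the error state, while AWX 3.x never opened such a stream in the first place.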

Regards,

Hi,
it makes sense what you say.

Thanks for the explanation!

Hello,
I’m having a similar issue with a job terminated unexpectedly, but this time it’s not due to log rotation, which I solved last time on my AKS cluster.

In AWX I see the job (id: 144294) failed, but I can’t view the job details; the AWX UI tries to load the log without success.

So I checked the logs directly.
In the awx-task container:

2022-12-02 01:15:09,647 INFO [544931daa2964f35a9686d3004f8628b] awx.analytics.job_lifecycle job-144294 pre run
2022-12-02 01:15:09,861 INFO [544931daa2964f35a9686d3004f8628b] awx.analytics.job_lifecycle job-144294 preparing playbook
2022-12-02 01:15:09,996 INFO [544931daa2964f35a9686d3004f8628b] awx.analytics.job_lifecycle job-144294 running playbook
2022-12-02 01:15:10,332 INFO [544931daa2964f35a9686d3004f8628b] awx.analytics.job_lifecycle job-144294 work unit id received
2022-12-02 01:15:10,370 INFO [544931daa2964f35a9686d3004f8628b] awx.analytics.job_lifecycle job-144294 work unit id assigned
2022-12-02 02:01:11,660 INFO [544931daa2964f35a9686d3004f8628b] awx.main.commands.run_callback_receiver Starting EOF event processing for Job 144294
2022-12-02 02:06:17,396 INFO [544931daa2964f35a9686d3004f8628b] awx.analytics.job_lifecycle job-144294 post run
2022-12-02 02:06:55,686 INFO [544931daa2964f35a9686d3004f8628b] awx.analytics.job_lifecycle job-144294 finalize run
2022-12-02 02:06:55,756 WARNING [544931daa2964f35a9686d3004f8628b] awx.main.dispatch job 144294 (error) encountered an error (rc=None), please see task stdout for details.

In the awx-ee container:
ERROR 2022/12/02 02:06:20 Exceeded retries for reading stdout /tmp/receptor/awx-c7647dbf5-85g85/FF2xD2Oc/stdout

I’ve also checked the stdout directly in the AWX DB, in the "awx_jobevent" table, but there are no errors for this job id… probably the pod was terminated, so I only found the last command executed.

Any ideas?
Thank you very much.

Elia

Hi, is this a consistent failure you are seeing in your deployment? Based on the logs, it seems the job runs for about 45 minutes before it ends.

For debugging purposes, you could set RECEPTOR_RELEASE_WORK in the Django settings (see below). After encountering this issue again, check the receptor work unit by exec-ing into the awx pod’s awx-ee container and cat-ing the /tmp/receptor/awx-c7647dbf5-85g85/FF2xD2Oc/status file (example command below).


extra_settings:
  - setting: RECEPTOR_RELEASE_WORK
    value: "False"

If you provide that info we can help further, thanks!

AWX Team

Hi, we had the same issue with EKS and fixed it by increasing the log size, but I want to know how you did it at the template level, so that we can apply it only to a particular template.

thanks in advance

Increasing the log size is a k8s cluster-level setting (it will apply to all job pods launched) and cannot be set per AWX template. What reason did you have for wanting this at the template level?

AWX Team

Thanks for the clarity,

That is what we did for the target node using tolerations and affinity rules, but our dev environment is rebooted overnight due to budget constraints, and this creates problems with configuration inconsistencies.

This is the case with EKS

I’m interested in this, and forgive my ignorance, but how did you increase the k8s log size?

We modified the node-level kubelet configuration and updated containerLogMaxSize to 100 MiB from the default of 10.

In addition, we made sure the automation pods only get created on that node, using tolerations and node affinity rules through our deployment.
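
For reference, in AWX this kind of pinning is normally done through a container group with a customized pod spec; a rough sketch (the nodeSelector label and the toleration are made-up examples, and the image should match your execution environment):

apiVersion: v1
kind: Pod
metadata:
  namespace: awx
spec:
  serviceAccountName: default
  automountServiceAccountToken: false
  nodeSelector:
    agentpool: awxjobs          # hypothetical node label
  tolerations:
    - key: dedicated            # hypothetical taint on the AWX nodes
      operator: Equal
      value: awxjobs
      effect: NoSchedule
  containers:
    - image: quay.io/ansible/awx-ee:latest
      name: worker
      args:
        - ansible-runner
        - worker
        - --private-data-dir=/runner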

But the same setup in a different EKS region was fine; we’re still not sure why this is happening.

We also modified the "containerLogMaxSize" setting from 10 MB to 500 MB on AKS 1.25.

The problem occurs less often but is not solved.

I’m confused about the RECEPTOR_KUBE_SUPPORT_RECONNECT flag in the AWX spec… Which value should be set in this case to prevent the issue: enabled or disabled?

Thank you very much for help.

Elia

We set it to disabled.

RECEPTOR_KUBE_SUPPORT_RECONNECT should be enabled if your k8s kubelet version supports it. See here for details on the supported versions: https://github.com/ansible/receptor/pull/683
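
If you want to set it explicitly, a minimal sketch of the operator spec (assuming the ee_extra_env field of the AWX custom resource; as far as I know the variable accepts enabled, disabled, or auto):

spec:
  ee_extra_env: |
    - name: RECEPTOR_KUBE_SUPPORT_RECONNECT
      value: enabled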

AWX Team

We are on AKS 1.25.6, so it should be supported.

By the way, enabling this flag didn’t solve our issue with unexpected job terminations.

Elia