AWX job terminated unexpectedly

Hello,
I’m having an issue with a job template on AWX: a job terminated unexpectedly after 5 minutes, with no errors displayed in the stdout in the AWX UI.

AKS version: 1.26.3
AWX version: 22.3.0

The issue occurs infrequently, maybe once every 3–4 weeks, but when it happens it’s very critical for us: jobs terminate unexpectedly and our automations stop at a random point, so it’s difficult to recover the previous state.

We tried enabling the RECEPTOR_KUBE_SUPPORT_RECONNECT flag, but it didn’t solve the unexpected job terminations.
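
For reference, we enabled the flag through the AWX operator CR, roughly like this (a sketch assuming the operator’s ee_extra_env field; adjust it to however you deploy AWX):

    apiVersion: awx.ansible.com/v1beta1
    kind: AWX
    metadata:
      name: awx
    spec:
      # Extra environment variables passed to the awx-ee (receptor) container
      ee_extra_env: |
        - name: RECEPTOR_KUBE_SUPPORT_RECONNECT
          value: enabled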

Here are the errors from the AWX-EE container, in case they help:

DEBUG 2023/09/22 23:29:40 Client connected to control service @
DEBUG 2023/09/22 23:29:41 [qhGyttod] RECEPTOR_KUBE_CLIENTSET_QPS:
DEBUG 2023/09/22 23:29:41 [qhGyttod] RECEPTOR_KUBE_CLIENTSET_BURST:
DEBUG 2023/09/22 23:29:41 [qhGyttod] Initializing Kubernetes clientset
DEBUG 2023/09/22 23:29:41 [qhGyttod] QPS: 100.000000, Burst: 1000
DEBUG 2023/09/22 23:29:41 Control service closed
DEBUG 2023/09/22 23:29:41 Client disconnected from control service @
DEBUG 2023/09/22 23:29:41 Client connected to control service @
ERROR 2023/09/22 23:29:42 Exceeded retries for reading stdout /tmp/receptor/awx-6497dd64b6-gxfsw/BJC2anMq/stdout
WARNING 2023/09/22 23:29:42 Could not read in control service: read unix /var/run/receptor/receptor.sock->@: use of closed network connection
DEBUG 2023/09/22 23:29:42 Client disconnected from control service @
WARNING 2023/09/22 23:29:42 Could not close connection: close unix /var/run/receptor/receptor.sock->@: use of closed network connection
DEBUG 2023/09/22 23:29:43 [qhGyttod] streaming stdout with reconnect support
DEBUG 2023/09/22 23:30:06 Sending service advertisement: &{awx-6497dd64b6-gxfsw control 2023-09-22 23:30:06.753464322 +0000 UTC m=+8587955.906785691 1 map[type:Control Service] [{local false} {kubernetes-runtime-auth false} {kubernetes-incluster-auth false}]}

Any ideas? Thank you very much for your help.

Elia


Hi @epi82, from personal experience my recommendation would be to check the memory on the awx-task pod in particular (also have a look at the dmesg output, if you can, to validate this theory).

We suffered because we gave the awx-task pod too little memory (1Gi in our case) and the awx-manage process that oversees inventory-updates and playbook runs got OOMKilled - this was harder to find than if the awx-task pod itself had got OOMKilled!
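
In case it helps, this is roughly how the memory for the task pod can be raised via the AWX operator CR (a sketch using the operator’s task_resource_requirements field; the 2Gi/4Gi numbers are only illustrative, size them for your own workload):

    spec:
      # Requests/limits applied to the awx-task pod
      task_resource_requirements:
        requests:
          cpu: 500m
          memory: 2Gi
        limits:
          memory: 4Gi

You can then confirm the effective values with kubectl describe on the task pod.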


Hello,

Just to complement what @willthames suggested: I found myself in a similar situation when my minikube host exhausted its disk capacity, so that’s something you could also check.

cheers,


Hello @willthames, thanks for your reply.

I don’t have any limits set for the awx-task deployment.

          resources:
            requests:
              cpu: 100m
              memory: 128Mi

I think it’s a different issue from an OOMKill.

Do you end up getting any stdout from the job?

What is the exact error message you see in AWX for that job?

You may try going into Settings > Job settings and providing a value for K8S Ansible Runner Keep-Alive Message Interval (maybe something like 15 seconds?).

See this discussion on GitHub for more information (also an AKS setup).
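
If you prefer to keep that value in code rather than the UI, I believe the same setting can also be applied through the operator’s extra_settings (assuming AWX_RUNNER_KEEPALIVE_SECONDS is the API name behind that UI field; worth double-checking under /api/v2/settings/jobs/):

    spec:
      # Emit a keep-alive message on the job's stdout every 15 seconds
      extra_settings:
        - setting: AWX_RUNNER_KEEPALIVE_SECONDS
          value: "15"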


Hello,
There isn’t any stdout for the job on AWX. The job is in an error state without any messages.

It happens occasionally and doesn’t depend on the job execution time (it can be more or less than 5 minutes).

The linked discussion is about a known issue that was resolved in newer AWX and AKS versions.

The issue seems related to what is described in this discussion:
https://github.com/ansible/awx/issues/11338

I can’t tell whether that issue was solved, and if so, how.

Thank you for your help.
