Hello,
I’m having an issue with a job template on AWX: the job terminated unexpectedly after 5 minutes, and no errors were displayed in the stdout in the AWX UI.
AKS version: 1.26.3
AWX version: 22.3.0
The issue doesn’t occur often, maybe once every 3–4 weeks, but when it happens it’s very critical for us: jobs terminate unexpectedly and our automations stop at a random point, so it’s difficult to recover the previous state.
We tried enabling the RECEPTOR_KUBE_SUPPORT_RECONNECT flag, but it didn’t solve the unexpected job terminations.
These are the errors from the awx-ee container, in case they help:
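For reference, one common way to set this when AWX is deployed with the awx-operator is the `extra_settings` field on the AWX custom resource. A minimal sketch (the instance name and exact value quoting are illustrative and may vary by operator version):

```yaml
# Minimal sketch: enabling the reconnect setting via the awx-operator's
# extra_settings. Instance name and quoting are illustrative.
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx                 # hypothetical instance name
spec:
  extra_settings:
    - setting: RECEPTOR_KUBE_SUPPORT_RECONNECT
      value: "'enabled'"    # string values usually need inner quotes
```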
DEBUG 2023/09/22 23:29:40 Client connected to control service @
DEBUG 2023/09/22 23:29:41 [qhGyttod] RECEPTOR_KUBE_CLIENTSET_QPS:
DEBUG 2023/09/22 23:29:41 [qhGyttod] RECEPTOR_KUBE_CLIENTSET_BURST:
DEBUG 2023/09/22 23:29:41 [qhGyttod] Initializing Kubernetes clientset
DEBUG 2023/09/22 23:29:41 [qhGyttod] QPS: 100.000000, Burst: 1000
DEBUG 2023/09/22 23:29:41 Control service closed
DEBUG 2023/09/22 23:29:41 Client disconnected from control service @
DEBUG 2023/09/22 23:29:41 Client connected to control service @
ERROR 2023/09/22 23:29:42 Exceeded retries for reading stdout /tmp/receptor/awx-6497dd64b6-gxfsw/BJC2anMq/stdout
WARNING 2023/09/22 23:29:42 Could not read in control service: read unix /var/run/receptor/receptor.sock->@: use of closed network connection
DEBUG 2023/09/22 23:29:42 Client disconnected from control service @
WARNING 2023/09/22 23:29:42 Could not close connection: close unix /var/run/receptor/receptor.sock->@: use of closed network connection
DEBUG 2023/09/22 23:29:43 [qhGyttod] streaming stdout with reconnect support
DEBUG 2023/09/22 23:30:06 Sending service advertisement: &{awx-6497dd64b6-gxfsw control 2023-09-22 23:30:06.753464322 +0000 UTC m=+8587955.906785691 1 map[type:Control Service] [{local false} {kubernetes-runtime-auth false} {kubernetes-incluster-auth false}]}
Hi @epi82, from personal experience my recommendation would be to check the memory on the awx-task pod in particular (also have a look at the dmesg output, if you can, to validate this theory).
We suffered because we gave the awx-task pod too little memory (1Gi in our case), and the awx-manage process that oversees inventory updates and playbook runs got OOMKilled. This was harder to find than if the awx-task pod itself had been OOMKilled! If that turns out to be the cause, see the sketch below for one way to give it more headroom.
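With the awx-operator, memory for the task pod can be raised via `task_resource_requirements` in the AWX spec. A minimal sketch (the sizes are illustrative; tune them to your inventory and job sizes):

```yaml
# Minimal sketch: raising memory for the awx-task pod via the awx-operator.
# The 2Gi/4Gi figures are illustrative, not a recommendation.
spec:
  task_resource_requirements:
    requests:
      memory: 2Gi
    limits:
      memory: 4Gi
```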
Just complementary to what @willthames suggested: I found myself in a similar situation when my minikube host exhausted its disk capacity, so that’s something you could also check.