Hello,
Could you provide us with any error messages you are seeing in your AWX EE container for the job that fails? We would also like to know approximately how many jobs out of 100 are failing this way. Thank you for the additional information!
Hello,
I’ll try to capture the error messages the next time the issue happens.
The issue does not occur often, maybe once every 2 to 3 months. But when it happens, it is critical for us because jobs terminate unexpectedly and our automations stop at a random point, so it is difficult to recover the previous state.
Hello,
It happened again, and this time I captured the errors from the AWX-EE container, in case they help:
DEBUG 2023/09/22 23:29:40 Client connected to control service @
DEBUG 2023/09/22 23:29:41 [qhGyttod] RECEPTOR_KUBE_CLIENTSET_QPS:
DEBUG 2023/09/22 23:29:41 [qhGyttod] RECEPTOR_KUBE_CLIENTSET_BURST:
DEBUG 2023/09/22 23:29:41 [qhGyttod] Initializing Kubernetes clientset
DEBUG 2023/09/22 23:29:41 [qhGyttod] QPS: 100.000000, Burst: 1000
DEBUG 2023/09/22 23:29:41 Control service closed
DEBUG 2023/09/22 23:29:41 Client disconnected from control service @
DEBUG 2023/09/22 23:29:41 Client connected to control service @
ERROR 2023/09/22 23:29:42 Exceeded retries for reading stdout /tmp/receptor/awx-6497dd64b6-gxfsw/BJC2anMq/stdout
WARNING 2023/09/22 23:29:42 Could not read in control service: read unix /var/run/receptor/receptor.sock->@: use of closed network connection
DEBUG 2023/09/22 23:29:42 Client disconnected from control service @
WARNING 2023/09/22 23:29:42 Could not close connection: close unix /var/run/receptor/receptor.sock->@: use of closed network connection
DEBUG 2023/09/22 23:29:43 [qhGyttod] streaming stdout with reconnect support
DEBUG 2023/09/22 23:30:06 Sending service advertisement: &{awx-6497dd64b6-gxfsw control 2023-09-22 23:30:06.753464322 +0000 UTC m=+8587955.906785691 1 map[type:Control Service] [{local false} {kubernetes-runtime-auth false} {kubernetes-incluster-auth false}]}
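In case it helps with triage, here is a minimal Python sketch for pulling only the WARNING and ERROR entries out of a log excerpt like the one above. It assumes every entry starts with `LEVEL YYYY/MM/DD HH:MM:SS`, which is only what this excerpt shows and may not hold for every Receptor log line. The function names `split_entries` and `problems` are hypothetical helpers, not part of AWX or Receptor.

```python
import re

# Assumption: each entry begins with "LEVEL YYYY/MM/DD HH:MM:SS", as in the excerpt.
STAMP = r"\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}"
LEVELS = r"(?:DEBUG|INFO|WARNING|ERROR)"

# Zero-width split point in front of every "LEVEL timestamp" prefix, so it
# also works when several entries were pasted onto a single line.
ENTRY_START = re.compile(rf"(?=\b{LEVELS} {STAMP})")
ENTRY = re.compile(rf"({LEVELS}) ({STAMP}) ?(.*)", re.S)


def split_entries(text):
    """Return a list of (level, timestamp, message) tuples found in text."""
    entries = []
    for chunk in ENTRY_START.split(text):
        match = ENTRY.match(chunk.strip())
        if match:
            level, stamp, message = match.groups()
            entries.append((level, stamp, message.strip()))
    return entries


def problems(text):
    """Keep only the WARNING and ERROR entries."""
    return [e for e in split_entries(text) if e[0] in ("WARNING", "ERROR")]


if __name__ == "__main__":
    import sys

    # Usage: python filter_log.py < receptor.log
    for level, stamp, message in problems(sys.stdin.read()):
        print(level, stamp, message)
```

Run against the excerpt above, this should leave the "Exceeded retries for reading stdout" error and the two "use of closed network connection" warnings, which are the entries closest to the job failure.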