Hello,
I’m having an issue with a job template on AWX that terminated unexpectedly after 5 minutes. No errors are displayed in the stdout in the AWX UI.
AKS version: 1.26.3
AWX version: 22.3.0
The issue doesn’t occur often, maybe once every 3-4 weeks, but when it happens it’s very critical for us: jobs terminate unexpectedly and our automations stop at a random point, so it’s difficult to recover the previous state.
We tried enabling the RECEPTOR_KUBE_SUPPORT_RECONNECT flag, but it didn’t solve the unexpected job terminations.
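(For anyone following along: one way to set this flag, assuming an awx-operator based deployment, is through the CR’s extra_settings. The sketch below uses placeholder names, and the exact quoting of string values may differ by operator version, so double-check against the operator docs.)

```yaml
# Sketch: enable receptor's reconnect support through the AWX custom resource.
# The "awx" name and namespace are placeholders for your own deployment.
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
  namespace: awx
spec:
  extra_settings:
    # extra_settings values are rendered verbatim into the Django settings
    # file, hence the inner quotes around the string value.
    - setting: RECEPTOR_KUBE_SUPPORT_RECONNECT
      value: "'enabled'"
```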
These are the errors from the awx-ee container, in case they help:
DEBUG 2023/09/22 23:29:40 Client connected to control service @
DEBUG 2023/09/22 23:29:41 [qhGyttod] RECEPTOR_KUBE_CLIENTSET_QPS:
DEBUG 2023/09/22 23:29:41 [qhGyttod] RECEPTOR_KUBE_CLIENTSET_BURST:
DEBUG 2023/09/22 23:29:41 [qhGyttod] Initializing Kubernetes clientset
DEBUG 2023/09/22 23:29:41 [qhGyttod] QPS: 100.000000, Burst: 1000
DEBUG 2023/09/22 23:29:41 Control service closed
DEBUG 2023/09/22 23:29:41 Client disconnected from control service @
DEBUG 2023/09/22 23:29:41 Client connected to control service @
ERROR 2023/09/22 23:29:42 Exceeded retries for reading stdout /tmp/receptor/awx-6497dd64b6-gxfsw/BJC2anMq/stdout
WARNING 2023/09/22 23:29:42 Could not read in control service: read unix /var/run/receptor/receptor.sock->@: use of closed network connection
DEBUG 2023/09/22 23:29:42 Client disconnected from control service @
WARNING 2023/09/22 23:29:42 Could not close connection: close unix /var/run/receptor/receptor.sock->@: use of closed network connection
DEBUG 2023/09/22 23:29:43 [qhGyttod] streaming stdout with reconnect support
DEBUG 2023/09/22 23:30:06 Sending service advertisement: &{awx-6497dd64b6-gxfsw control 2023-09-22 23:30:06.753464322 +0000 UTC m=+8587955.906785691 1 map[type:Control Service] [{local false} {kubernetes-runtime-auth false} {kubernetes-incluster-auth false}]}
Hi @epi82, from personal experience my recommendation would be to check the memory on the awx-task pod in particular (also have a look at the dmesg output if you can, to validate this theory).
We suffered because we gave the awx-task pod too little memory (1Gi in our case) and the awx-manage process that oversees inventory-updates and playbook runs got OOMKilled - this was harder to find than if the awx-task pod itself had got OOMKilled!
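If OOM does turn out to be the culprit and you deploy AWX with the awx-operator, the CR lets you raise the task pod’s requests/limits. A rough sketch below (task_resource_requirements is the operator spec field; the name/namespace and the sizes are placeholders, not recommendations):

```yaml
# Sketch: give the awx-task pod more memory headroom via the AWX custom resource.
# Values here are examples only; size them to your workload.
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
  namespace: awx
spec:
  task_resource_requirements:
    requests:
      cpu: 500m
      memory: 2Gi
    limits:
      cpu: "2"
      memory: 4Gi
```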
Just to complement what @willthames suggested: I found myself in a similar situation when my minikube host exhausted its disk capacity, so that’s something you could check as well.
Hello,
It happened again last night. In case it helps, I’m adding a picture.
As you can see, the job status is “Failed” after 1h 30min (the elapsed time can be greater or smaller; it doesn’t depend on that), and there are NO errors in the stdout.
With the provided information, the only thing I can say is that the number of tasks you’re executing in this job is just monstrous! I mean, 10k tasks for an AWX job is quite a lot… I’ve never seen that amount, really! And the output growing up to 35k lines… That’s a veeery large log as well. I was about to suggest enabling verbosity on the job template, but then the logs could grow enormous.
Have you considered splitting this job into several smaller ones and then executing them serially in a Workflow Job Template? This would make debugging easier and, more importantly, it would also help prevent exhausting the resources assigned to the job pod (if that’s the problem you’re facing).
For example, you could start by separating the two plays you have in the playbook into two different Job Templates (divide and conquer is always a good strategy for debugging). Then see if the problem persists. If it does, you can try splitting further whichever one is failing.
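As a rough illustration (hypothetical file names, placeholder tasks), the split could be as simple as moving each play into its own playbook, pointing a separate Job Template at each file, and chaining them in a Workflow Job Template:

```yaml
# play_one.yml (hypothetical) - first play taken out of the original playbook,
# run by its own Job Template
- name: First play
  hosts: group_a
  tasks:
    - name: Placeholder task
      ansible.builtin.debug:
        msg: "first half of the original playbook"

# play_two.yml (hypothetical) - second play, run by a second Job Template
# chained after the first one in a Workflow Job Template
- name: Second play
  hosts: group_b
  tasks:
    - name: Placeholder task
      ansible.builtin.debug:
        msg: "second half of the original playbook"
```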
PS: I say all of this without knowing what your playbook does; IDK if that would be an option for you, but I believe it’s worth a shot.
The point is that the error is not due to the template’s complexity. We have been monitoring these “random” failed jobs for a long time. They happen approximately every 2-3 weeks and can occur with a long job (like the one in the picture) or a simple one.
It seems to be caused by some AKS event like a log rotation, and when it happens, any automation job (pod running on AKS) fails with no errors.
We’ve tried increasing the default log size for AKS (the “containerLogMaxSizeMB” property), but it didn’t solve the issue; it simply happens less frequently.
We’ve also tried enabling the “RECEPTOR_KUBE_SUPPORT_RECONNECT” flag, but the result is the same.
I’ve been debugging a very similar issue lately (also on AKS… and very intermittent).
Presentation of the issue:
The job failed with
Failed to JSON parse a line from worker stream. Error: Expecting value: line 1 column 1 (char 0) Line with invalid JSON data: b
with an error in the receptor log like
Exceeded retries for reading stdout /tmp/receptor/automation-controller-task-668d6ddd55-jsm6x/bBXPIvLh/stdout
(I have not been able to reproduce this personally)
So we set RECEPTOR_RELEASE_WORK and AWX_CLEANUP_PATHS to False (see the CR sketch at the end of this post).
This will preserve the work unit and private_data_dir on the EE container AND on the pod that runs the job (WARNING: this WILL CAUSE ADDITIONAL STORAGE USAGE, and the settings should be reverted afterwards).
Run jobs until the failure happens, then
grab the stdout from the job pod; it should look something like…
I don’t know why I can’t continue to edit this message and add more information…
but I’m going to keep appending debug/investigation notes that we can go through to this post… (it might be a bit stream-of-consciousness and less organized, though…)
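To start with the settings step above: a minimal sketch of setting RECEPTOR_RELEASE_WORK and AWX_CLEANUP_PATHS to False through the awx-operator CR’s extra_settings (an assumption on my part; adapt it to however your instance is configured, and remember to revert both afterwards):

```yaml
# Sketch: AWX CR spec fragment that keeps receptor work units and the job's
# private_data_dir around for debugging. Revert to the defaults (True) once
# you have captured what you need - this accumulates storage.
spec:
  extra_settings:
    - setting: RECEPTOR_RELEASE_WORK
      value: "False"
    - setting: AWX_CLEANUP_PATHS
      value: "False"
```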
Because there’s a timeout on being able to edit. At some point you have to say “this is now part of the history of the project and we shouldn’t mess with it”. Your post was well over a month ago; editing it now is well beyond “I made a typo” or “I forgot to add something”.
For example, in our investigation of an AKS-based deployment, in one case we see
"INFO 2023/11/09 13:19:30 [bMmtv8Hu] Detected Error: EOF for pod ansible-automation-platform/automation-job-31741-lb82g. Will retry 5 more times.",
"WARNING 2023/11/09 13:19:30 [bMmtv8Hu] Error opening log stream for pod ansible-automation-platform/automation-job-31741-lb82g. Will retry 5 more times. Error: Get \"https://10.49.253.13:10250/containerLogs/ansible-automation-platform/automation-job-31741-lb82g/worker?follow=true&sinceTime=0001-01-01T00%3A00%3A00Z×tamps=true\": proxy error from localhost:9443 while dialing 10.49.253.13:10250, code 503: 503 Service Unavailable",
"WARNING 2023/11/09 13:19:32 [bMmtv8Hu] Error opening log stream for pod ansible-automation-platform/automation-job-31741-lb82g. Will retry 4 more times. Error: Get \"https://10.49.253.13:10250/containerLogs/ansible-automation-platform/automation-job-31741-lb82g/worker?follow=true&sinceTime=0001-01-01T00%3A00%3A00Z×tamps=true\": proxy error from localhost:9443 while dialing 10.49.253.13:10250, code 503: 503 Service Unavailable",
"WARNING 2023/11/09 13:19:33 [bMmtv8Hu] Error opening log stream for pod ansible-automation-platform/automation-job-31741-lb82g. Will retry 3 more times. Error: Get \"https://10.49.253.13:10250/containerLogs/ansible-automation-platform/automation-job-31741-lb82g/worker?follow=true&sinceTime=0001-01-01T00%3A00%3A00Z×tamps=true\": proxy error from localhost:9443 while dialing 10.49.253.13:10250, code 503: 503 Service Unavailable",
"WARNING 2023/11/09 13:19:34 [bMmtv8Hu] Error opening log stream for pod ansible-automation-platform/automation-job-31741-lb82g. Will retry 2 more times. Error: Get \"https://10.49.253.13:10250/containerLogs/ansible-automation-platform/automation-job-31741-lb82g/worker?follow=true&sinceTime=0001-01-01T00%3A00%3A00Z×tamps=true\": proxy error from localhost:9443 while dialing 10.49.253.13:10250, code 503: 503 Service Unavailable",
"WARNING 2023/11/09 13:19:36 [bMmtv8Hu] Error opening log stream for pod ansible-automation-platform/automation-job-31741-lb82g. Will retry 1 more times. Error: Get \"https://10.49.253.13:10250/containerLogs/ansible-automation-platform/automation-job-31741-lb82g/worker?follow=true&sinceTime=0001-01-01T00%3A00%3A00Z×tamps=true\": proxy error from localhost:9443 while dialing 10.49.253.13:10250, code 503: 503 Service Unavailable",
"ERROR 2023/11/09 13:19:37 [bMmtv8Hu] Error opening log stream for pod ansible-automation-platform/automation-job-31741-lb82g. Error: Get \"https://10.49.253.13:10250/containerLogs/ansible-automation-platform/automation-job-31741-lb82g/worker?follow=true&sinceTime=0001-01-01T00%!A(MISSING)00%!A(MISSING)00Z×tamps=true\": proxy error from localhost:9443 while dialing 10.49.253.13:10250, code 503: 503 Service Unavailable",
in the awx-ee container
Our speculation is that during the job execution the kube-apiserver ran into some problem and the log stream got disconnected, and when we try to reconnect we are not able to.
“containerLogMaxSize”: “500M” - I have increased it (the default should be 50M) to reduce the number of times the problem occurs. I think the issue is due to the AKS log rotation.
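For anyone wanting to do the same on AKS: the container log size is part of AKS custom node configuration, which takes a kubelet config file when the node pool is created (sketch below; the field and flag names are from the AKS custom node configuration docs, so verify them against your CLI version):

```yaml
# Sketch of the kubelet settings to raise the container log size on AKS.
# AKS consumes this as a JSON file, e.g.:
#   az aks nodepool add ... --kubelet-config ./kubeletconfig.json
# Shown here as the equivalent mapping; values are examples (the AKS default is 50).
containerLogMaxSizeMB: 500
containerLogMaxFiles: 5
```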
Note that the fix is for the EE in the control plane (the awx-ee container in the awx-task pod), so make sure the control plane uses updated images. This can’t be achieved via the UI (modifying the Control Plane Execution Environment in the UI does not help).
To make the control plane use the updated image, I recommend either of the following:
A) Add image_pull_policy: Always to the AWX CR and re-deploy AWX to trigger a restart of the task pod (you can remove image_pull_policy later); see the sketch below
B) Remove the cached quay.io/ansible/awx-ee:latest image on all K8s nodes and restart the task pod
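For option A, a minimal sketch of the CR change (assuming an awx-operator based deployment; the name/namespace are placeholders):

```yaml
# Sketch: force the task pod to re-pull its images on the next restart.
# Remove image_pull_policy again once the updated awx-ee image is in place.
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
  namespace: awx
spec:
  image_pull_policy: Always
```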