AWX job terminated unexpectedly

Hello,
I’m having an issue with a job template on AWX that terminated unexpectedly after 5 minutes. No errors are displayed in the stdout in the AWX UI.

AKS version: 1.26.3
AWX version: 22.3.0

The issue doesn’t occur very often, maybe once every 3–4 weeks. But when it happens it’s very critical for us, because jobs terminate unexpectedly and our automations stop at a random point, so it’s difficult to recover the previous state.

We tried enabling the RECEPTOR_KUBE_SUPPORT_RECONNECT flag, but it didn’t solve our issue with unexpected job terminations.

These are the errors from the awx-ee container, in case they help:

DEBUG 2023/09/22 23:29:40 Client connected to control service @
DEBUG 2023/09/22 23:29:41 [qhGyttod] RECEPTOR_KUBE_CLIENTSET_QPS:
DEBUG 2023/09/22 23:29:41 [qhGyttod] RECEPTOR_KUBE_CLIENTSET_BURST:
DEBUG 2023/09/22 23:29:41 [qhGyttod] Initializing Kubernetes clientset
DEBUG 2023/09/22 23:29:41 [qhGyttod] QPS: 100.000000, Burst: 1000
DEBUG 2023/09/22 23:29:41 Control service closed
DEBUG 2023/09/22 23:29:41 Client disconnected from control service @
DEBUG 2023/09/22 23:29:41 Client connected to control service @
ERROR 2023/09/22 23:29:42 Exceeded retries for reading stdout /tmp/receptor/awx-6497dd64b6-gxfsw/BJC2anMq/stdout
WARNING 2023/09/22 23:29:42 Could not read in control service: read unix /var/run/receptor/receptor.sock->@: use of closed network connection
DEBUG 2023/09/22 23:29:42 Client disconnected from control service @
WARNING 2023/09/22 23:29:42 Could not close connection: close unix /var/run/receptor/receptor.sock->@: use of closed network connection
DEBUG 2023/09/22 23:29:43 [qhGyttod] streaming stdout with reconnect support
DEBUG 2023/09/22 23:30:06 Sending service advertisement: &{awx-6497dd64b6-gxfsw control 2023-09-22 23:30:06.753464322 +0000 UTC m=+8587955.906785691 1 map[type:Control Service] [{local false} {kubernetes-runtime-auth false} {kubernetes-incluster-auth false}]}

Any ideas? Thank you very much for your help.

Elia


Hi @epi82, from personal experience my recommendation would be to check the memory on the awx-task pod in particular (also have a look at the dmesg output if you can, to validate this theory).

We suffered because we gave the awx-task pod too little memory (1Gi in our case) and the awx-manage process that oversees inventory updates and playbook runs got OOMKilled - this was harder to find than if the awx-task pod itself had been OOMKilled!
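In case it helps, this is roughly how we went looking for it (a rough sketch; the namespace and the pod name placeholder are assumptions, adjust them to your deployment):

    # on the node (or via a node debug shell), look for OOM killer activity around the failure time
    dmesg -T | grep -iE 'out of memory|oom-kill|killed process'

    # the container itself may not restart when only a child process is killed,
    # but it is still worth checking the last state of the task pod containers
    kubectl -n awx describe pod <your-awx-task-pod> | grep -A 3 'Last State'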


Hello,

Just to complement what @willthames suggested: I found myself in a similar situation when my minikube host exhausted its disk capacity, so that’s something you could check as well.
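A couple of quick things to look at for that (a rough sketch; the second path is an assumption, adjust it for your container runtime):

    # check whether any node is reporting disk pressure
    kubectl describe nodes | grep -i 'DiskPressure'

    # and how full the host actually is (adjust the path to wherever your runtime stores images/logs)
    df -h / /var/lib/containerd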

cheers,


Hello @willthames, thanks for your reply.

I don’t have any limits set for the awx-task deployment, only requests:

          resources:
            requests:
              cpu: 100m
              memory: 128Mi

I think it was a different issue from an OOMKill.
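For completeness, if we decide to add limits later, I understand the operator lets us set them from the AWX CR rather than editing the deployment by hand (a rough sketch, assuming the field is task_resource_requirements; the values are just examples and should be checked against the operator docs for your version):

    spec:
      task_resource_requirements:
        requests:
          cpu: 100m
          memory: 128Mi
        limits:
          cpu: "1"
          memory: 2Gi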

Do you end up getting any stdout from the job?

What is the exact error message you see in AWX for that job?

You may try going into Settings > Jobs settings and providing a value for

K8S Ansible Runner Keep-Alive Message Interval (maybe something like 15 seconds?)

See this discussion on GitHub for more information (also an AKS setup).
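If you prefer setting it through the API instead of the UI, something like the following should work (a rough sketch; I believe the key behind that UI label is AWX_RUNNER_KEEPALIVE_SECONDS, but please confirm by browsing /api/v2/settings/jobs/ first):

    # list the current job settings and confirm the exact key name
    curl -sk -u admin:<password> https://<awx-host>/api/v2/settings/jobs/ | python3 -m json.tool

    # then set the keep-alive interval (15 seconds here, as suggested above)
    curl -sk -u admin:<password> -H "Content-Type: application/json" \
      -X PATCH https://<awx-host>/api/v2/settings/jobs/ \
      -d '{"AWX_RUNNER_KEEPALIVE_SECONDS": 15}'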


Hello,
there isn’t any stdout for the job on AWX. The job is in the Error state with no messages.

It happens occasionally and doesn’t depend on the job execution time (it could be more or less than 5 minutes).

The linked discussion is about a known issue that was resolved with newer AWX and AKS versions.

The issue seems to be related to something you can see in this discussion:
https://github.com/ansible/awx/issues/11338

I can’t tell whether that issue was solved, and how.

Thank you for your help.


Hello,
it happened again last night. In case it helps, I’m adding a picture.
As you can see, the job status is “Failed” after 1h 30min (the elapsed time could be greater or lesser; the issue doesn’t depend on it) and there are NO errors in the stdout.

Thank you for your help.

Hello @epi82,

With the provided information, the only thing I can say is that the number of tasks you’re executing in this job is just monstrous! I mean, 10k tasks for an AWX job is quite a lot… I’ve never seen that amount, really! And the output growing up to 35k lines… That’s a very large log as well. I was about to suggest enabling verbosity on the job template, but that way the logs could grow enormous.

Have you considered splitting this job into several smaller ones and then executing them serially in a Workflow Job Template? This would make debugging easier and, more importantly, it would also help prevent exhausting the resources assigned to the job pod (if that’s the problem you’re facing).

For example, you could start by separating the two plays you have in the playbook into two different Job Templates (divide and conquer is always a good strategy for debugging). Then see if the problem persists. If it does, you can try splitting whichever one is failing even further.

PS: I say all of this without knowing what your playbook does. I don’t know if that would be an option for you, but I believe it’s worth a shot.


Hello @jbericat, and thank you for your reply.

The point is that the error is not due to template complexity. We have been monitoring these “random” failed jobs for a long time. They happen approximately every 2–3 weeks and can occur with a long job (like the one in the picture) or a simple one.

It seems to be caused by some AKS event, like a log rotation: when it happens, any running automation job (pod running on AKS) fails with no errors.
We’ve tried increasing the default log size for AKS (the “containerLogMaxSizeMB” property), but it didn’t solve the problem; the issue simply happens less frequently.
We’ve tried enabling the “RECEPTOR_KUBE_SUPPORT_RECONNECT” flag, but the result is the same.
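For anyone wanting to reproduce what we changed: as far as I know the kubelet log settings on AKS are applied through a custom node configuration when a node pool is created, roughly like this (a rough sketch; resource group, cluster and pool names are placeholders, and the exact field names should be verified against the AKS docs):

    # kubeletconfig.json with the value we ended up using
    cat > kubeletconfig.json <<'EOF'
    {
      "containerLogMaxSizeMB": 500
    }
    EOF

    # custom kubelet config is passed at node pool creation time
    az aks nodepool add \
      --resource-group <rg> \
      --cluster-name <cluster> \
      --name <nodepool> \
      --kubelet-config ./kubeletconfig.json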

I don’t know what more we can do.


I’ve been debugging a very similar issue lately (also on AKS… and very intermittent).

Presentation of the issue:

Job failed with

Failed to JSON parse a line from worker stream. Error: Expecting value: line 1 column 1 (char 0) Line with invalid JSON data: b

with an error in the receptor log like

Exceeded retries for reading stdout /tmp/receptor/automation-controller-task-668d6ddd55-jsm6x/bBXPIvLh/stdout

(I have not been able to reproduce this personally)

So we set RECEPTOR_RELEASE_WORK and AWX_CLEANUP_PATHS to False.

This will preserve the workunit and private_data_dir on the EE container AND the pod that runs the job (WARNING: !!!THIS WILL CAUSE ADDITIONAL STORAGE USAGE!!! and should be turned back on afterwards).
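(a rough sketch of flipping those two via the API, assuming an admin account; both keys should be visible under /api/v2/settings/jobs/, verify the names there first)

    # preserve the receptor workunit and the private data dir for post-mortem debugging
    # WARNING: remember to set these back to true afterwards, disk usage will grow
    curl -sk -u admin:<password> -H "Content-Type: application/json" \
      -X PATCH https://<awx-host>/api/v2/settings/jobs/ \
      -d '{"RECEPTOR_RELEASE_WORK": false, "AWX_CLEANUP_PATHS": false}'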

Run jobs until the failure happens, then:

Grab the stdout from the job pod; it should be something like

oc logs automation-job-<job_id>-<pod_id>

Exec into the EE container on the task pod

oc exec -it deployment/<awx>-task -c <awx>-ee -- bash

Find the stdout file in the receptor workunit dir

/tmp/receptor/awx-task-<replicaset_id>-<pod_id>/<workunit>

Compare the stdout from the pod to the stdout file.

Question: do they match, and do they both contain

{zipfile: <size>}
<base64 encoded zipfile>{"eof": true}

at the end of the file?
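A quick way to eyeball the tail of both streams (a rough sketch, reusing the placeholders above):

    # last few lines of the job pod log
    oc logs automation-job-<job_id>-<pod_id> | tail -n 3

    # last few lines of the receptor stdout file (run inside the awx-ee container)
    tail -n 3 /tmp/receptor/awx-task-<replicaset_id>-<pod_id>/<workunit>/stdout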

If they both contain the zipfile portion, play it through the ansible-runner process:

cat stdout | ansible-runner process .

This will produce the stdout (on screen) AND the artifact directory transmitted back from ansible-runner worker.
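If it processes cleanly, you should end up with something like this in the directory you ran it from (hedged; the path to the preserved stdout is a placeholder and the exact artifact layout can vary between ansible-runner versions):

    # run in an empty scratch dir so it is obvious what gets created
    mkdir /tmp/runner-debug && cd /tmp/runner-debug
    cat /path/to/the/preserved/stdout | ansible-runner process .
    ls -R ./artifacts    # job events, rc and status reconstructed from the stream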

I’m just gathering data… so far this has only been reported on AKS.


If this is caused by log rotation, then I would expect that the log from the pod and the receptor stdout file WILL NOT match.


This type of debugging is rock solid and I feel it should be documented more centrally, maybe:


I don’t know why I can’t continue to edit this message and add more information…

but I’m going to continue to append the debugging/investigation steps that we can go through to this post… (it might be a bit stream-of-consciousness and less organized, though…)

@gwmngilfen why can’t I edit my previous post?


Because there’s a timeout on being able to edit. At some point you have to say “this is now part of the history of the project and we shouldn’t mess with it”. Your post was well over a month ago; editing it now is well beyond “I made a typo” or “I forgot to add something”.

Failed to JSON parse a line from worker stream. Error: Expecting value: line 1 column 1 (char 0) Line with invalid JSON data: b'

This error message likely comes from ansible-runner/src/ansible_runner/streaming.py at 874f5cbc3296e0b305812b3fdb867d426fa7a759 · ansible/ansible-runner · GitHub.

This happens in ansible-runner process.

For example, in our investigation of an AKS-based deployment, in one case we see:

"INFO 2023/11/09 13:19:30 [bMmtv8Hu] Detected Error: EOF for pod ansible-automation-platform/automation-job-31741-lb82g. Will retry 5 more times.", 
"WARNING 2023/11/09 13:19:30 [bMmtv8Hu] Error opening log stream for pod ansible-automation-platform/automation-job-31741-lb82g. Will retry 5 more times. Error: Get \"https://10.49.253.13:10250/containerLogs/ansible-automation-platform/automation-job-31741-lb82g/worker?follow=true&sinceTime=0001-01-01T00%3A00%3A00Z&timestamps=true\": proxy error from localhost:9443 while dialing 10.49.253.13:10250, code 503: 503 Service Unavailable", 
"WARNING 2023/11/09 13:19:32 [bMmtv8Hu] Error opening log stream for pod ansible-automation-platform/automation-job-31741-lb82g. Will retry 4 more times. Error: Get \"https://10.49.253.13:10250/containerLogs/ansible-automation-platform/automation-job-31741-lb82g/worker?follow=true&sinceTime=0001-01-01T00%3A00%3A00Z&timestamps=true\": proxy error from localhost:9443 while dialing 10.49.253.13:10250, code 503: 503 Service Unavailable", 
"WARNING 2023/11/09 13:19:33 [bMmtv8Hu] Error opening log stream for pod ansible-automation-platform/automation-job-31741-lb82g. Will retry 3 more times. Error: Get \"https://10.49.253.13:10250/containerLogs/ansible-automation-platform/automation-job-31741-lb82g/worker?follow=true&sinceTime=0001-01-01T00%3A00%3A00Z&timestamps=true\": proxy error from localhost:9443 while dialing 10.49.253.13:10250, code 503: 503 Service Unavailable", 
"WARNING 2023/11/09 13:19:34 [bMmtv8Hu] Error opening log stream for pod ansible-automation-platform/automation-job-31741-lb82g. Will retry 2 more times. Error: Get \"https://10.49.253.13:10250/containerLogs/ansible-automation-platform/automation-job-31741-lb82g/worker?follow=true&sinceTime=0001-01-01T00%3A00%3A00Z&timestamps=true\": proxy error from localhost:9443 while dialing 10.49.253.13:10250, code 503: 503 Service Unavailable", 
"WARNING 2023/11/09 13:19:36 [bMmtv8Hu] Error opening log stream for pod ansible-automation-platform/automation-job-31741-lb82g. Will retry 1 more times. Error: Get \"https://10.49.253.13:10250/containerLogs/ansible-automation-platform/automation-job-31741-lb82g/worker?follow=true&sinceTime=0001-01-01T00%3A00%3A00Z&timestamps=true\": proxy error from localhost:9443 while dialing 10.49.253.13:10250, code 503: 503 Service Unavailable", 
"ERROR 2023/11/09 13:19:37 [bMmtv8Hu] Error opening log stream for pod ansible-automation-platform/automation-job-31741-lb82g. Error: Get \"https://10.49.253.13:10250/containerLogs/ansible-automation-platform/automation-job-31741-lb82g/worker?follow=true&sinceTime=0001-01-01T00%!A(MISSING)00%!A(MISSING)00Z&timestamps=true\": proxy error from localhost:9443 while dialing 10.49.253.13:10250, code 503: 503 Service Unavailable", 

in the awx-ee container

Our speculation is that during the job execution the kube-apiserver ran into some problem and the log stream got disconnected, and when we tried to reconnect we were not able to.

What I’m confused about is the

https://10.49.253.13:10250/containerLogs/ansible-automation-platform/automation-job-31741-lb82g/worker?follow=true&sinceTime=0001-01-01T00%!A(MISSING)00%!A(MISSING)00Z&timestamps=true

specifically

sinceTime=0001-01-01T00%!A(MISSING)00%!A(MISSING)00Z

This is where

"INFO 2023/11/09 13:19:30 [bMmtv8Hu] Detected Error: EOF for pod ansible-automation-platform/automation-job-31741-lb82g. Will retry 5 more times.", 

occurred:
https://github.com/ansible/receptor/blob/devel/pkg/workceptor/kubernetes.go/#L249-L255

Then it breaks out of this for loop:
https://github.com/ansible/receptor/blob/devel/pkg/workceptor/kubernetes.go/#L238

and proceeds back to the next iteration of
https://github.com/ansible/receptor/blob/devel/pkg/workceptor/kubernetes.go/#L204

It was able to successfully get the pod details and got past
https://github.com/ansible/receptor/blob/devel/pkg/workceptor/kubernetes.go/#L210

Then it tries to re-establish the log stream to the job pod:
https://github.com/ansible/receptor/blob/devel/pkg/workceptor/kubernetes.go/#L231

AND here’s the problem…

It discarded the previously established and recorded sinceTime and started the new log request with the default (zero) value of time.Time, which is why the request above shows sinceTime=0001-01-01T00:00:00Z.

I tossed up Fix K8S log reconnect timestamp tracking by TheRealHaoLiu · Pull Request #939 · ansible/receptor · GitHub to fix this bug.

Do we think this has something to do with the JSON parsing? It’s possible, but not definitive.

I created a patched awx-ee image: quay.io/haoliu/awx-ee:receptor-939

If anyone who can reproduce this problem can help test this out, I would be grateful.


Just a question to be sure: how did you configure RECEPTOR_KUBE_SUPPORT_RECONNECT?

Currently, for the “awx-task-xxx” pod, I have set the environment variable “RECEPTOR_KUBE_SUPPORT_RECONNECT: enabled” on the “awx-ee” container.

Is it right?
Thank you.

Thanks, you are right :smiley:
I missed this line in the log you posted, sorry.

DEBUG 2023/09/22 23:29:43 [qhGyttod] streaming stdout with reconnect support

This line is not output unless the reconnect feature is enabled, so it looks like the configuration is correct.
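As a side note, if your instance is managed by AWX Operator, the same variable can be persisted through the AWX CR instead of patching the deployment by hand, so it survives reconciliation (a rough sketch, assuming the ee_extra_env field; please double-check the field name against your operator version):

    spec:
      ee_extra_env: |
        - name: RECEPTOR_KUBE_SUPPORT_RECONNECT
          value: enabled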

Sorry for the additional questions, but could you please provide:

  • The value of containerLogMaxSizeMB for your AKS node pools (the ones where the automation job pods run)
  • The node size (D2s_v3, B2s, …) of your node pools

Yes sure.

  • “containerLogMaxSize”: “500M” - I have increased it (the default should be 50M) to reduce the number of times the problem occurs. I think the problem is due to the AKS log rotation.

  • Node size: Standard_DS3_v2

Thank you for your help.

Thanks for clarifying!

@TheRealHaoLiu has now done a lot of work on this in Failed to JSON parse a line from worker stream due to unexpected EOF(b'') · Issue #14693 · ansible/awx · GitHub, and quay.io/ansible/awx-ee:latest has been updated.

Note that the fix is for the EE in the control plane (the awx-ee container in the awx-task pod), so make sure the control plane uses the updated image. This can’t be achieved via the UI (modifying Control Plane Execution Environment in the UI does not help).

To make the control plane use the updated image, I recommend either of the following:

  • A) Add image_pull_policy: Always to the AWX CR and re-deploy AWX to trigger a restart of the task pod (image_pull_policy can be removed later); see the sketch below the list
  • B) Remove the cached quay.io/ansible/awx-ee:latest image on all K8s nodes and restart the task pod
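For option A, the change to the AWX CR is roughly the following (a rough sketch; the CR name awx is a placeholder for your own resource name):

    apiVersion: awx.ansible.com/v1beta1
    kind: AWX
    metadata:
      name: awx
    spec:
      # pull the latest control plane images on every pod (re)start;
      # this can be reverted once the updated awx-ee image has been pulled
      image_pull_policy: Always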