AWX job terminated unexpectedly

Failed to JSON parse a line from worker stream. Error: Expecting value: line 1 column 1 (char 0) Line with invalid JSON data: b'

this error message likely comes from ansible-runner/src/ansible_runner/streaming.py at 874f5cbc3296e0b305812b3fdb867d426fa7a759 · ansible/ansible-runner · GitHub

this happens in ansible-runner process

for example in our investigation on an AKS based deployment one case we see

"INFO 2023/11/09 13:19:30 [bMmtv8Hu] Detected Error: EOF for pod ansible-automation-platform/automation-job-31741-lb82g. Will retry 5 more times.", 
"WARNING 2023/11/09 13:19:30 [bMmtv8Hu] Error opening log stream for pod ansible-automation-platform/automation-job-31741-lb82g. Will retry 5 more times. Error: Get \"https://10.49.253.13:10250/containerLogs/ansible-automation-platform/automation-job-31741-lb82g/worker?follow=true&sinceTime=0001-01-01T00%3A00%3A00Z&timestamps=true\": proxy error from localhost:9443 while dialing 10.49.253.13:10250, code 503: 503 Service Unavailable", 
"WARNING 2023/11/09 13:19:32 [bMmtv8Hu] Error opening log stream for pod ansible-automation-platform/automation-job-31741-lb82g. Will retry 4 more times. Error: Get \"https://10.49.253.13:10250/containerLogs/ansible-automation-platform/automation-job-31741-lb82g/worker?follow=true&sinceTime=0001-01-01T00%3A00%3A00Z&timestamps=true\": proxy error from localhost:9443 while dialing 10.49.253.13:10250, code 503: 503 Service Unavailable", 
"WARNING 2023/11/09 13:19:33 [bMmtv8Hu] Error opening log stream for pod ansible-automation-platform/automation-job-31741-lb82g. Will retry 3 more times. Error: Get \"https://10.49.253.13:10250/containerLogs/ansible-automation-platform/automation-job-31741-lb82g/worker?follow=true&sinceTime=0001-01-01T00%3A00%3A00Z&timestamps=true\": proxy error from localhost:9443 while dialing 10.49.253.13:10250, code 503: 503 Service Unavailable", 
"WARNING 2023/11/09 13:19:34 [bMmtv8Hu] Error opening log stream for pod ansible-automation-platform/automation-job-31741-lb82g. Will retry 2 more times. Error: Get \"https://10.49.253.13:10250/containerLogs/ansible-automation-platform/automation-job-31741-lb82g/worker?follow=true&sinceTime=0001-01-01T00%3A00%3A00Z&timestamps=true\": proxy error from localhost:9443 while dialing 10.49.253.13:10250, code 503: 503 Service Unavailable", 
"WARNING 2023/11/09 13:19:36 [bMmtv8Hu] Error opening log stream for pod ansible-automation-platform/automation-job-31741-lb82g. Will retry 1 more times. Error: Get \"https://10.49.253.13:10250/containerLogs/ansible-automation-platform/automation-job-31741-lb82g/worker?follow=true&sinceTime=0001-01-01T00%3A00%3A00Z&timestamps=true\": proxy error from localhost:9443 while dialing 10.49.253.13:10250, code 503: 503 Service Unavailable", 
"ERROR 2023/11/09 13:19:37 [bMmtv8Hu] Error opening log stream for pod ansible-automation-platform/automation-job-31741-lb82g. Error: Get \"https://10.49.253.13:10250/containerLogs/ansible-automation-platform/automation-job-31741-lb82g/worker?follow=true&sinceTime=0001-01-01T00%!A(MISSING)00%!A(MISSING)00Z&timestamps=true\": proxy error from localhost:9443 while dialing 10.49.253.13:10250, code 503: 503 Service Unavailable", 

in the awx-ee container

our speculated is that during the job execution the kube-apiserver ran into some problem and the log stream got disconnected, and when we try to reconnect we are not able to

what I’m confused about is the

https://10.49.253.13:10250/containerLogs/ansible-automation-platform/automation-job-31741-lb82g/worker?follow=true&sinceTime=0001-01-01T00%!A(MISSING)00%!A(MISSING)00Z&timestamps=true

specifically

sinceTime=0001-01-01T00%!A(MISSING)00%!A(MISSING)00Z

https://github.com/ansible/receptor/blob/devel/pkg/workceptor/kubernetes.go/#L249-L255
this is where

"INFO 2023/11/09 13:19:30 [bMmtv8Hu] Detected Error: EOF for pod ansible-automation-platform/automation-job-31741-lb82g. Will retry 5 more times.", 

occured

than it breaks out of this for loop
https://github.com/ansible/receptor/blob/devel/pkg/workceptor/kubernetes.go/#L238

and proceed back to the next iteration of
https://github.com/ansible/receptor/blob/devel/pkg/workceptor/kubernetes.go/#L204

it was able to successfully get the pod detail and got pass
https://github.com/ansible/receptor/blob/devel/pkg/workceptor/kubernetes.go/#L210

than it try to re-establish log stream to the job pod
https://github.com/ansible/receptor/blob/devel/pkg/workceptor/kubernetes.go/#L231

AND here’s the problem…

discarded the established and recorded sinceTime and started the getLog request with default value of time.Time

tossed up Fix K8S log reconnect timestamp tracking by TheRealHaoLiu · Pull Request #939 · ansible/receptor · GitHub to fix this bug

do we think this have something to do with the json parsing? it is possible but not definitive

I created a patched awx-ee quay.io/haoliu/awx-ee:receptor-939

If anyone that can reproduce this problem can help test this out I would be grateful.

1 Like