Failed to JSON parse a line from worker stream. Error: Expecting value: line 1 column 1 (char 0) Line with invalid JSON data: b'
this error message likely comes from src/ansible_runner/streaming.py in ansible/ansible-runner at commit 874f5cbc3296e0b305812b3fdb867d426fa7a759
this happens in the ansible-runner process
for example, in our investigation of an AKS-based deployment, in one case we see
"INFO 2023/11/09 13:19:30 [bMmtv8Hu] Detected Error: EOF for pod ansible-automation-platform/automation-job-31741-lb82g. Will retry 5 more times.",
"WARNING 2023/11/09 13:19:30 [bMmtv8Hu] Error opening log stream for pod ansible-automation-platform/automation-job-31741-lb82g. Will retry 5 more times. Error: Get \"https://10.49.253.13:10250/containerLogs/ansible-automation-platform/automation-job-31741-lb82g/worker?follow=true&sinceTime=0001-01-01T00%3A00%3A00Z×tamps=true\": proxy error from localhost:9443 while dialing 10.49.253.13:10250, code 503: 503 Service Unavailable",
"WARNING 2023/11/09 13:19:32 [bMmtv8Hu] Error opening log stream for pod ansible-automation-platform/automation-job-31741-lb82g. Will retry 4 more times. Error: Get \"https://10.49.253.13:10250/containerLogs/ansible-automation-platform/automation-job-31741-lb82g/worker?follow=true&sinceTime=0001-01-01T00%3A00%3A00Z×tamps=true\": proxy error from localhost:9443 while dialing 10.49.253.13:10250, code 503: 503 Service Unavailable",
"WARNING 2023/11/09 13:19:33 [bMmtv8Hu] Error opening log stream for pod ansible-automation-platform/automation-job-31741-lb82g. Will retry 3 more times. Error: Get \"https://10.49.253.13:10250/containerLogs/ansible-automation-platform/automation-job-31741-lb82g/worker?follow=true&sinceTime=0001-01-01T00%3A00%3A00Z×tamps=true\": proxy error from localhost:9443 while dialing 10.49.253.13:10250, code 503: 503 Service Unavailable",
"WARNING 2023/11/09 13:19:34 [bMmtv8Hu] Error opening log stream for pod ansible-automation-platform/automation-job-31741-lb82g. Will retry 2 more times. Error: Get \"https://10.49.253.13:10250/containerLogs/ansible-automation-platform/automation-job-31741-lb82g/worker?follow=true&sinceTime=0001-01-01T00%3A00%3A00Z×tamps=true\": proxy error from localhost:9443 while dialing 10.49.253.13:10250, code 503: 503 Service Unavailable",
"WARNING 2023/11/09 13:19:36 [bMmtv8Hu] Error opening log stream for pod ansible-automation-platform/automation-job-31741-lb82g. Will retry 1 more times. Error: Get \"https://10.49.253.13:10250/containerLogs/ansible-automation-platform/automation-job-31741-lb82g/worker?follow=true&sinceTime=0001-01-01T00%3A00%3A00Z×tamps=true\": proxy error from localhost:9443 while dialing 10.49.253.13:10250, code 503: 503 Service Unavailable",
"ERROR 2023/11/09 13:19:37 [bMmtv8Hu] Error opening log stream for pod ansible-automation-platform/automation-job-31741-lb82g. Error: Get \"https://10.49.253.13:10250/containerLogs/ansible-automation-platform/automation-job-31741-lb82g/worker?follow=true&sinceTime=0001-01-01T00%!A(MISSING)00%!A(MISSING)00Z×tamps=true\": proxy error from localhost:9443 while dialing 10.49.253.13:10250, code 503: 503 Service Unavailable",
in the awx-ee container
our speculation is that during job execution the kube-apiserver ran into some problem and the log stream got disconnected, and when we tried to reconnect we were unable to
what I’m confused about is the
https://10.49.253.13:10250/containerLogs/ansible-automation-platform/automation-job-31741-lb82g/worker?follow=true&sinceTime=0001-01-01T00%!A(MISSING)00%!A(MISSING)00Z&timestamps=true
specifically
sinceTime=0001-01-01T00%!A(MISSING)00%!A(MISSING)00Z
https://github.com/ansible/receptor/blob/devel/pkg/workceptor/kubernetes.go/#L249-L255
this is where
"INFO 2023/11/09 13:19:30 [bMmtv8Hu] Detected Error: EOF for pod ansible-automation-platform/automation-job-31741-lb82g. Will retry 5 more times.",
occurred
then it breaks out of this for loop
https://github.com/ansible/receptor/blob/devel/pkg/workceptor/kubernetes.go/#L238
and proceeds to the next iteration of
https://github.com/ansible/receptor/blob/devel/pkg/workceptor/kubernetes.go/#L204
it was able to successfully get the pod details and got past
https://github.com/ansible/receptor/blob/devel/pkg/workceptor/kubernetes.go/#L210
then it tries to re-establish the log stream to the job pod
https://github.com/ansible/receptor/blob/devel/pkg/workceptor/kubernetes.go/#L231
AND here’s the problem…
the previously established and recorded sinceTime gets discarded, and the GetLogs request starts over with the default (zero) value of time.Time
tossed up "Fix K8S log reconnect timestamp tracking" (ansible/receptor PR #939, by TheRealHaoLiu) to fix this bug
do we think this has something to do with the JSON parsing? it's possible but not definitive
I created a patched awx-ee image: quay.io/haoliu/awx-ee:receptor-939
if anyone who can reproduce this problem can help test it out, I would be grateful