I don’t know which is best, but there are a couple of options:
Deploy a dummy pod that uses awx-ee:latest with the Always pull policy on each node (this does not remove the cached image, but refreshes it by forcing a re-pull)
Connect to the AKS nodes and use crictl to remove the cached images
The latest tag is updated every 12 hours, so a freshly pulled image will soon stop being the latest and will disappear from the list, but if it is up to date right now it will at least include the fixes for this issue.
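The first option can be sketched as a DaemonSet, which runs one pod per node and so forces every node to re-pull the image. This is only a rough sketch: the image path quay.io/ansible/awx-ee:latest, the awx namespace, the awx-ee-prepull name and the sleep command are assumptions, so adjust them to your cluster. The script prints a JSON manifest that kubectl can apply.

```python
import json


def build_prepull_daemonset(
    image="quay.io/ansible/awx-ee:latest",  # assumed registry path for awx-ee
    namespace="awx",                        # assumed namespace
    name="awx-ee-prepull",                  # hypothetical resource name
):
    """Build a DaemonSet manifest that re-pulls `image` on every node."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "DaemonSet",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": "prepull",
                            "image": image,
                            # Always: the kubelet checks the registry each time the pod starts
                            "imagePullPolicy": "Always",
                            # Keep the container idle; assumes the image ships `sleep`
                            "command": ["sleep", "infinity"],
                        }
                    ]
                },
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_prepull_daemonset(), indent=2))
```

Something like `python prepull.py | kubectl apply -f -` would create it (prepull.py being whatever you name the script). The pull only happens when a pod starts, so the DaemonSet pods have to be restarted to refresh the image again later.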
By the way, the same issue happened again last night: a job on AWX terminated unexpectedly. So I think the latest changes to awx-ee:latest didn’t solve it.
This should be the corresponding error, taken directly from the awx-task pod:
INFO 2024/03/26 00:10:30 Detected Error: EOF for pod awx/automation-job-182941-9cj2d. Will retry 5 more times.
WARNING 2024/03/26 00:10:30 Error opening log stream for pod awx/automation-job-182941-9cj2d. Will retry 5 more times. Error: Get "https://10.1.42.122:10250/containerLogs/awx/automation-job-182941-9cj2d/worker?follow=true&sinceTime=2024-03-26T00%3A10%3A28Z&timestamps=true": proxy error from localhost:9443 while dialing 10.1.42.122:10250, code 503: 503 Service Unavailable
WARNING 2024/03/26 00:10:31 Error opening log stream for pod awx/automation-job-182941-9cj2d. Will retry 4 more times. Error: Get "https://10.1.42.122:10250/containerLogs/awx/automation-job-182941-9cj2d/worker?follow=true&sinceTime=2024-03-26T00%3A10%3A28Z&timestamps=true": proxy error from localhost:9443 while dialing 10.1.42.122:10250, code 503: 503 Service Unavailable
WARNING 2024/03/26 00:10:32 Error opening log stream for pod awx/automation-job-182941-9cj2d. Will retry 3 more times. Error: Get "https://10.1.42.122:10250/containerLogs/awx/automation-job-182941-9cj2d/worker?follow=true&sinceTime=2024-03-26T00%3A10%3A28Z&timestamps=true": proxy error from localhost:9443 while dialing 10.1.42.122:10250, code 503: 503 Service Unavailable
WARNING 2024/03/26 00:10:33 Error opening log stream for pod awx/automation-job-182941-9cj2d. Will retry 2 more times. Error: Get "https://10.1.42.122:10250/containerLogs/awx/automation-job-182941-9cj2d/worker?follow=true&sinceTime=2024-03-26T00%3A10%3A28Z&timestamps=true": proxy error from localhost:9443 while dialing 10.1.42.122:10250, code 503: 503 Service Unavailable
WARNING 2024/03/26 00:10:34 Error opening log stream for pod awx/automation-job-182941-9cj2d. Will retry 1 more times. Error: Get "https://10.1.42.122:10250/containerLogs/awx/automation-job-182941-9cj2d/worker?follow=true&sinceTime=2024-03-26T00%3A10%3A28Z&timestamps=true": proxy error from localhost:9443 while dialing 10.1.42.122:10250, code 503: 503 Service Unavailable
ERROR 2024/03/26 00:10:35 Error opening log stream for pod awx/automation-job-182941-9cj2d. Error: Get "https://10.1.42.122:10250/containerLogs/awx/automation-job-182941-9cj2d/worker?follow=true&sinceTime=2024-03-26T00%!A(MISSING)10%!A(MISSING)28Z&timestamps=true": proxy error from localhost:9443 while dialing 10.1.42.122:10250, code 503: 503 Service Unavailable
WARNING 2024/03/26 00:10:36 Could not read in control service: read unix /var/run/receptor/receptor.sock->@: use of closed network connection
WARNING 2024/03/26 00:10:36 Could not close connection: close unix /var/run/receptor/receptor.sock->@: use of closed network connection
ERROR 2024/03/26 00:10:36 Error deleting pod automation-job-182941-9cj2d: client rate limiter Wait returned an error: context canceled
If there is any other check I can do, let me know. Thank you very much for your support.
Failed to JSON parse a line from worker stream. Error: Expecting value: line 1 column 1 (char 0) Line with invalid JSON data: b''
In the awx-task pod I’ve found this error:
2024-04-16T00:13:44+02:00 2024-04-15 22:13:44,409 ERROR [6a992cbb5ae249be87f608f6d08258ba] awx.main.dispatch Worker failed to run task awx.main.tasks.system.purge_old_stdout_files(*[], **{}
2024-04-16T00:13:44+02:00 Traceback (most recent call last):
2024-04-16T00:13:44+02:00 File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/dispatch/worker/task.py", line 103, in perform_work
2024-04-16T00:13:44+02:00 result = self.run_callable(body)
2024-04-16T00:13:44+02:00 File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/dispatch/worker/task.py", line 78, in run_callable
2024-04-16T00:13:44+02:00 return _call(*args, **kwargs)
2024-04-16T00:13:44+02:00 File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/system.py", line 378, in purge_old_stdout_files
2024-04-16T00:13:44+02:00 for f in os.listdir(settings.JOBOUTPUT_ROOT):
2024-04-16T00:13:44+02:00 FileNotFoundError: [Errno 2] No such file or directory: '/var/lib/awx/job_status'
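The traceback above comes down to os.listdir being called on a directory that does not exist. Here is a minimal sketch of that failure mode, using a temporary directory as a stand-in for /var/lib/awx/job_status. Note that list_job_output is a hypothetical helper for illustration, not AWX code, and whether this error is related to the terminated jobs is not established here.

```python
import os
import tempfile


def list_job_output(root):
    """Hypothetical helper: list `root`, treating a missing directory as empty."""
    try:
        return os.listdir(root)
    except FileNotFoundError:
        # Same exception type as in the traceback: the directory is absent.
        return []


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = os.path.join(tmp, "job_status")  # stand-in path, not created yet
        print(list_job_output(root))  # directory missing
        os.makedirs(root)
        print(list_job_output(root))  # directory present but empty
```

So it may be worth checking whether that directory exists inside the awx-task container, and whether a volume is expected to be mounted there.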