AWX job terminated unexpectedly

@kurokobo Thank you for the reply.

For A), the “image_pull_policy”: I think I can temporarily modify the awx-operator spec and set it to “Always”, as described here:
https://docs.ansible.com/automation-controller/4.0.0/html/administration/operator_advanced_configurations.html#deploy-a-specific-version-of-awx

For B) any advice on how to remove the cached image “awx-ee:latest” from AKS?

@kurokobo for point B what is the best way to do this?

Thank you very much.

I don’t know which is the best, but there are some solutions:


Thank you @kurokobo for your reply. I will try one of these.

After this, is there a way to be sure that I’m using the latest image? Maybe something from the pod logs… I don’t know.

You can get the sha256 of the image from containerStatuses in the output of kubectl get pod <pod-name> -o yaml:

$ kubectl -n awx get pod awx-task-97b468984-n5tmf -o yaml
...
status:
  ...
  containerStatuses:
  - ...
    image: quay.io/ansible/awx-ee:latest
    imageID: quay.io/ansible/awx-ee@sha256:a4e53d8d95a90fea0c72a62b7c42eb15e52838841f43bf2dac64dc3fdc429f56
    ...
    name: awx-ee
    ...

The sha256 of the current latest image can be retrieved via https://quay.io/repository/ansible/awx-ee?tab=tags&tag=latest
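
To make that comparison less manual, here is a minimal Python sketch. It assumes kubectl is on the PATH and already pointed at the cluster; the helper names image_digests and fetch_pod are made up for illustration and are not part of AWX or kubectl. It reads the same status.containerStatuses[].imageID field shown above and returns the digest per container, which you can then compare by eye with the digest shown on the quay.io tags page.

```python
import json
import subprocess


def image_digests(pod: dict) -> dict:
    """Map container name -> digest taken from status.containerStatuses[].imageID."""
    digests = {}
    for status in pod.get("status", {}).get("containerStatuses", []):
        image_id = status.get("imageID", "")
        # imageID looks like "quay.io/ansible/awx-ee@sha256:<hex>";
        # keep only the part after the last "@".
        _, sep, digest = image_id.rpartition("@")
        digests[status["name"]] = digest if sep else ""
    return digests


def fetch_pod(namespace: str, pod_name: str) -> dict:
    """Return the pod object as a dict, using `kubectl get pod -o json`."""
    result = subprocess.run(
        ["kubectl", "-n", namespace, "get", "pod", pod_name, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

For the pod above, image_digests(fetch_pod("awx", "awx-task-97b468984-n5tmf")) would give one entry per container, including the awx-ee one. An empty string means the kubelet has not reported a digest for that container yet.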

The latest tag is updated every 12 hours, so the image you pulled will quickly stop being the latest and its digest will disappear from the list, but if it is up to date at this time, it will at least include the fixes for the issues.


I’m thinking of pinning the awx-ee image in the operator… to the versioned awx-ee image.

thoughts?

Of course it should be :smiley:

Related to:


It would be nice if it was more like the other *_image specs as well:

control_plane_ee_image: quay.io/ansible/awx-ee
control_plane_ee_image_version: 23.8.1
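
The separate *_version field above is only a wish. As far as I know, the tag has to go into control_plane_ee_image itself for now. A minimal sketch of building a merge patch for that, assuming the 23.8.1 tag from the example exists for awx-ee; pin_ee_patch is a hypothetical helper name, and the AWX resource name and namespace in the comment are placeholders:

```python
import json


def pin_ee_patch(image: str, version: str) -> str:
    """Return a JSON merge patch that sets spec.control_plane_ee_image to image:version."""
    if not version or version == "latest":
        # The point of pinning is to stop following a moving tag.
        raise ValueError("pin to an explicit version, not 'latest'")
    return json.dumps({"spec": {"control_plane_ee_image": f"{image}:{version}"}})


patch = pin_ee_patch("quay.io/ansible/awx-ee", "23.8.1")
print(patch)
# Apply it with something like (resource name and namespace are placeholders):
#   kubectl -n awx patch awx <awx-name> --type merge -p '<patch>'
```

Note this only covers the control plane EE. The image used by job pods comes from the Execution Environment selected in AWX, so that would need to be pinned separately.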

Thanks for clarifying!

I’ve checked the sha256 for “awx-ee:latest” and I think it’s fairly up to date.

By the way, the same issue with an unexpectedly terminated job happened again on AWX last night. So I think the latest changes to awx-ee:latest didn’t solve it.

This should be the error directly from the awx-task pod:

INFO 2024/03/26 00:10:30 Detected Error: EOF for pod awx/automation-job-182941-9cj2d. Will retry 5 more times.

WARNING 2024/03/26 00:10:30 Error opening log stream for pod awx/automation-job-182941-9cj2d. Will retry 5 more times. Error: Get "https://10.1.42.122:10250/containerLogs/awx/automation-job-182941-9cj2d/worker?follow=true&sinceTime=2024-03-26T00%3A10%3A28Z&timestamps=true": proxy error from localhost:9443 while dialing 10.1.42.122:10250, code 503: 503 Service Unavailable

WARNING 2024/03/26 00:10:31 Error opening log stream for pod awx/automation-job-182941-9cj2d. Will retry 4 more times. Error: Get "https://10.1.42.122:10250/containerLogs/awx/automation-job-182941-9cj2d/worker?follow=true&sinceTime=2024-03-26T00%3A10%3A28Z&timestamps=true": proxy error from localhost:9443 while dialing 10.1.42.122:10250, code 503: 503 Service Unavailable

WARNING 2024/03/26 00:10:32 Error opening log stream for pod awx/automation-job-182941-9cj2d. Will retry 3 more times. Error: Get "https://10.1.42.122:10250/containerLogs/awx/automation-job-182941-9cj2d/worker?follow=true&sinceTime=2024-03-26T00%3A10%3A28Z&timestamps=true": proxy error from localhost:9443 while dialing 10.1.42.122:10250, code 503: 503 Service Unavailable

WARNING 2024/03/26 00:10:33 Error opening log stream for pod awx/automation-job-182941-9cj2d. Will retry 2 more times. Error: Get "https://10.1.42.122:10250/containerLogs/awx/automation-job-182941-9cj2d/worker?follow=true&sinceTime=2024-03-26T00%3A10%3A28Z&timestamps=true": proxy error from localhost:9443 while dialing 10.1.42.122:10250, code 503: 503 Service Unavailable

WARNING 2024/03/26 00:10:34 Error opening log stream for pod awx/automation-job-182941-9cj2d. Will retry 1 more times. Error: Get "https://10.1.42.122:10250/containerLogs/awx/automation-job-182941-9cj2d/worker?follow=true&sinceTime=2024-03-26T00%3A10%3A28Z&timestamps=true": proxy error from localhost:9443 while dialing 10.1.42.122:10250, code 503: 503 Service Unavailable

ERROR 2024/03/26 00:10:35 Error opening log stream for pod awx/automation-job-182941-9cj2d. Error: Get "https://10.1.42.122:10250/containerLogs/awx/automation-job-182941-9cj2d/worker?follow=true&sinceTime=2024-03-26T00%!A(MISSING)10%!A(MISSING)28Z&timestamps=true": proxy error from localhost:9443 while dialing 10.1.42.122:10250, code 503: 503 Service Unavailable

WARNING 2024/03/26 00:10:36 Could not read in control service: read unix /var/run/receptor/receptor.sock->@: use of closed network connection

WARNING 2024/03/26 00:10:36 Could not close connection: close unix /var/run/receptor/receptor.sock->@: use of closed network connection

ERROR 2024/03/26 00:10:36 Error deleting pod automation-job-182941-9cj2d: client rate limiter Wait returned an error: context canceled

If there is any other check I can do, let me know. Thank you very much for your support.

Hello,
it happened again last night, but I think with a different message… I’ll try to share it with you, @kurokobo and @TheRealHaoLiu.

Failed to JSON parse a line from worker stream. Error: Expecting value: line 1 column 1 (char 0) Line with invalid JSON data: b''

In the awx-task pod I found this error:

2024-04-16T00:13:44+02:00 2024-04-15 22:13:44,409 ERROR    [6a992cbb5ae249be87f608f6d08258ba] awx.main.dispatch Worker failed to run task awx.main.tasks.system.purge_old_stdout_files(*[], **{}
2024-04-16T00:13:44+02:00 Traceback (most recent call last):
2024-04-16T00:13:44+02:00   File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/dispatch/worker/task.py", line 103, in perform_work
2024-04-16T00:13:44+02:00     result = self.run_callable(body)
2024-04-16T00:13:44+02:00   File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/dispatch/worker/task.py", line 78, in run_callable
2024-04-16T00:13:44+02:00     return _call(*args, **kwargs)
2024-04-16T00:13:44+02:00   File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/main/tasks/system.py", line 378, in purge_old_stdout_files
2024-04-16T00:13:44+02:00     for f in os.listdir(settings.JOBOUTPUT_ROOT):
2024-04-16T00:13:44+02:00 FileNotFoundError: [Errno 2] No such file or directory: '/var/lib/awx/job_status'

Any ideas? Thank you very much for your help.