Jobs failing with ERROR with partial log

Hello,

We are seeing jobs fail at random with status ERROR and only a partial log. A failed job will eventually complete successfully if it is restarted enough times. The failures do not seem to depend on run time either: the same job can fail in under 5 minutes with a couple of lines in the log, or after more than an hour with hundreds of lines in the log.

AWX version is 21.8.0 running on AKS 1.22.15.

If anyone has encountered similar behavior and has any ideas, I would be thankful.

Can you provide the output of /api/v2/jobs/<job_id> for one of the failed jobs (with sensitive info removed)?

We can take a look and see if anything stands out.

AWX Team
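For anyone following along, here is a minimal sketch of one way to pull that job record and mask likely secrets before posting it. It assumes the instance accepts an OAuth2 bearer token. The host, job ID, and token in the usage comment are placeholders, and the list of keys to mask (extra_vars, job_env, job_args) is only a guess at where secrets tend to appear, so review the output by hand before sharing it.

```python
import json
import urllib.request

# Assumption: these job fields are the ones most likely to hold secrets.
# Extend the set to match what your own jobs actually contain.
SENSITIVE_KEYS = {"extra_vars", "job_env", "job_args"}


def redact(value, sensitive=SENSITIVE_KEYS, mask="<redacted>"):
    """Return a copy of value with sensitive keys masked.

    Recurses into nested dicts and lists; the input is not modified.
    """
    if isinstance(value, dict):
        return {
            key: mask if key in sensitive else redact(item, sensitive, mask)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item, sensitive, mask) for item in value]
    return value


def fetch_job(base_url, job_id, token):
    """GET /api/v2/jobs/<job_id>/ and return the decoded JSON body."""
    request = urllib.request.Request(
        f"{base_url.rstrip('/')}/api/v2/jobs/{job_id}/",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Usage (hypothetical host, job ID, and token; replace with your own):
#   job = fetch_job("https://awx.example.com", 1234, "REPLACE_ME")
#   print(json.dumps(redact(job), indent=2))
```

Fields such as status, job_explanation, and result_traceback are left untouched, since those are usually what helps with diagnosing an ERROR status.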

Hi,

Attached is the output of a failed job, as requested.

(attachments)

AWX.txt (86.2 KB)

I'm not sure if it's the same for you, but there was an issue with long-running jobs being terminated unexpectedly: https://github.com/ansible/awx/issues/11594

It was addressed here: https://github.com/ansible/receptor/pull/683

As Tian mentioned, you could be running into the timeout issues. These are resolved in the latest release. Would you mind upgrading to the latest AWX and retrying?

AWX Team

Unfortunately, I don't think it is related to the linked timeout issue, as jobs can fail within minutes.
As for upgrading, we have to discuss this with our client, as the final decision is theirs.