We are experiencing an issue where jobs randomly fail with status code ERROR and only partial logs. If you restart a failed job enough times, it eventually completes successfully. The failure also does not seem to depend on how long the job has been running: the same job can fail in less than 5 minutes with a couple of lines in the log, or after more than an hour with hundreds of lines in the log.
AWX version is 21.8.0 running on AKS 1.22.15.
If anyone has encountered similar behavior and has any ideas, I would be thankful.
Unfortunately, I don’t think it is related to the linked timeout issue, as jobs can fail within minutes.
As for upgrading, we have to discuss this with our client, as the final decision is theirs.