AWX marks long-running jobs as ERROR, though the job completes successfully

We have a long-running asynchronous job that takes about 5 hours. After the job completes, instead of marking it as successful, AWX marks it as ERROR. We hit this error on AWX version 19.4.0.

If anyone is aware of this issue, please help.

My experience with 19.1.0 is a bit different, but I have jobs that always stop at the 4-hour mark. The pod running the job is terminated and AWX marks the job as ERROR. Some research suggests this could be related to Kubernetes log rotation, but I’ve been unable to find a solution that gets any job to run longer than 4 hours.

Hey folks, the team is aware of this and we have an issue tracking it here: https://github.com/ansible/awx/issues/11451
It seems to be a Kubernetes-specific problem that we can’t resolve on our end, but there are some workarounds in that issue that may help you.

Thanks a lot for the message. Could you please share the workaround we can use to overcome this issue?

Have a look at the issues linked from the one Becca posted above. The fix depends on your distro.
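
Regardless of distro, the knobs involved are the kubelet’s container log rotation limits. As a rough sketch (not copied verbatim from the linked issues), this is what the settings look like in a KubeletConfiguration file; the values are only examples, and how you get the file onto your nodes depends on the distro:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Raise the size at which kubelet rotates a container's log, so a long
# job's output is not rotated away mid-run.
containerLogMaxSize: "500Mi"
containerLogMaxFiles: 2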

I can confirm that the fix works for me, but only on my on-prem k8s (and AKS). The problem I have is that on a managed platform you cannot change this.

Fortunately for me we also use Azure AKS, and there you can create a DaemonSet to change the settings.
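
For anyone who wants to go that route, below is a rough sketch of the DaemonSet idea, not a drop-in manifest: it assumes the node image keeps its kubelet flags in /etc/default/kubelet as an unquoted KUBELET_FLAGS= line and that kubelet runs under systemd, which you should verify on your own nodes first.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kubelet-log-rotation
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: kubelet-log-rotation
  template:
    metadata:
      labels:
        app: kubelet-log-rotation
    spec:
      hostPID: true                 # lets nsenter reach the host's PID 1
      tolerations:
        - operator: Exists          # run on every node, tainted or not
      containers:
        - name: set-log-limits
          image: alpine:3.19
          securityContext:
            privileged: true
          command:
            - /bin/sh
            - -c
            - |
              # On the host: append larger log-rotation limits to the kubelet
              # flags (once), restart kubelet, then idle so the pod stays up.
              nsenter -t 1 -m -u -i -n -p sh -c '
                if [ -f /etc/default/kubelet ] && ! grep -q container-log-max-size /etc/default/kubelet; then
                  sed -i "/^KUBELET_FLAGS=/ s/$/ --container-log-max-size=500Mi --container-log-max-files=2/" /etc/default/kubelet
                  systemctl restart kubelet
                fi
              '
              while true; do sleep 3600; done

A kubelet restart normally leaves running pods alone, but test this on a non-production node pool first.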

That being said: altering the k8s log rotation should not be the solution, only a workaround.

Should Receptor be changed to log to a file in the container that AWX can poll? I can imagine this requires a design change, so it’s best if a lead maintainer gives their two cents.

If this is a “won’t fix” because it’s a k8s issue, it would be nice if that were noted in the documentation.

Kind regards,

Thanks for confirming it works. Essentially, this is a long-standing Kubernetes issue, not an AWX one per se. Until that is fixed, this problem will persist and the workaround will need to be applied. We may look at the code and decide to make a change, but that currently isn’t on the radar. FYI, on OpenShift this isn’t a problem, as we use a different logging mechanism, so it doesn’t affect our enterprise customers (that’s the platform we support for AAP).

So your next question might be: why not just use the OpenShift codebase? Well, that might be a longer-term option, but it would need to be assessed and factored into development, and as I say, for now it’s unfortunately not on the radar.

Hello guys,
I already followed the issue you’re talking about, but in my scenario it does not help.
On K3s (single-node installation) I tried the settings below:
--kubelet-arg container-log-max-files=2
--kubelet-arg container-log-max-size=500Mi
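
(For anyone trying the same thing, the equivalent K3s config-file form would look roughly like this; /etc/rancher/k3s/config.yaml is the default K3s config path, and each kubelet-arg entry is passed straight through to kubelet as a flag.)

# /etc/rancher/k3s/config.yaml
kubelet-arg:
  - "container-log-max-files=2"
  - "container-log-max-size=500Mi"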

Some time ago I also replied to an issue in the AWX EE repo, because I thought there might be a problem with the runner idle timeout.
You can find my use case here: https://github.com/ansible/awx-ee/issues/80

My playbook (which includes a role) always shows as failed after 4 hours.
Thank you

Best,
Claudio