AWX marks long-running jobs as ERROR, though the job completes successfully

We have a long-running asynchronous job that takes about 5 hours. After the job completes, instead of marking it as successful, AWX marks it as ERROR. We hit this error on AWX version 19.4.0.

If anyone is aware of this issue, please help.

My experience with 19.1.0 is a bit different, but I have jobs that always stop at the 4-hour mark. The pod running the job is terminated and AWX marks the job as ERROR. Some research suggests this could be related to Kubernetes log rotation, but I’ve been unable to find a solution that gets any job to run longer than 4 hours.

Hey folks, the team is aware of this and we have an issue tracking it here: https://github.com/ansible/awx/issues/11451
It seems to be a Kubernetes-specific problem that we can’t resolve on our end, but there are some workarounds in that issue that may help you.

Thanks a lot for the message. Could you please share the workaround we can use to overcome this issue?

Have a look at the issues linked from the one Becca posted above. The fix depends on your distro.
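
Regardless of distro, the knobs involved are the kubelet’s container log rotation limits. As a rough sketch (not copied verbatim from the linked issues), this is what the settings look like in a KubeletConfiguration file; the values are only examples, and how you get the file onto your nodes depends on the distro:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Raise the size at which kubelet rotates a container's log, so a long
# job's output is not rotated away mid-run.
containerLogMaxSize: "500Mi"
containerLogMaxFiles: 2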

I can confirm that the fix works for me, but only on my on-prem k8s (and AKS). The problem I have is that on a managed platform you cannot change this.

Fortunately for me we also use Azure AKS, and there you can create a DaemonSet to change the settings.
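
For anyone who wants to go that route, below is a rough sketch of the DaemonSet idea, not a drop-in manifest: it assumes the node image keeps its kubelet flags in /etc/default/kubelet as an unquoted KUBELET_FLAGS= line and that kubelet runs under systemd, which you should verify on your own nodes first.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kubelet-log-rotation
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: kubelet-log-rotation
  template:
    metadata:
      labels:
        app: kubelet-log-rotation
    spec:
      hostPID: true                 # lets nsenter reach the host's PID 1
      tolerations:
        - operator: Exists          # run on every node, tainted or not
      containers:
        - name: set-log-limits
          image: alpine:3.19
          securityContext:
            privileged: true
          command:
            - /bin/sh
            - -c
            - |
              # On the host: append larger log-rotation limits to the kubelet
              # flags (once), restart kubelet, then idle so the pod stays up.
              nsenter -t 1 -m -u -i -n -p sh -c '
                if [ -f /etc/default/kubelet ] && ! grep -q container-log-max-size /etc/default/kubelet; then
                  sed -i "/^KUBELET_FLAGS=/ s/$/ --container-log-max-size=500Mi --container-log-max-files=2/" /etc/default/kubelet
                  systemctl restart kubelet
                fi
              '
              while true; do sleep 3600; done

A kubelet restart normally leaves running pods alone, but test this on a non-production node pool first.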

That being said: altering the k8s log rotation should not be the solution, only a workaround.

Should Receptor be changed to log to a file in the container that AWX can poll? I can imagine this requires a design change, so it’s best if a lead maintainer gives their two cents.

If this is a “won’t fix” because it’s a k8s issue, it would be nice if that were noted in the documentation.

Kind regards,

Thanks for confirming it works. Essentially, this is a long-standing Kubernetes issue, not an AWX one per se. Until that is fixed, this problem will persist and the workaround will need to be applied. We may look at the code and decide to make a change, but that currently isn’t on the radar. FYI, on OpenShift this isn’t a problem, as we use a different logging mechanism, so it doesn’t affect our enterprise customers (that’s the platform we support for AAP).

So your next question might be: why not just use the OpenShift codebase? Well, that might be a longer-term option, but it would need to be assessed and factored into development, and as I say, for now it’s unfortunately not on the radar.

Hello guys,
I already followed the issue you’re talking about, but in my scenario it does not help.
On K3s (single-node installation) I tried the settings below:
--kubelet-arg container-log-max-files=2
--kubelet-arg container-log-max-size=500Mi
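
(For anyone trying the same thing, the equivalent K3s config-file form would look roughly like this; /etc/rancher/k3s/config.yaml is the default K3s config path, and each kubelet-arg entry is passed straight through to kubelet as a flag.)

# /etc/rancher/k3s/config.yaml
kubelet-arg:
  - "container-log-max-files=2"
  - "container-log-max-size=500Mi"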

Some time ago I also replied to an issue in the AWX EE repo, because I thought there might be a problem with the runner idle timeout.
You can find my use case here: https://github.com/ansible/awx-ee/issues/80

My playbook (which includes a role) always shows as failed after 4 hours.
Thank you

Best,
Claudio