Job fails intermittently on AWX (Kubernetes)

Hello All,

Jobs run from AWX are failing intermittently, and there is not much information in the logs either.

I’m hoping someone here has seen this before and found a fix for it.

Here is my AWX setup:

I deployed AWX 9.0 on a Kubernetes cluster with an external PostgreSQL database, scaled it to two replicas, and mapped each replica to its own instance group:

awx-0 (instance group 1) & awx-1 (instance group 2)

When I submit a job to either of the instance groups, it fails intermittently. It looks like the job isn’t getting scheduled in a timely manner, so it fails to execute.

Please see the error logs from the celery container.

Failed job logs:

It looks like the issue is with the replicas: if I use only the single default tower instance group, the jobs run without any issues.

I’ve scheduled a job to run every hour on the default tower instance group and will share the results tomorrow.

I saw a couple of threads about the same issue but couldn’t find any solid solutions. I’m hoping someone can shed some light on this problem or suggest a fix.

Thanks,
Vibin

It looks like the problem is exactly as I suspected, i.e., scaling the replicas causes jobs to fail on both instances.

Has anyone had any luck finding the root cause and fixing it? If not, we’ll probably have to open a bug.

Regards,
Vibin

Has anyone had a chance to look at this issue, or faced a similar one?

My deployment is on hold because of this issue; we need the instance group functionality working before we can deploy.

I’d appreciate a response at your earliest convenience.

Regards,
Vibin

Hello All,

This has been fixed.

The issue was a RabbitMQ clustering failure caused by SELinux. Making the container privileged resolved it.
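
For anyone hitting the same problem, here is a rough sketch of what "making the container privileged" could look like as a patch. It is only an illustration under several assumptions: the post does not say which container was changed, so the container name "awx-rabbit" is a guess at the RabbitMQ container, and the helper name privileged_patch is made up for this example. The script just builds a Kubernetes strategic-merge patch that sets securityContext.privileged to true on one named container and prints it as JSON.

```python
import json


def privileged_patch(container_name: str) -> dict:
    """Build a strategic-merge patch that marks one container as privileged.

    Hypothetical helper for illustration; container_name must match the
    container's name in the pod template of your own deployment.
    """
    return {
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {
                            # Strategic-merge patches match containers by name.
                            "name": container_name,
                            "securityContext": {"privileged": True},
                        }
                    ]
                }
            }
        }
    }


if __name__ == "__main__":
    # "awx-rabbit" is an assumed container name, not taken from the thread.
    print(json.dumps(privileged_patch("awx-rabbit"), indent=2))
```

The printed JSON could then be passed to kubectl patch via its --patch option against whatever workload runs the AWX pods (the resource kind, name, and namespace depend on your install). Keep in mind that a privileged container bypasses most isolation, so a narrower SELinux policy change may be preferable where that is an option.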

Regards,
Vibin