Job fails intermittently on AWX (Kubernetes)

Vibin · December 20, 2019, 3:56pm

Hello All,

Jobs run from AWX are failing intermittently, and there is not much information in the logs either.

Hoping someone here would’ve seen it earlier and found a fix for it.

Let me describe my AWX setup:

Deployed AWX (9.0) on a Kubernetes cluster with an external Postgres DB. Scaled to two replicas and mapped them to instance groups.

awx-0 (instance group 1) & awx-1 (instance group 2)

When I submit a job to either of the instance groups, it fails intermittently. It looks like the job isn’t getting scheduled in a timely manner and fails to execute.

Please see the error logs from the celery container.

Failed job logs:

Vibin · December 20, 2019, 4:20pm

Looks like the issue is with the replicas; if I have a single/default tower instance group, the jobs run without any issues.

I’ve scheduled a job to run every hour on the default tower instance group and will share the results tomorrow.

I saw a couple of threads with the same issue but couldn’t find any solid solutions. Hoping for someone to shed some light on this problem or a fix for it.

Thanks,
Vibin

Vibin · December 21, 2019, 4:29pm

Looks like the problem is exactly as expected, i.e. scaling replicas causes the jobs to fail on both instances.

Has anyone had any luck finding the root cause and fixing it? If not, we’ll probably have to open a bug.

Regards,
Vibin

Vibin · January 8, 2020, 12:37pm

Has anyone had a chance to look at this issue, or faced similar issues in the past or present?

My deployment is pending because of this issue; we need to have the instance group functionality working before deployment.

I’d appreciate a response at your earliest convenience.

Regards,
Vibin

Vibin · January 10, 2020, 9:53am

Hello All,

This has been fixed.

The issue was a RabbitMQ clustering failure caused by SELinux. I made the container privileged and the issue was resolved.
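
For anyone wanting to try the same change, here is a minimal sketch of what "privileged" could look like as a patch. It only builds and prints the JSON; it does not talk to the cluster. The container name `awx-rabbit` is a placeholder I'm assuming for illustration, so check the actual container name in your own deployment. The `securityContext.privileged` field itself is the standard Kubernetes container setting.

```python
import json


def privileged_patch(container_name: str) -> dict:
    """Build a strategic-merge patch that sets privileged: true on one container.

    container_name must match the name of the container in the pod template
    (hypothetical here; it is not taken from the original post).
    """
    return {
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {
                            "name": container_name,
                            # Standard Kubernetes field: runs the container privileged.
                            "securityContext": {"privileged": True},
                        }
                    ]
                }
            }
        }
    }


if __name__ == "__main__":
    # "awx-rabbit" is an assumed placeholder name, not a confirmed one.
    print(json.dumps(privileged_patch("awx-rabbit"), indent=2))
```

The printed JSON could then be applied with `kubectl patch` using the `-p` option against whichever workload runs the RabbitMQ container. Keep in mind that a privileged container has much broader access to the host than SELinux would normally allow.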

Regards,
Vibin
