Pod awx-operator-controller-manager crashing and restarting nonstop with a huge inventory

Hi you all!

I'm facing an issue with AWX when running parallel jobs (~25) against a huge inventory (1,700+ servers). At some point the pod awx-operator-controller-manager starts to crash and restart with status "CrashLoopBackOff", and digging into the logs I can see the messages below:

Warning Unhealthy 5m43s (x52 over 21h) kubelet Liveness probe failed: Get "[…]:6789/healthz": dial tcp connect: connection refused
Warning Unhealthy 5m36s (x266 over 21h) kubelet Readiness probe failed: Get "[…]:6789/readyz": dial tcp connect: connection refused
Warning BackOff 22s (x132 over 21h) kubelet Back-off restarting failed container awx-manager in pod awx-operator-controller-manager-665cf85468-9d9br_awx

It makes the automation jobs fail and take a bit longer than usual. Another thing I've noticed is that some servers end up with unreachable status because apparently AWX could not resolve their hostnames - which is weird, because when I run a template against the same server alone it completes successfully. I think it's related to the pod restarting during the process, maybe…

Does anyone here have a clue about what can be done to solve the problem?

Here are some more logs I collected:

→ kubectl -n awx logs -f awx-task-66675cbdf7-n47qq
E0328 15:08:48.804133 2808702 remote_runtime.go:432] "ContainerStatus from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" containerID="e9e8552948469ccc6e3e6f38a2280bad8652404365ffdd2635a5bdbb97cb9996"
FATA[0876] rpc error: code = DeadlineExceeded desc = context deadline exceeded

→ kubectl -n awx logs -f awx-operator-controller-manager-665cf85468-9d9br
E0328 17:50:53.639097 7 leaderelection.go:367] Failed to update lock: Put "": context deadline exceeded
I0328 17:50:53.639211 7 leaderelection.go:283] failed to renew lease awx/awx-operator: timed out waiting for the condition
{"level":"error","ts":"2024-03-28T17:50:53Z","logger":"cmd","msg":"Proxy or operator exited with error.","Namespace":"awx","error":"leader election lost"[…]

I’m running AWX - version 23.8.1 - under k3s.

Just to add more information here:

/usr/local/bin/kubectl -n awx get pod
NAME READY STATUS RESTARTS AGE
awx-postgres-13-0 1/1 Running 10 (21h ago) 31d
automation-job-1862-kdqcz 1/1 Running 0 71m
automation-job-1866-svttn 1/1 Running 0 71m
automation-job-1861-r7qkl 1/1 Running 0 70m
awx-web-5f8f49d58c-wqht9 3/3 Running 18 (21h ago) 28d
automation-job-1863-gbzfs 1/1 Running 0 70m
automation-job-1858-krf6p 1/1 Running 0 71m
automation-job-1864-2prk5 1/1 Running 0 70m
automation-job-1868-vksp4 1/1 Running 0 70m
automation-job-1865-9g9k8 1/1 Running 0 70m
awx-task-66675cbdf7-n47qq 4/4 Running 24 (21h ago) 28d
automation-job-1867-q9zxf 0/1 Error 0 70m
automation-job-1875-x6pz8 0/1 Error 0 70m
automation-job-1856-xhbdh 0/1 Error 0 70m
automation-job-1857-gc6k2 0/1 Error 0 70m
awx-operator-controller-manager-665cf85468-9d9br 2/2 Running 56 (18m ago) 25h
automation-job-1873-wqlwx 0/1 Error 0 69m
automation-job-1878-rb958 0/1 Error 0 70m
automation-job-1860-z4nxk 0/1 Error 0 70m
automation-job-1872-xlq58 0/1 Error 0 70m
automation-job-1870-8jf9t 0/1 Error 0 70m
automation-job-1871-9hmmh 0/1 Error 0 71m
automation-job-1854-55ssf 0/1 Error 0 71m
automation-job-1876-s5dkp 0/1 Error 0 70m
automation-job-1855-n6xqz 0/1 Error 0 70m
automation-job-1859-qfzcj 0/1 Error 0 69m
automation-job-1869-bg2qv 0/1 Error 0 69m
automation-job-1874-ngcmj 0/1 Error 0 69m
automation-job-1877-5r7s4 0/1 Error 0 70m

Are you using job slicing to execute the jobs in parallel, or are you triggering the same job ~25 times with different inventories/limits?

Hi @Denney-tech! I'm using job slicing and launching the template with the same inventory.

Okay. I was hoping maybe job slicing was a little less aggressive than that.

Anyways, I think you're running into a k8s resources problem. Either your jobs aren't getting provisioned with enough resources, or you're running more jobs at once than your k8s cluster can handle. In either case, you could scale up your resource allocations, but you might also benefit from limiting how many concurrent jobs and forks can run in your Instance Groups.

Setting concurrent jobs to 8 and forks to 40, for example, would allow you to slice the job into 100 slices (if you felt like it), and 8 jobs would run at a time against 40/154 hosts per batch. It might take a while to churn through them all, but it should be more reliable. You could even create a dedicated Instance Group and execution nodes just for large jobs like this.

You can tune this however you like, but I specifically chose 8 concurrent jobs since you only have 8 running job pods in your post.

Right, that makes sense! I'm just wondering - and sorry, because I'm not so familiar with this - how did you arrive at the correlation of 8 concurrent jobs to 40 forks? In fact, I'm setting 30 concurrent jobs in job slicing, so how can I work out the right correlation?

I chose 8 jobs because that's the number you had in a running state. I chose 40 forks because 5 forks per job is the default (so 8x5=40). The size of your slices will affect how much RAM/storage is needed per job, while the forks affect how much CPU is needed. So more slices means fewer resources needed per slice, and therefore you could raise the concurrent jobs or forks. I don't know of any formula for you to precisely calculate what you should go for.
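The arithmetic above can be sketched out in a few lines. This is just back-of-the-envelope sizing using the numbers from this thread (1,700 hosts, 8 concurrent jobs, AWX's default of 5 forks per job) - not a general formula:

```python
# Back-of-the-envelope sizing for sliced jobs.
# Numbers come from this thread; 5 forks per job is the AWX default.
inventory_hosts = 1700
concurrent_jobs = 8        # matched to the running job pods in the pod list
forks_per_job = 5          # AWX default
slices = 100

max_forks = concurrent_jobs * forks_per_job      # instance group fork ceiling
hosts_per_slice = -(-inventory_hosts // slices)  # ceiling division

print(f"max forks: {max_forks}")             # max forks: 40
print(f"hosts per slice: {hosts_per_slice}") # hosts per slice: 17
```

Raising `slices` shrinks each slice (less RAM per job pod), which gives you headroom to raise `concurrent_jobs` or `forks_per_job` instead.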

However, you could review how many forks are calculated by AWX for your instances:

and use the sum total to determine an upper limit in your instance groups.
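AWX exposes the computed per-instance capacity in the UI and via the API (`GET /api/v2/instances/`), so summing it can be scripted. A minimal sketch - the hostnames and capacity values below are made-up example data, not from this cluster:

```python
# Sketch: sum the per-instance "capacity" values AWX computes.
# Visible in the UI under Instances, or via GET /api/v2/instances/.
# This response dict is illustrative example data only.
instances_response = {
    "results": [
        {"hostname": "awx-task-66675cbdf7-n47qq", "capacity": 60},
        {"hostname": "worker-node-1", "capacity": 40},  # hypothetical node
    ]
}

total_capacity = sum(i["capacity"] for i in instances_response["results"])
print(total_capacity)  # 100 -> candidate upper bound for max forks
```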

Got it! So I'll try setting this up and running the template again with these new configurations. Hope it helps :wink: Anyway, I'll update this topic soon with the results. For now, I really appreciate your help, Caleb!

So I just left work and can't post screenshots or anything, but I think you'll need to modify the default instance group. That's technically a container group IIRC, so idk if it has the same settings.

That or you need to specify the Instance Group you want to use in the job template.

Hope this helps!

Hello! :slight_smile: I've set the parameters you mentioned and it really is more reliable. Now it's running little by little, but it's under control.
Of course I still have to better understand how forks and concurrent jobs work to set optimal values… however it is much better, no doubt!

Thanks very much, @Denney-tech!!!

Awesome! And it looks like Container Group type Instance Groups do let you specify the max jobs/forks, and they also let you adjust the resource specs for the pods they create:

Just another place you can tweak the resources used by AWX.
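For reference, that's done through the container group's custom pod spec. A minimal sketch of what it looks like - the image tag and the limit values here are examples to size against your own nodes, not recommendations:

```yaml
apiVersion: v1
kind: Pod
metadata:
  namespace: awx
spec:
  serviceAccountName: default
  automountServiceAccountToken: false
  containers:
    - image: quay.io/ansible/awx-ee:latest
      name: worker
      args: ["ansible-runner", "worker", "--private-data-dir=/runner"]
      resources:
        requests:
          cpu: 250m
          memory: 100Mi
        limits:        # example values; tune to your cluster
          cpu: "1"
          memory: 2Gi
```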

Yep, @Denney-tech is correct - from the operator log it seems like your kube-apiserver was on the struggle bus.

Thanks @Denney-tech