Pod awx-operator-controller-manager crashing and restarting nonstop with a huge inventory

Hi you all!

I'm facing an issue with AWX when running parallel jobs (~25) against a huge inventory (1,700+ servers). At some point the pod awx-operator-controller-manager starts to crash and restart with status "CrashLoopBackOff", and digging into the logs I can see the messages below:

Warning Unhealthy 5m43s (x52 over 21h) kubelet Liveness probe failed: Get "[…]:6789/healthz": dial tcp connect: connection refused
Warning Unhealthy 5m36s (x266 over 21h) kubelet Readiness probe failed: Get "[…]:6789/readyz": dial tcp connect: connection refused
Warning BackOff 22s (x132 over 21h) kubelet Back-off restarting failed container awx-manager in pod awx-operator-controller-manager-665cf85468-9d9br_awx

It makes the automation jobs fail and take a bit longer than usual. Another thing I've noticed is that some servers end up with unreachable status because apparently AWX could not resolve their hostnames - which is weird, because when I run a template against the same server alone it completes successfully. I think it's related to the pod restarting during the process, maybe…

Does anyone here have a clue about what can be done to solve the problem?

Here are some more logs I collected:

→ kubectl -n awx logs -f awx-task-66675cbdf7-n47qq
E0328 15:08:48.804133 2808702 remote_runtime.go:432] "ContainerStatus from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" containerID="e9e8552948469ccc6e3e6f38a2280bad8652404365ffdd2635a5bdbb97cb9996"
FATA[0876] rpc error: code = DeadlineExceeded desc = context deadline exceeded

→ kubectl -n awx logs -f awx-operator-controller-manager-665cf85468-9d9br
E0328 17:50:53.639097 7 leaderelection.go:367] Failed to update lock: Put "": context deadline exceeded
I0328 17:50:53.639211 7 leaderelection.go:283] failed to renew lease awx/awx-operator: timed out waiting for the condition
{"level":"error","ts":"2024-03-28T17:50:53Z","logger":"cmd","msg":"Proxy or operator exited with error.","Namespace":"awx","error":"leader election lost"[…]

I’m running AWX - version 23.8.1 - under k3s.

Just to add more information here:

/usr/local/bin/kubectl -n awx get pod
NAME READY STATUS RESTARTS AGE
awx-postgres-13-0 1/1 Running 10 (21h ago) 31d
automation-job-1862-kdqcz 1/1 Running 0 71m
automation-job-1866-svttn 1/1 Running 0 71m
automation-job-1861-r7qkl 1/1 Running 0 70m
awx-web-5f8f49d58c-wqht9 3/3 Running 18 (21h ago) 28d
automation-job-1863-gbzfs 1/1 Running 0 70m
automation-job-1858-krf6p 1/1 Running 0 71m
automation-job-1864-2prk5 1/1 Running 0 70m
automation-job-1868-vksp4 1/1 Running 0 70m
automation-job-1865-9g9k8 1/1 Running 0 70m
awx-task-66675cbdf7-n47qq 4/4 Running 24 (21h ago) 28d
automation-job-1867-q9zxf 0/1 Error 0 70m
automation-job-1875-x6pz8 0/1 Error 0 70m
automation-job-1856-xhbdh 0/1 Error 0 70m
automation-job-1857-gc6k2 0/1 Error 0 70m
awx-operator-controller-manager-665cf85468-9d9br 2/2 Running 56 (18m ago) 25h
automation-job-1873-wqlwx 0/1 Error 0 69m
automation-job-1878-rb958 0/1 Error 0 70m
automation-job-1860-z4nxk 0/1 Error 0 70m
automation-job-1872-xlq58 0/1 Error 0 70m
automation-job-1870-8jf9t 0/1 Error 0 70m
automation-job-1871-9hmmh 0/1 Error 0 71m
automation-job-1854-55ssf 0/1 Error 0 71m
automation-job-1876-s5dkp 0/1 Error 0 70m
automation-job-1855-n6xqz 0/1 Error 0 70m
automation-job-1859-qfzcj 0/1 Error 0 69m
automation-job-1869-bg2qv 0/1 Error 0 69m
automation-job-1874-ngcmj 0/1 Error 0 69m
automation-job-1877-5r7s4 0/1 Error 0 70m

Are you using job slicing to execute the jobs in parallel, or are you triggering the same job ~25 times with different inventories/limits?

Hi @Denney-tech! I'm using job slicing and launching the template with the same inventory.

Okay. I was hoping maybe job slicing was a little less aggressive than that.

Anyways, I think you're running into a k8s resources problem. Either your jobs aren't getting provisioned with enough resources, or you're running more jobs at once than your k8s cluster can handle. In either case, you could scale up your resource allocations, but you might also benefit from limiting how many concurrent jobs and forks can run in your Instance Groups.

Setting concurrent jobs to 8 and forks to 40, for example, would allow you to slice the job into 100 slices (if you felt like it), and 8 jobs would run at a time against 40/154 hosts per batch. It might take a while to churn through them all, but it should be more reliable. You could even create a dedicated Instance Group and execution nodes just for large jobs like this.

You can tune this however you like, but I specifically chose 8 concurrent jobs since you only have 8 running job pods in your post.

Right, that makes sense! I'm just wondering - and sorry, because I'm not so familiar with this - how did you arrive at the correlation of 8 concurrent jobs to 40 forks? In fact, I'm setting 30 concurrent jobs in job slicing, so how can I work out the right correlation?

I chose 8 jobs because that's the number you had in a running state. I chose 40 forks because 5 forks per job is the default (so 8x5=40). The size of your slices will affect how much RAM/storage is needed per job, while the forks affect how much CPU is needed. So more slices means fewer resources needed per slice, and therefore you could raise the concurrent jobs or forks. I don't know of any formula for you to precisely calculate what you should go for.
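The arithmetic above can be sketched out in a few lines. This is just back-of-the-envelope sizing using the numbers from this thread (1,700 hosts, 8 concurrent jobs, AWX's default of 5 forks per job) - not a general formula:

```python
# Back-of-the-envelope sizing for sliced jobs.
# Numbers come from this thread; 5 forks per job is the AWX default.
inventory_hosts = 1700
concurrent_jobs = 8        # matched to the running job pods in the pod list
forks_per_job = 5          # AWX default
slices = 100

max_forks = concurrent_jobs * forks_per_job      # instance group fork ceiling
hosts_per_slice = -(-inventory_hosts // slices)  # ceiling division

print(f"max forks: {max_forks}")             # max forks: 40
print(f"hosts per slice: {hosts_per_slice}") # hosts per slice: 17
```

Raising `slices` shrinks each slice (less RAM per job pod), which gives you headroom to raise `concurrent_jobs` or `forks_per_job` instead.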

However, you could review how many forks are calculated by AWX for your instances:

and use the sum total to determine an upper limit in your instance groups.
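AWX exposes the computed per-instance capacity in the UI and via the API (`GET /api/v2/instances/`), so summing it can be scripted. A minimal sketch - the hostnames and capacity values below are made-up example data, not from this cluster:

```python
# Sketch: sum the per-instance "capacity" values AWX computes.
# Visible in the UI under Instances, or via GET /api/v2/instances/.
# This response dict is illustrative example data only.
instances_response = {
    "results": [
        {"hostname": "awx-task-66675cbdf7-n47qq", "capacity": 60},
        {"hostname": "worker-node-1", "capacity": 40},  # hypothetical node
    ]
}

total_capacity = sum(i["capacity"] for i in instances_response["results"])
print(total_capacity)  # 100 -> candidate upper bound for max forks
```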

Got it! So I'll try setting this up and running the template again with these new configurations. Hope it helps :wink: Anyway, I'll update this topic soon with the results. For now, I really appreciate your help, Caleb!

So I just left work and can't post screenshots or anything, but I think you'll need to modify the default instance group. That's technically a container group IIRC, so idk if it has the same settings.

That or you need to specify the Instance Group you want to use in the job template.

Hope this helps!

Hello! :slight_smile: I've set the parameters you mentioned and it really is more reliable. Now it's running little by little, but it's under control.
Of course I still have to better understand how forks and concurrent jobs work to set optimal values… however it is much better, no doubt!

Thanks very much, @Denney-tech!!!

Awesome! And it looks like Container Group type Instance Groups do let you specify the max jobs/forks, and they also let you adjust the resource specs for the pods they create:

Just another place you can tweak the resources used by AWX.
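For reference, that's done through the container group's custom pod spec. A minimal sketch of what it looks like - the image tag and the limit values here are examples to size against your own nodes, not recommendations:

```yaml
apiVersion: v1
kind: Pod
metadata:
  namespace: awx
spec:
  serviceAccountName: default
  automountServiceAccountToken: false
  containers:
    - image: quay.io/ansible/awx-ee:latest
      name: worker
      args: ["ansible-runner", "worker", "--private-data-dir=/runner"]
      resources:
        requests:
          cpu: 250m
          memory: 100Mi
        limits:        # example values; tune to your cluster
          cpu: "1"
          memory: 2Gi
```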

Yep, @Denney-tech is correct - from the operator log it seems like your kube-apiserver was on the struggle bus.

Thanks @Denney-tech