Pod awx-operator-controller-manager crashing and restarting nonstop with huge inventory

Okay. I was hoping maybe job slicing was a little less aggressive than that.

Anyway, I think you’re running into a k8s resource problem. Either your jobs aren’t getting provisioned with enough resources, or you’re running more jobs at once than your k8s cluster can handle. If either is the case, you could scale up your resource allocations, but you might also benefit from limiting how many concurrent jobs and forks can run in your Instance Groups.
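If you'd rather script this than click through the UI, here's a rough sketch of the PATCH body you'd send to the Instance Groups API (this assumes an AWX version whose instance groups expose `max_concurrent_jobs` and `max_forks`; `AWX_URL`, `TOKEN`, and the group id are placeholders, not values from your setup):

```python
import json

def build_limit_patch(max_concurrent_jobs, max_forks):
    """Body for PATCH /api/v2/instance_groups/<id>/ to cap concurrency."""
    return {
        "max_concurrent_jobs": max_concurrent_jobs,  # jobs this group runs at once
        "max_forks": max_forks,                      # total forks across those jobs
    }

payload = build_limit_patch(8, 40)
print(json.dumps(payload))

# To actually apply it (needs a reachable AWX and an OAuth token):
# import requests
# requests.patch(f"{AWX_URL}/api/v2/instance_groups/{group_id}/",
#                headers={"Authorization": f"Bearer {TOKEN}"},
#                json=payload)
```

The same two fields are editable on the Instance Group page in the UI if you'd rather not touch the API.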

Setting concurrent jobs to 8 and forks to 40, for example, would let you slice the job into 100 (if you felt like it), and 8 jobs would run at a time, each working 40 of its batch's 154 hosts at once. It might take a while to churn through these, but it should be more reliable. You could even create a dedicated Instance Group and execution nodes just for large jobs like this.
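To make that arithmetic concrete, here's a quick back-of-the-envelope sketch. The 15,400-host total is just inferred from 100 slices × 154 hosts per slice and may not match your real inventory:

```python
import math

slices = 100           # job slice count from the example above
hosts_per_slice = 154  # hosts each sliced job gets
forks = 40             # hosts one job works in parallel
concurrent_jobs = 8    # slices allowed to run at once

total_hosts = slices * hosts_per_slice                  # inferred, 15400
batches_per_slice = math.ceil(hosts_per_slice / forks)  # fork batches within one job
waves = math.ceil(slices / concurrent_jobs)             # waves of 8 jobs to drain the queue

print(total_hosts, batches_per_slice, waves)
```

So each sliced job chews through its 154 hosts in 4 fork-batches, and the queue drains in 13 waves of 8 jobs. Slow, but every wave only ever asks k8s for 8 job pods.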

You can tune this however you like, but I specifically chose 8 concurrent jobs since your post shows only 8 running job pods.