Pod awx-operator-controller-manager crashing and restarting nonstop with huge inventory

Okay. I was hoping maybe job slicing was a little less aggressive than that.

Anyway, I think you’re running into a k8s resource problem. Either your jobs aren’t getting provisioned with enough resources, or you’re running more jobs at once than your k8s cluster can handle. If either is the case, you could scale up your resource allocations, but you might also benefit from limiting how many concurrent jobs and forks can run in your Instance Groups.
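If you'd rather script this than click through the UI, here's a rough sketch of the PATCH body you'd send to the Instance Groups API (this assumes an AWX version whose instance groups expose `max_concurrent_jobs` and `max_forks`; `AWX_URL`, `TOKEN`, and the group id are placeholders, not values from your setup):

```python
import json

def build_limit_patch(max_concurrent_jobs, max_forks):
    """Body for PATCH /api/v2/instance_groups/<id>/ to cap concurrency."""
    return {
        "max_concurrent_jobs": max_concurrent_jobs,  # jobs this group runs at once
        "max_forks": max_forks,                      # total forks across those jobs
    }

payload = build_limit_patch(8, 40)
print(json.dumps(payload))

# To actually apply it (needs a reachable AWX and an OAuth token):
# import requests
# requests.patch(f"{AWX_URL}/api/v2/instance_groups/{group_id}/",
#                headers={"Authorization": f"Bearer {TOKEN}"},
#                json=payload)
```

The same two fields are editable on the Instance Group page in the UI if you'd rather not touch the API.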

Setting concurrent jobs to 8 and forks to 40, for example, would let you slice the job into 100 (if you felt like it), and 8 jobs would run at a time, each working 40 of its batch's 154 hosts at once. It might take a while to churn through these, but it should be more reliable. You could even create a dedicated Instance Group and execution nodes just for large jobs like this.
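To make that arithmetic concrete, here's a quick back-of-the-envelope sketch. The 15,400-host total is just inferred from 100 slices × 154 hosts per slice and may not match your real inventory:

```python
import math

slices = 100           # job slice count from the example above
hosts_per_slice = 154  # hosts each sliced job gets
forks = 40             # hosts one job works in parallel
concurrent_jobs = 8    # slices allowed to run at once

total_hosts = slices * hosts_per_slice                  # inferred, 15400
batches_per_slice = math.ceil(hosts_per_slice / forks)  # fork batches within one job
waves = math.ceil(slices / concurrent_jobs)             # waves of 8 jobs to drain the queue

print(total_hosts, batches_per_slice, waves)
```

So each sliced job chews through its 154 hosts in 4 fork-batches, and the queue drains in 13 waves of 8 jobs. Slow, but every wave only ever asks k8s for 8 job pods.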

You can tune this however you like, but I specifically chose 8 concurrent jobs since your post shows only 8 running job pods.