I’m facing an issue with AWX when executing parallel jobs (~25) against a huge inventory (1700+ servers). At some point the awx-operator-controller-manager pod starts to crash and restart with status “CrashLoopBackOff”, and digging into the logs I can see the messages below:
Warning Unhealthy 5m43s (x52 over 21h) kubelet Liveness probe failed: Get "[…]:6789/healthz": dial tcp 10.42.0.7:6789: connect: connection refused
Warning Unhealthy 5m36s (x266 over 21h) kubelet Readiness probe failed: Get "[…]:6789/readyz": dial tcp 10.42.0.7:6789: connect: connection refused
Warning BackOff 22s (x132 over 21h) kubelet Back-off restarting failed container awx-manager in pod awx-operator-controller-manager-665cf85468-9d9br_awx
It makes the automation jobs fail and take a bit longer than usual. Another thing I’ve noticed is that some servers end up with an unreachable status because apparently AWX could not resolve their hostnames - which is weird, because when I run a template against the same server alone it completes successfully. I think it’s related to the pod restarting during the process, maybe…
Anyone here have a clue about what can be done to solve the problem?
Okay. I was hoping maybe job slicing was a little less aggressive than that.
Anyways, I think you’re running into a k8s resource problem. Either your jobs aren’t getting provisioned with enough resources, or you’re running more jobs at once than your k8s cluster can handle. If either is the case, you could scale up your resource allocations, but you might also benefit from limiting how many concurrent jobs and forks can run in your Instance Groups.
Setting concurrent jobs to 8 and forks to 40, for example, would allow you to slice the job into 100 (if you felt like it), and 8 jobs would run at a time against 40/154 hosts per batch. It might take a while to churn through all of them, but it might be more reliable. You could even create a dedicated Instance Group and execution nodes just for large jobs like this.
You can tune this however you like, but I specifically chose 8 concurrent jobs since you only have 8 running job pods in your post.
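If it helps to see the arithmetic, here’s a rough back-of-the-envelope sketch in Python. The inputs are just illustrative (roughly your host count plus the settings I suggested), not anything pulled from your install:

```python
# Back-of-the-envelope math only - illustrative inputs, not values read from AWX.
total_hosts = 1700        # roughly the size of your inventory
job_slices = 100          # whatever slice count you set on the job template
concurrent_jobs = 8       # max concurrent jobs on the instance group
forks_per_job = 5         # AWX's default forks when the template doesn't override it

# Ceiling division: how many hosts each sliced job has to work through.
hosts_per_slice = -(-total_hosts // job_slices)

# How many Ansible forks can be running across the cluster at any one moment.
forks_in_flight = concurrent_jobs * forks_per_job

print(f"each sliced job targets about {hosts_per_slice} hosts")
print(f"at most {forks_in_flight} forks run at the same time")
```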
Right, that makes sense! I’m just wondering - and sorry about that, ‘cause I’m not so familiar with this - how did you end up with the correlation of 8 concurrent jobs to 40 forks? In fact I’m setting job slicing to 30, so how can I work out the right correlation for that?
I chose 8 jobs because that’s the number you had in a running state. I chose 40 forks because 5 forks per job is the default (so 8x5=40). The size of your slices will impact the amount of RAM/storage that might be needed per job, while the forks will impact how much CPU is needed. So a higher slice count means fewer resources are needed per slice, and therefore you could push the concurrent jobs or forks higher. I don’t know of any formula for you to precisely calculate what you should go for.
However, you could review how many forks AWX calculates for your instances.
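For example, here’s a rough, untested sketch of pulling that from the API with Python; the URL and token are placeholders for your environment:

```python
# Rough sketch (not an official AWX client): list each instance's computed capacity,
# which is what AWX uses to decide how many forks it can run.
import requests

AWX_URL = "https://awx.example.com"      # placeholder - your AWX URL
TOKEN = "REPLACE_WITH_AN_AWX_API_TOKEN"  # placeholder - an API token for your user

resp = requests.get(
    f"{AWX_URL}/api/v2/instances/",
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
resp.raise_for_status()

for inst in resp.json()["results"]:
    print(inst["hostname"], "capacity:", inst["capacity"],
          "consumed:", inst["consumed_capacity"])
```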
Got it! So I’ll try setting this up and run the template again with these new configurations. Hopefully it helps. Anyways, I’ll update this topic with the results soon. For now, I really appreciate your help, Caleb!
So I just left work and can’t post screenshots or anything, but I think you’ll need to modify the default instance group - though that’s technically a container group IIRC, so idk if it has the same settings.
That or you need to specify the Instance Group you want to use in the job template.
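If you go the second route, something like this rough sketch should associate a dedicated group with the template through the API (untested; the IDs, URL, and token are placeholders):

```python
# Rough sketch: attach an instance group to a job template so the template runs there
# instead of in the default group. All IDs and credentials below are placeholders.
import requests

AWX_URL = "https://awx.example.com"      # placeholder
TOKEN = "REPLACE_WITH_AN_AWX_API_TOKEN"  # placeholder
JOB_TEMPLATE_ID = 42                     # placeholder - the template's ID
INSTANCE_GROUP_ID = 7                    # placeholder - the dedicated group's ID

resp = requests.post(
    f"{AWX_URL}/api/v2/job_templates/{JOB_TEMPLATE_ID}/instance_groups/",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"id": INSTANCE_GROUP_ID},
    timeout=30,
)
resp.raise_for_status()
print("job template", JOB_TEMPLATE_ID, "now uses instance group", INSTANCE_GROUP_ID)
```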
Hello! I’ve set the parameters you mentioned and it really is more reliable. Now it’s running little by little, but it’s under control.
Of course I still have to better understand how forks and concurrent jobs work to set optimal values… but it’s much better, no doubt!
Awesome! And it looks like Container Group type Instance Groups do let you specify the max jobs/forks, and they also let you adjust the resource spec for the pods they create.
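For reference, here’s a rough, untested sketch of doing the same through the API instead of the UI. Everything in it (URL, token, group ID, image, resource numbers) is a placeholder, and in practice you’d start from the default pod spec AWX shows you and tweak that:

```python
# Rough sketch: cap a container group's concurrent jobs/forks and give the automation
# pods explicit resource requests/limits. Placeholders throughout - adjust for your setup.
import json
import requests

AWX_URL = "https://awx.example.com"      # placeholder
TOKEN = "REPLACE_WITH_AN_AWX_API_TOKEN"  # placeholder
INSTANCE_GROUP_ID = 7                    # placeholder - the container group's ID

# Trimmed-down pod spec; in practice, start from the default spec shown in the AWX UI.
pod_spec_override = {
    "apiVersion": "v1",
    "kind": "Pod",
    "spec": {
        "containers": [
            {
                "name": "worker",
                "image": "quay.io/ansible/awx-ee:latest",   # placeholder EE image
                "resources": {
                    "requests": {"cpu": "500m", "memory": "1Gi"},
                    "limits": {"cpu": "1", "memory": "2Gi"},
                },
            }
        ]
    },
}

resp = requests.patch(
    f"{AWX_URL}/api/v2/instance_groups/{INSTANCE_GROUP_ID}/",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "max_concurrent_jobs": 8,                             # values from earlier in the thread
        "max_forks": 40,
        "pod_spec_override": json.dumps(pod_spec_override),   # the API takes this field as text
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["name"], "updated")
```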