AWX 24.6.1
Two separate k8s clusters, one for my AWX instance group and one for my container group. Currently running 1 web and 1 task container for testing. The container group is 10 on-prem k8s nodes with plenty of dedicated CPU and memory. AWX reports at most 8% capacity used on my instance group, and each node should be able to run 100+ jobs without issue according to the math in the white paper. I created a test job template that pauses for 60 minutes, and I can run 50 jobs at a time (the limit in my config) as long as the jobs run long enough to overlap.
I have a basic job template that takes 30 seconds to run, and when running a report I may make ~500 API calls to AWX to launch it. AWX will run all 500 jobs, but it only starts them 1 at a time, at about 1 job every 30 seconds. The result is 450 pending jobs and 50 marked as running, but only 1-5 of them are actually running at any given time. So instead of burning through 500 jobs in a few minutes, it generally takes 3-4 hours to complete all of the queued jobs, much as if they had been queued to run one after another.
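To put rough numbers on the gap, here is a back-of-the-envelope sketch in Python. The function names are just illustrative, and the "batched" figure assumes ideal waves of 50 jobs with no pod startup or scheduling overhead, which is optimistic.

```python
import math


def serial_start_minutes(total_jobs, seconds_between_starts):
    """Total time when jobs are started one at a time at a fixed interval."""
    return total_jobs * seconds_between_starts / 60


def batched_minutes(total_jobs, concurrency, job_seconds):
    """Idealized total time when `concurrency` jobs run side by side in waves.

    Assumption: no pod startup or scheduling overhead between waves.
    """
    waves = math.ceil(total_jobs / concurrency)
    return waves * job_seconds / 60


observed = serial_start_minutes(500, 30)   # one start every ~30 seconds
expected = batched_minutes(500, 50, 30)    # 50 at a time, 30 second jobs

print(f"one start per 30 s: {observed:.0f} min (~{observed / 60:.1f} h)")
print(f"50 at a time:       {expected:.0f} min")
```

One start every 30 seconds works out to about 250 minutes, which is in the same ballpark as the 3-4 hours I see, versus roughly 5 minutes if 50 jobs really ran at once.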
Is this a limitation of my k8s cluster or some setting in AWX? I would expect AWX to fire off the API calls to my k8s cluster in batches and be done with it, but AWX appears to queue jobs one at a time when sending them to the container group. With my 30 second test job the most I’ve ever seen running simultaneously (via kubectl) is ~5 jobs, and there isn’t anything “waiting” on my k8s cluster.
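In case it helps anyone reproduce the measurement without watching kubectl, this is a minimal sketch of how the peak overlap and the average gap between job starts could be computed from job records. It assumes each record carries ISO 8601 `started` and `finished` timestamps, as I believe the job objects under `/api/v2/jobs/` do; the sample records below are made up for illustration and are not from my cluster.

```python
from datetime import datetime


def parse_ts(ts):
    """Parse an ISO 8601 timestamp; a trailing 'Z' is treated as UTC."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def peak_concurrency(jobs):
    """Largest number of jobs whose [started, finished) windows overlap."""
    events = []
    for job in jobs:
        events.append((parse_ts(job["started"]), 1))
        events.append((parse_ts(job["finished"]), -1))
    # At equal timestamps, process finishes (-1) before starts (+1).
    events.sort(key=lambda e: (e[0], e[1]))
    running = peak = 0
    for _, delta in events:
        running += delta
        peak = max(peak, running)
    return peak


def mean_start_gap_seconds(jobs):
    """Average time between consecutive job starts, in seconds."""
    starts = sorted(parse_ts(job["started"]) for job in jobs)
    gaps = [(b - a).total_seconds() for a, b in zip(starts, starts[1:])]
    return sum(gaps) / len(gaps)


# Hypothetical sample records (field names are an assumption, values invented).
sample_jobs = [
    {"started": "2024-07-01T12:00:00Z", "finished": "2024-07-01T12:00:30Z"},
    {"started": "2024-07-01T12:00:10Z", "finished": "2024-07-01T12:00:40Z"},
    {"started": "2024-07-01T12:00:30Z", "finished": "2024-07-01T12:01:00Z"},
]

print("peak concurrency:", peak_concurrency(sample_jobs))
print("mean gap between starts (s):", mean_start_gap_seconds(sample_jobs))
```

A mean start gap close to the job duration together with a low peak would match what I am seeing: jobs being started serially rather than in batches.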
Thanks in advance!