Job Slicing AWX Operator

I'm curious how job slicing works with the AWX Operator. When running against many resources, I'd expect jobs to line up in a pending status in the k8s environment, but it appears that it doesn't do that.

The goal is that under heavy load I'd like to see the EKS cluster autoscale up with Karpenter when the nodes run out of resources, rather than AWX limiting it.

Thoughts?

The AWX Operator only controls how many web/task replicas there are (task replicas are also known as control nodes and are members of the controlplane Instance Group in AWX). It doesn't support HPA yet, but there have been some discussions about how to implement and test such a feature.
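
For context, scaling those replicas today means editing the AWX custom resource that the operator watches. Here's a rough sketch with the kubernetes Python client, assuming the CR is named `awx` in an `awx` namespace (both hypothetical here) and that your operator version exposes `web_replicas`/`task_replicas` in its spec:

```python
# Sketch: bump the AWX web/task (control node) replica counts by patching the AWX CR.
# Assumes kubeconfig access and an operator version that supports these spec fields.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

api.patch_namespaced_custom_object(
    group="awx.ansible.com",
    version="v1beta1",
    plural="awxs",
    namespace="awx",   # hypothetical namespace
    name="awx",        # hypothetical AWX CR name
    body={"spec": {"task_replicas": 3, "web_replicas": 2}},
)
```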

Job slicing, on the other hand, has nothing to do with the operator. When you queue up a job with multiple slices, it's AWX's control/execution nodes that spawn the automation-job pods (not the operator!), up to the maximum concurrent job count defined on the available Instance Groups (the default is unlimited!).
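
If you want to see what those limits are currently set to, here's a rough sketch against the AWX REST API (Python + requests; the hostname and token are placeholders, and a value of 0 means unlimited):

```python
# Sketch: list each AWX Instance Group's concurrency limits via the v2 API.
import requests

AWX_HOST = "https://awx.example.com"   # hypothetical hostname
AWX_TOKEN = "REPLACE_ME"               # an application/OAuth2 token from AWX

headers = {"Authorization": f"Bearer {AWX_TOKEN}"}
resp = requests.get(f"{AWX_HOST}/api/v2/instance_groups/", headers=headers)
resp.raise_for_status()

for ig in resp.json()["results"]:
    # max_concurrent_jobs / max_forks of 0 mean "no limit"
    print(
        f"{ig['name']}: max_concurrent_jobs={ig['max_concurrent_jobs']}, "
        f"max_forks={ig['max_forks']}, capacity={ig['capacity']}, "
        f"consumed_capacity={ig['consumed_capacity']}"
    )
```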

I went over how you can tune that here.

I don't know anything about EKS or Karpenter, but if the k8s cluster can scale up to handle a high number of automation pods, then I don't see any reason why you couldn't take advantage of that. You would just need to make sure AWX's Instance Groups only spawn a maximum count that coincides with whatever maximum you allow the k8s cluster to scale to.
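
For example, if Karpenter can only scale to enough nodes for roughly 40 automation-job pods, you could cap the Instance Group to match. A minimal sketch, again with placeholder host/token and a hypothetical instance group ID:

```python
# Sketch: cap how many jobs an Instance Group will run at once.
import requests

AWX_HOST = "https://awx.example.com"   # hypothetical hostname
AWX_TOKEN = "REPLACE_ME"
IG_ID = 2                              # hypothetical instance group ID
MAX_PODS = 40                          # whatever your cluster can realistically scale to

headers = {"Authorization": f"Bearer {AWX_TOKEN}"}
resp = requests.patch(
    f"{AWX_HOST}/api/v2/instance_groups/{IG_ID}/",
    headers=headers,
    json={"max_concurrent_jobs": MAX_PODS},
)
resp.raise_for_status()
print(resp.json()["max_concurrent_jobs"])
```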

Hey thanks for the quick reply!

Apologies, I guess that was a poor explanation on my part. I realize the operator doesn't spawn the automation jobs. I'm more curious why I'm not seeing a lot more automation jobs get spawned when I run a job with, say, 50 slices. The workflow is created and I see many jobs "running" from the UI, but under the hood only 5-8 automation jobs are being spawned at a time.
All instance group settings are default, so 'unlimited' should be in effect, and I would expect essentially all 50 pods to be created at nearly the same time?

Thanks again.

All good, just wanted to make sure we're both on the same page on what does what, and what you're wanting to do versus what behavior you're observing.

Yes, I would expect that as well. That's what was causing @vivian's cluster to OOM-kill most of the pods: the k8s cluster didn't have the resources or auto-scalability to handle 30+ job slices at once.

I’m guessing we may need to take a closer look at how you’re slicing the jobs and maybe what the workflow looks like.

Are the maximum forks also default/unlimited? I don’t think that affects the maximum number of pods, but I’m not sure.

Do you see ~50 jobs queued up in AWX, most as pending, when you run your template/workflow with 50 slices?

Have you tried toggling the "enable concurrent jobs" option in the templates you're slicing? I don't actually use job slicing myself, so I don't know if that option is necessary for running job slices in parallel.


For reference, right now we're just testing how it behaves, so the actual playbook is just a debug, wait, debug. This should allow all the jobs to spawn together.

According to the docs, when you enable job slicing AWX will generate a workflow from a job template instead of a single job. When I launch it I do see ~50 jobs launch in the UI, all in a "running" status but with no output. Then they slowly start completing one by one. In k8s I can see automation-job pods being spun up slowly, roughly 5 at a time, but as one completes another is spawned.

Yes, max forks and max concurrent jobs are default on the instance group, which is just the 'default' one. In the job template we have slicing set to 50 and forks set to 0. Note we've moved all of these numbers around (forks to 50, slicing to 50, back to zero, all sorts of combinations) but can't seem to get more than 5 pods to run at the same time. Maybe slicing does some other limiting, I'm not really sure.
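
If it's useful, this is the kind of API check we can run to see what status AWX itself reports for each slice, rather than relying on the UI (sketch only; the hostname, token, and page size are placeholders):

```python
# Sketch: report the API-level status of each slice in the most recent jobs.
from collections import Counter
import requests

AWX_HOST = "https://awx.example.com"   # hypothetical hostname
AWX_TOKEN = "REPLACE_ME"

headers = {"Authorization": f"Bearer {AWX_TOKEN}"}
resp = requests.get(
    f"{AWX_HOST}/api/v2/jobs/",
    headers=headers,
    params={"order_by": "-created", "page_size": 60},
)
resp.raise_for_status()
jobs = resp.json()["results"]

# job_slice_number / job_slice_count identify a job's slice within a sliced run
sliced = [j for j in jobs if j["job_slice_count"] > 1]
for job in sliced:
    print(f"slice {job['job_slice_number']}/{job['job_slice_count']}: status={job['status']}")

print(Counter(j["status"] for j in sliced))
```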

Note we are on AWX v24.2.0

That sounds like they’re not allowed to run concurrently then.

You need to enable it in the job template at the bottom of the page:

[screenshot of the job template options showing the concurrent jobs checkbox]
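
For what it's worth, that UI checkbox corresponds (as far as I know) to the allow_simultaneous field on the job template in the API, so you could also set it there. A quick sketch with placeholder host/token and a hypothetical template ID:

```python
# Sketch: enable concurrent runs on a job template via the AWX v2 API.
import requests

AWX_HOST = "https://awx.example.com"   # hypothetical hostname
AWX_TOKEN = "REPLACE_ME"
JT_ID = 42                             # hypothetical job template ID

headers = {"Authorization": f"Bearer {AWX_TOKEN}"}
# "Concurrent Jobs" checkbox in the UI maps to allow_simultaneous in the API
resp = requests.patch(
    f"{AWX_HOST}/api/v2/job_templates/{JT_ID}/",
    headers=headers,
    json={"allow_simultaneous": True},
)
resp.raise_for_status()
print(resp.json()["allow_simultaneous"])
```
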
That's the more fun part: we've tried it both ways, but according to the docs, when using slicing that doesn't matter and it should override it to run concurrently even if it's not checked.

@Vivian You technically have more experience with job slicing than I do now, would you have any idea why @beda392 would have the opposite problem you had? lol

Sorry, unless you’re trying to slice more jobs than you have inventory_hosts, I’m out of ideas.

No worries, I appreciate all the ideas!