This seems to be a new thing I noticed recently (pretty sure it was working correctly in the past).
In short, while using job slicing, all automation job pods get scheduled on a single worker node. There are 4 worker nodes in the cluster; AWX itself (the operator) is distributed across 2 of them.
Is there anything I'm missing in the AWX config or on the K8s side? Ideally I'd like an even distribution, or some load balancing within the cluster to decide which worker should be used.
I'm not using any affinity rules on the K8s side, or any other configuration that would affect scheduling. The job has 8 slices and 200 forks, running against ~6000+ systems.
Any insight into this would be hugely appreciated.
Thank you for the resource, but I have to admit I'm a bit lost as to how this relates to my issue.
Is the default behaviour not to balance execution between nodes in the cluster? Is there additional configuration required to achieve this? If so, is that configuration done at the EE level?
As I mentioned, in the past I saw automation job pods being created on multiple workers while slicing.
I have not been using the latest EE for the past few days, since there have been some issues with missing Python dependencies, so I fell back to 21.11.0 for the time being. Even though the claim is that this has been fixed, attempts to use the latest EE still result in the same error as before. Would using this particular EE be in any way related to the issue with scheduling jobs on multiple workers?
I have tested with EE-latest again, as it seems it is now fixed, and this time around the jobs got spread across the cluster as expected.
This would suggest you were dead on with the root cause. I still don't understand how, though, so again I'd appreciate a more in-depth explanation or some additional resources I can reference.
Which nodes the automation job pods end up on is not determined by AWX, but by the k8s cluster itself. You may look into adding affinity attributes to the EE pod spec, which may result in the desired distribution of job pods when launched; a rough sketch of what that could look like is below.
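For illustration only, here is a minimal sketch of a pod anti-affinity rule in a container group's custom pod spec. The `app: automation-job` label, the namespace, and the image are placeholders I've made up for the example, not AWX defaults, so adjust them to match your instance's actual default pod spec:

```yaml
# Hypothetical pod spec override for an AWX container group; the label
# key/value, namespace, and image below are placeholders, not defaults.
apiVersion: v1
kind: Pod
metadata:
  namespace: awx
  labels:
    app: automation-job            # placeholder label applied to every job pod
spec:
  affinity:
    podAntiAffinity:
      # "preferred" rather than "required", so jobs still run even when a
      # perfectly even spread across nodes is not possible
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: automation-job
            topologyKey: kubernetes.io/hostname   # spread across nodes
  containers:
    - name: worker
      image: quay.io/ansible/awx-ee:latest        # placeholder EE image
      args:
        - ansible-runner
        - worker
        - '--private-data-dir=/runner'
```

With this in place the scheduler tries to avoid putting two pods carrying the same label on the same node, which is what gives you the spread.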
Is it then fair to assume that the latest EE already has the necessary modifications in place to distribute pods correctly, while its older versions don't? That is to say, does the cluster know how to behave based on some identifier assigned to EE-latest?
Is it possible to copy/replicate the behaviour of EE-latest on older or custom EEs in a simple way? (I'm not a k8s guru.)
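Something like the snippet below is what I would naively try in the container group pod spec, using a topology spread constraint instead of affinity. This is pure guesswork on my part, and the label is a placeholder I'd be adding myself:

```yaml
# My guess at a "simple" way to spread job pods: a topology spread
# constraint that asks the scheduler to balance pods across worker nodes.
# Label, namespace, and image are placeholders, not AWX defaults.
apiVersion: v1
kind: Pod
metadata:
  namespace: awx
  labels:
    app: automation-job            # placeholder label
spec:
  topologySpreadConstraints:
    - maxSkew: 1                              # tolerate at most 1 pod of imbalance
      topologyKey: kubernetes.io/hostname     # one topology domain per node
      whenUnsatisfiable: ScheduleAnyway       # soft preference, not a hard rule
      labelSelector:
        matchLabels:
          app: automation-job
  containers:
    - name: worker
      image: quay.io/ansible/awx-ee:latest    # placeholder EE image
```

Would that be a sane starting point, or is the affinity approach the better route here?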