This seems to be a new thing I noticed recently (pretty sure it was working correctly in the past).
In short, while using job slicing, all automation job pods get scheduled on a single worker node. There are 4 worker nodes in the cluster; AWX itself (the operator) is distributed across 2 of them.
Is there anything I'm missing in the AWX config or on the K8s side? Ideally I'd like an even distribution, or some load balancing within the cluster to decide which worker should be used.
I'm not using any affinity rules on the K8s side, or any other configuration that would affect scheduling. The job has 8 slices and 200 forks, running against ~6000+ systems.
Any insight into this would be hugely appreciated.
Thank you for the resource, but I have to admit I'm a bit lost as to how this relates to my issue.
Is the default behaviour not to balance execution between nodes in the cluster? Is there additional configuration required to achieve this? If so, is that configuration done at the EE level?
As I mentioned, in the past I saw automation job pods being created on multiple workers while slicing.
I have not been using the latest EE for the past few days, since there have been some issues with missing Python dependencies, so I fell back to 21.11.0 for the time being. Even though the claim is that this has been fixed, attempts to use the latest EE still result in the same error as before. Would using this particular EE be in any way related to the issue with scheduling jobs on multiple workers?
I have tested with EE-latest again, as it seems it is now fixed, and this time around the jobs got spread across the cluster as expected.
This would suggest you were dead on with the root cause. I still don't understand how, though, so again I'd appreciate a more in-depth explanation or some additional resources I can reference.
Which nodes the automation job pods end up on is not determined by AWX, but by the k8s cluster itself. You may look into adding affinity attributes to the EE pod spec, which may result in the desired distribution of job pods when launched; a rough sketch of what that could look like is below.
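For illustration only, here is a minimal sketch of a pod anti-affinity rule in a container group's custom pod spec. The `app: automation-job` label, the namespace, and the image are placeholders I've made up for the example, not AWX defaults, so adjust them to match your instance's actual default pod spec:

```yaml
# Hypothetical pod spec override for an AWX container group; the label
# key/value, namespace, and image below are placeholders, not defaults.
apiVersion: v1
kind: Pod
metadata:
  namespace: awx
  labels:
    app: automation-job            # placeholder label applied to every job pod
spec:
  affinity:
    podAntiAffinity:
      # "preferred" rather than "required", so jobs still run even when a
      # perfectly even spread across nodes is not possible
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: automation-job
            topologyKey: kubernetes.io/hostname   # spread across nodes
  containers:
    - name: worker
      image: quay.io/ansible/awx-ee:latest        # placeholder EE image
      args:
        - ansible-runner
        - worker
        - '--private-data-dir=/runner'
```

With this in place the scheduler tries to avoid putting two pods carrying the same label on the same node, which is what gives you the spread.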
Is it then fair to assume that the latest EE already has the necessary modifications in place to distribute pods correctly, while its older versions don't? That is to say, does the cluster know how to behave based on some identifier assigned to EE-latest?
Is it possible to copy/replicate the behaviour of EE-latest on older or custom EEs in a simple way? (I'm not a k8s guru.)
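Something like the snippet below is what I would naively try in the container group pod spec, using a topology spread constraint instead of affinity. This is pure guesswork on my part, and the label is a placeholder I'd be adding myself:

```yaml
# My guess at a "simple" way to spread job pods: a topology spread
# constraint that asks the scheduler to balance pods across worker nodes.
# Label, namespace, and image are placeholders, not AWX defaults.
apiVersion: v1
kind: Pod
metadata:
  namespace: awx
  labels:
    app: automation-job            # placeholder label
spec:
  topologySpreadConstraints:
    - maxSkew: 1                              # tolerate at most 1 pod of imbalance
      topologyKey: kubernetes.io/hostname     # one topology domain per node
      whenUnsatisfiable: ScheduleAnyway       # soft preference, not a hard rule
      labelSelector:
        matchLabels:
          app: automation-job
  containers:
    - name: worker
      image: quay.io/ansible/awx-ee:latest    # placeholder EE image
```

Would that be a sane starting point, or is the affinity approach the better route here?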