Hello,
We run jobs in task containers; our k8s cluster runs on EKS with the cluster autoscaler.
We are running 100 jobs on the cluster, and each job takes on average 22-25 minutes to complete.
Currently the replica count for my AWX deployment is set to 3.
AWX configuration of note:
SYSTEM_TASK_ABS_CPU = 12 # ← This was auto-calculated by the installation script
SYSTEM_TASK_ABS_MEM = 61 # ← This was auto-calculated by the installation script
AWX_CONTAINER_GROUP_K8S_API_TIMEOUT = 30
AWX_CONTAINER_GROUP_POD_LAUNCH_RETRIES = 300
AWX_CONTAINER_GROUP_POD_LAUNCH_RETRY_DELAY = 30
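For reference, here is a minimal sketch of how I can check what capacity each AWX instance reports via /api/v2/instances/ (assuming an admin OAuth token; the hostname and token below are placeholders):

```python
# Minimal sketch: list AWX instances and the capacity each one reports.
# Assumes an admin OAuth token; hostname and token are placeholders.
import requests

AWX_URL = "https://awx.example.com"   # placeholder
TOKEN = "REPLACE_ME"                  # placeholder

resp = requests.get(
    f"{AWX_URL}/api/v2/instances/",
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
resp.raise_for_status()

for inst in resp.json()["results"]:
    print(
        inst["hostname"],
        "capacity:", inst["capacity"],
        "consumed:", inst.get("consumed_capacity"),
        "jobs_running:", inst.get("jobs_running"),
    )
```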
There is more than enough time for all jobs to start, but I've noticed that all the tasks are dispatched through only one task container. That container's memory usage keeps growing until it either crashes or, if there happens to be enough memory, the jobs finish correctly.
With 100 jobs, that single task container grows to 12GB of memory, and if it runs out, all the jobs crash afterwards.
The container resource limits are:
Web:
- Memory [1, 2]
- CPU [1, 1.5]
Task:
- Memory [6, 12]
- CPU [3, 6]
Redis:
- Memory [0.5, 2]
- CPU [0.5, 1.5]
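In case it's useful, a minimal sketch of dumping the requests/limits that are actually applied to each container (assuming the kubernetes Python client and that the deployment is named "awx" in namespace "awx"; both names are assumptions):

```python
# Minimal sketch: print the requests/limits actually applied to each AWX container.
# Assumes the `kubernetes` Python client and that the deployment is named "awx"
# in namespace "awx" (both names are assumptions).
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

dep = apps.read_namespaced_deployment(name="awx", namespace="awx")
for c in dep.spec.template.spec.containers:
    res = c.resources
    print(c.name, "requests:", res.requests, "limits:", res.limits)
```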
I confirmed that the jobs were created through different web API containers, as I can see in the logs that the POST for each new job is randomised across all the containers.
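To see which task container actually picks the jobs up, here is a minimal sketch that counts recent jobs per controller_node via the jobs API (assuming an admin OAuth token; the hostname and token are placeholders):

```python
# Minimal sketch: count how many recent jobs each control node handled.
# Assumes an admin OAuth token; hostname and token are placeholders.
from collections import Counter
import requests

AWX_URL = "https://awx.example.com"   # placeholder
TOKEN = "REPLACE_ME"                  # placeholder

counts = Counter()
url = f"{AWX_URL}/api/v2/jobs/?order_by=-finished&page_size=200"
resp = requests.get(url, headers={"Authorization": f"Bearer {TOKEN}"}, timeout=30)
resp.raise_for_status()

for job in resp.json()["results"]:
    counts[job.get("controller_node") or "unknown"] += 1

for node, n in counts.most_common():
    print(node, n)
```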
I’ve also tried using a single Redis (ElastiCache), but it’s the same issue: only one container picks up the jobs, and if it runs out of memory it crashes, and there is also some other weirdness happening at that time.
Does anyone know if I’m missing something obvious, or what I need to change in the code to make the jobs run across multiple task containers?
Is there any other information needed?