Spinning up AWX pods in an AKS cluster with Azure CNI overlay mode

We recently observed an issue with AWX (21.4) deployed in our AKS cluster, which is provisioned with Azure CNI in overlay mode. In the AWX UI, jobs were stuck in the waiting state. In this version of AWX, a job normally moves from pending to waiting to running before reaching one of the end states: successful, failed, canceled, or errored.
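For anyone who wants to spot this condition without watching the UI, here is a minimal sketch that flags jobs sitting in the waiting state for too long. It assumes you already have job records as returned by the AWX job list API (dicts with `id`, `status`, and `created` fields); the function name `stuck_waiting_jobs` and the 5-minute threshold are hypothetical, and time since `created` is only an approximation of time spent waiting, since it also includes the pending phase.

```python
from datetime import datetime, timedelta, timezone


def stuck_waiting_jobs(jobs, now, max_wait=timedelta(minutes=5)):
    """Return the ids of jobs that are still 'waiting' more than max_wait after creation.

    `jobs` is a list of dicts shaped like AWX job records (assumed fields:
    'id', 'status', 'created'). `max_wait` is an arbitrary threshold.
    """
    stuck = []
    for job in jobs:
        if job["status"] != "waiting":
            continue
        # AWX timestamps are ISO 8601 with a trailing 'Z'; normalise for fromisoformat.
        created = datetime.fromisoformat(job["created"].replace("Z", "+00:00"))
        if now - created > max_wait:
            stuck.append(job["id"])
    return stuck


if __name__ == "__main__":
    # Made-up sample records, for illustration only.
    sample = [
        {"id": 1, "status": "waiting", "created": "2022-08-01T12:00:00.000000Z"},
        {"id": 2, "status": "running", "created": "2022-08-01T11:00:00.000000Z"},
    ]
    now = datetime(2022, 8, 1, 12, 10, tzinfo=timezone.utc)
    print(stuck_waiting_jobs(sample, now))
```

The function is kept free of network calls so it can be fed from whatever client you already use to query AWX.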

In our case, all jobs were stuck in the waiting state. Debugging turned up two causes:

  1. When we debugged, we found that one of our instance groups was configured incorrectly. The inventory job was routed to this instance group and stayed in the waiting state, blocking every job queued after it. Fixing the instance group configuration resolved this.
  2. We still hit the issue occasionally when AWX received a large number of jobs. The jobs were stuck in the waiting state, and restarts did not help, because jobs returned to the waiting state after some time. We saw that job pods took somewhat longer to spin up with Azure CNI in overlay mode than with kubenet. So we raised AWX_CONTAINER_GROUP_K8S_API_TIMEOUT from 10 seconds to 40 seconds, which fixed the issue.
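As an illustration of the timeout change in step 2, here is a minimal sketch that builds a patch for the AWX custom resource. It assumes AWX is deployed with awx-operator and that settings are passed through the operator's `extra_settings` list of `setting`/`value` pairs; the post does not say how the setting was actually applied, so treat this as one possible approach. The helper name `build_timeout_patch` is hypothetical.

```python
import json

SETTING_NAME = "AWX_CONTAINER_GROUP_K8S_API_TIMEOUT"


def build_timeout_patch(timeout_seconds, existing_settings=None):
    """Build a merge patch that sets the container group API timeout.

    Assumes the awx-operator `spec.extra_settings` format. A merge patch
    replaces the whole list, so any entries already present in the custom
    resource must be passed in via `existing_settings` to be preserved.
    """
    # Keep every existing entry except a previous value of the same setting.
    settings = [
        s for s in (existing_settings or []) if s.get("setting") != SETTING_NAME
    ]
    settings.append({"setting": SETTING_NAME, "value": str(timeout_seconds)})
    return {"spec": {"extra_settings": settings}}


if __name__ == "__main__":
    # 40 seconds is the value that worked for us; tune it for your cluster.
    print(json.dumps(build_timeout_patch(40)))
```

The printed JSON could then be applied with something like `kubectl patch awx <name> -n <namespace> --type merge -p '<json>'`, where `<name>` and `<namespace>` are placeholders for your own deployment.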

I am posting this for other community members who may run into similar issues with AWX on AKS (Azure CNI in overlay mode).

Thanks