How to troubleshoot AWX when all jobs are stuck in pending state?

Hi all, we are running AWX 11.1 in a Kubernetes cluster with ten nodes. Occasionally AWX gets into a state where no jobs are being run, and all new jobs are stuck in ‘pending’ state. This has been ongoing since earlier versions of AWX.

We have spent a fair amount of time trying to figure out what’s going on in these cases, but it’s never very clear. There are typically no clues (to our eyes) in the awx-task pod logs. We usually see a lot of “awx.main.scheduler Not running scheduler, another task holds lock” messages, but this seems normal for the nodes that do not hold the lock.

In at least a few of these cases, we noticed that the earliest job that was not running was a project update, and all of our playbooks are contained in that project.

We’ve also dug into the database looking for obvious signs of stale locks, but aren’t really sure what to expect here or what to look for.

Usually, we end up using awx-manage shell_plus to iterate over all of the pending jobs and canceling each one, and this usually frees things up (although it seems to take maybe 10 or 15 minutes after this completes before AWX is operational again).
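For reference, the workaround we run looks roughly like the sketch below. It is a simplified sketch rather than our exact script: the helper name cancel_pending is made up for illustration, and it only assumes that each job object has an id, a status, and a cancel() method, which is what we see on the UnifiedJob model in shell_plus.

```python
def cancel_pending(jobs):
    """Cancel every job in `jobs` whose status is still 'pending'.

    `jobs` can be any iterable of objects exposing `.id`, `.status`
    and `.cancel()`. Returns the ids that were cancelled, in order.
    """
    cancelled = []
    for job in jobs:
        if job.status != "pending":
            # Leave running, waiting, or finished jobs alone.
            continue
        job.cancel()
        cancelled.append(job.id)
    return cancelled


# Assumed usage inside `awx-manage shell_plus` (adjust to your version):
#   from awx.main.models import UnifiedJob
#   cancel_pending(UnifiedJob.objects.filter(status="pending").order_by("created"))
```

Ordering by creation time means the oldest pending job (in our case often the project update) is cancelled first.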

Is there a way (db query, API call, or something else) to clearly see what is blocking AWX or what state the ‘executor’ is in? Even just knowing which node/pod has the lock would be useful.
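To make the question concrete, this is the kind of API call we have in mind: list the pending jobs from the unified jobs endpoint, oldest first, so the job at the head of the queue is visible. It is only a sketch. The base URL and token are placeholders, the helper names are made up, and it does not show which pod holds the scheduler lock (that is the part we don't know how to find).

```python
import json
import urllib.parse
import urllib.request


def pending_jobs_url(base_url):
    """Build the unified_jobs URL filtered to pending jobs, oldest first."""
    query = urllib.parse.urlencode({"status": "pending", "order_by": "created"})
    return base_url.rstrip("/") + "/api/v2/unified_jobs/?" + query


def summarize(results):
    """Reduce API results to (id, type, name, created) tuples, oldest first."""
    ordered = sorted(results, key=lambda job: job["created"])
    return [(j["id"], j.get("type"), j.get("name"), j["created"]) for j in ordered]


def fetch_pending(base_url, token):
    """Fetch the first page of pending jobs (placeholder URL and OAuth2 token)."""
    request = urllib.request.Request(
        pending_jobs_url(base_url),
        headers={"Authorization": "Bearer " + token},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["results"]


# Hypothetical usage:
#   for row in summarize(fetch_pending("https://awx.example.com", "MY_TOKEN")):
#       print(row)
```

If the first row is a project_update, that would match what we have seen in a few of these incidents.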

Thanks,

Hi John,

You should upgrade to AWX 12.0.0 or newer. AWX 11.1 had some issues with blocking events.

Best regards
Stefan