We’ve been running into a recurring issue with our AWX environment where jobs hang in a pending state. The workaround has been to bounce the AKS pods, but so far we haven’t had any luck getting to a root cause.
In the meantime, we’ve added a smoke monitor that runs every 15 minutes, in the hope of catching the pending-jobs issue before a customer’s AWX job gets stuck. This past week we saw a gap in the schedule of the smoke monitor runs, and there didn’t appear to be anything else wrong with the environment.
You can see that between 5pm and 7:19pm, it didn’t run even though the schedule was for every 15 minutes.
I’ve just recently taken over technical ownership of AWX from people who have left the company, so I likely need help at a basic level even just to understand where to look. Many thanks for your assistance!
I don’t know what your issue is, but if you’re doing monitoring, I would like to suggest some items to collect. Try this, run via awx-manage shell_plus:
from awx.main.tasks.system import TowerScheduleState, now
[(now() - row[0]).total_seconds() for row in TowerScheduleState.objects.values_list('schedule_last_run')]
This should give a list with one element (any more would be a problem) containing the time in seconds since the last recorded scheduler run. If this falls significantly into arrears (well over 30 seconds), it means the task that spawns the scheduled jobs either isn’t running or is throwing errors.
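If you want to make this a standing check, here is a minimal sketch of wrapping that same query in a pass/fail test, run inside awx-manage shell_plus; the 30-second threshold is the one mentioned above, and the message format is just an illustration:

from awx.main.tasks.system import TowerScheduleState, now

# Seconds since the scheduler task last recorded a run; there should be exactly one row.
lags = [(now() - row[0]).total_seconds()
        for row in TowerScheduleState.objects.values_list('schedule_last_run')]

if len(lags) != 1:
    print("PROBLEM: expected exactly one TowerScheduleState row, found %d" % len(lags))
elif lags[0] > 30:
    print("PROBLEM: scheduler last ran %.0f seconds ago (over the 30s threshold)" % lags[0])
else:
    print("OK: scheduler last ran %.0f seconds ago" % lags[0])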
Also check:
awx-manage run_dispatcher --schedule
and look for “tower_scheduler” in the output. This will reflect the history of the relevant task being started.
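If it helps to collect that periodically as well, here is a minimal sketch that just filters the report for those lines, assuming the --schedule output is plain line-oriented text and that the command prints its report and exits (run it wherever awx-manage is available, e.g. in the task container):

import subprocess

# Run the dispatcher schedule report and keep only the tower_scheduler lines.
result = subprocess.run(
    ["awx-manage", "run_dispatcher", "--schedule"],
    capture_output=True, text=True, check=True,
)
for line in result.stdout.splitlines():
    if "tower_scheduler" in line:
        print(line)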
Lately, I’ve noticed that I accumulate some “missed_runs” in the --schedule output I mentioned. This is because I had been suspending my computer. Reproducing this somewhat intentionally, I found these logs:
This shows that those diagnostic tools are doing their job in the way I intended. Suspending a machine is not at all the same as shutting it down: to the program, it’s like time travel to the future, or a coma. It doesn’t run the shutdown or startup callbacks; the OS simply didn’t let it run for a window of time, so it logically missed schedules.
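Purely as illustrative arithmetic (not AWX code), any window in which the scheduler isn’t allowed to run maps onto missed runs the same way; using the 15-minute interval and the roughly 5pm–7:19pm gap from the first post as example numbers:

from datetime import timedelta

interval = timedelta(minutes=15)        # the smoke monitor's schedule interval
gap = timedelta(hours=2, minutes=19)    # a window in which the scheduler wasn't running

# Each full interval that elapses while nothing runs is one logically missed run.
missed = int(gap.total_seconds() // interval.total_seconds())
print("~%d scheduled runs missed during the gap" % missed)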
Thank you for these suggestions! We haven’t been intentionally doing anything during the missed runs, but I have a better sense of what to look for in the logs now.
It also seems that some of our concerns may be remediated by updates we haven’t deployed yet, so an upgrade is being scheduled.