AWX Skipping Scheduled Jobs

We’ve been running into a regular issue with our AWX environment where jobs get stuck in a pending state. The fix has been to bounce the AKS pods, but so far we haven’t had any luck getting to a root cause.

In the meantime, we’ve added a smoke monitor to run every 15 minutes in the hopes of catching the pending jobs issue before an AWX customer job gets stuck. This past week, we saw a gap in the schedule of the smoke monitor runs. There didn’t appear to be anything else wrong with the environment.

Between 5pm and 7:19pm, for example, the monitor didn’t run at all, even though the schedule was for every 15 minutes.

I’ve just recently taken over technical ownership of AWX from people who have left the company, so help at a basic level is likely what I need even just to understand where to look. Many thanks for your assistance!

I don’t know what your issue is, but if you’re doing monitoring, I would like to suggest some items to collect. Try this, run via awx-manage shell_plus:

from awx.main.tasks.system import TowerScheduleState, now

[(now() - row[0]).total_seconds() for row in TowerScheduleState.objects.values_list('schedule_last_run')]

This should give a list with 1 element (any more would be a problem) containing the number of seconds since the last recorded scheduler run. If this goes into arrears (significantly over 30 seconds), then the task that spawns the scheduled jobs either isn’t running or is throwing errors.
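To make that easier to automate, here is a minimal sketch of the same check as a pass/fail step, still meant to be pasted into awx-manage shell_plus. Only the query itself comes from above; the function name, the print-based alert, and the 30-second threshold are my own framing.

from awx.main.tasks.system import TowerScheduleState, now

def scheduler_lag_seconds():
    # Seconds since the periodic scheduler last recorded a run; there should be exactly one row.
    lags = [(now() - row[0]).total_seconds() for row in TowerScheduleState.objects.values_list('schedule_last_run')]
    if len(lags) != 1:
        print('unexpected number of TowerScheduleState rows: %d' % len(lags))
    return max(lags) if lags else float('inf')

lag = scheduler_lag_seconds()
if lag > 30:  # rule-of-thumb threshold from the paragraph above
    print('WARNING: scheduler last ran %.0f seconds ago' % lag)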

Also check

awx-manage run_dispatcher --schedule

and look for “tower_scheduler” in the output. This reflects the recent history of starting the relevant task (completed and missed runs).

Lately, I’ve noticed that I accumulate some “missed_runs” in the --schedule output I mentioned, because I had been suspending my computer. Reproducing this somewhat intentionally, I found these logs:

tools_awx_1 | awx-dispatcher stderr | 2023-10-24 20:13:21,248 WARNING  [-] awx.main.dispatch.periodic Missed 13 schedules of tower_scheduler
tools_awx_1 | awx-dispatcher stderr | 2023-10-24 20:13:21,248 WARNING  [-] awx.main.dispatch.periodic Missed 6 schedules of cluster_heartbeat
tools_awx_1 | awx-dispatcher stderr | 2023-10-24 20:13:21,248 WARNING  [-] awx.main.dispatch.periodic Missed 20 schedules of task_manager
tools_awx_1 | awx-dispatcher stderr | 2023-10-24 20:13:21,248 WARNING  [-] awx.main.dispatch.periodic Missed 20 schedules of dependency_manager
tools_awx_1 | awx-dispatcher stderr | 2023-10-24 20:13:21,248 WARNING  [-] awx.main.dispatch.periodic Missed 6 schedules of k8s_reaper
tools_awx_1 | awx-dispatcher stderr | 2023-10-24 20:13:21,248 WARNING  [-] awx.main.dispatch.periodic Missed 6 schedules of receptor_reaper
tools_awx_1 | awx-dispatcher stderr | 2023-10-24 20:13:21,249 WARNING  [-] awx.main.dispatch.periodic Missed 20 schedules of send_subsystem_metrics
tools_awx_1 | awx-dispatcher stderr | 2023-10-24 20:13:21,249 WARNING  [-] awx.main.dispatch.periodic Missed 6 schedules of pool_cleanup
tools_awx_1 | awx-dispatcher stderr | 2023-10-24 20:13:21,249 WARNING  [-] awx.main.dispatch.periodic Missed 19 schedules of metrics_gather
tools_awx_1 | awx-dispatcher stderr | 2023-10-24 20:13:21,304 WARNING  [-] awx.main.tasks.system Rejoining the cluster as instance awx_1. Prior last_seen 2023-10-24 20:05:32.213804+00:00

And from the --schedule output:

  tower_scheduler:
    last_run_seconds_ago: 11.357
    next_run_in_seconds: 18.609
    offset_in_seconds: 0
    completed_runs: 8
    missed_runs: 13
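
If you want to watch that field without eyeballing the output, here is a hedged sketch that shells out to the same command and pulls missed_runs for tower_scheduler. It assumes the output parses as YAML in roughly the shape of the excerpt above, and that PyYAML is importable wherever it runs; the format isn’t a documented, stable interface, so treat this as illustrative.

import subprocess
import yaml

raw = subprocess.run(
    ['awx-manage', 'run_dispatcher', '--schedule'],
    capture_output=True, text=True, check=True,
).stdout
data = yaml.safe_load(raw) or {}

# The excerpt shows tower_scheduler indented one level, so it may sit under a
# top-level key; check the root first, then one level down.
entry = data.get('tower_scheduler')
if entry is None:
    for value in data.values():
        if isinstance(value, dict) and 'tower_scheduler' in value:
            entry = value['tower_scheduler']
            break

if entry and entry.get('missed_runs'):
    print('tower_scheduler missed_runs: %s' % entry['missed_runs'])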

This shows that those diagnostic tools are doing their job the way I intended. Suspending a machine is not at all the same as shutting it down: to the program, it’s like time travel into the future, or a coma. No shutdown or startup callbacks run; the OS simply didn’t let the process run for a window of time, so it logically missed its schedules.
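
As a back-of-the-envelope check (my own arithmetic, not AWX’s actual accounting), the numbers line up: the --schedule output implies tower_scheduler runs on roughly a 30-second period (11.357 + 18.609 ≈ 30), and the “Rejoining the cluster” log line shows the instance had been unseen for almost 8 minutes.

# Rough arithmetic only; the real counter depends on where in the cycle the suspend began.
period = 30                            # seconds, inferred from last_run_seconds_ago + next_run_in_seconds
gap = (13 * 60 + 21) - (5 * 60 + 32)   # 20:05:32 -> 20:13:21 from the log above, in seconds
print(gap // period)                   # 15, in the same ballpark as the 13 missed tower_scheduler schedules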

Thank you for these suggestions! We haven’t been doing anything like that intentionally during the missed runs, but I have a better sense of what to look for in the logs now.

It also seems that some of our concerns may be remediated with updates we haven’t deployed yet, so an upgrade is being scheduled.