Long start time for jobs

Hello everyone,

I’ve got a issue with jobs taking a “long” time to start, usually 40secs - 1 min 30 secs. We use EEs from a registry running on the same node as AWX. Changing the settings for pulling the image always/missing/never does not change the behavior. A update for AWX and scaling up the resources on the node has also been done without result.

When checking the different container logs from the awx pod, I cannot find any that points to any long running calls that would create this behavior. Running watch -n 1 kubectl -n awx get pods when starting a new job shows that the job pod is not created until the last 10 seconds before I get job output in the GUI.
Manually docker pull and docker run on the image from the registry runs within 10 seconds at least.

Where could I look further to isolate the issue log-wise (any supply any logs to you guys)? Or is this a normal behavior?

Best Regards,
André

Hi André,
when the job starts, is it in the ‘Pending’ state or the ‘Running’ state?

Hi Michael,

The job starts in the “Running” state directly when initiating the job, and “Waiting for job output” in the stdout job window.
If I start multiple jobs on the same template, the jobs after the first job are in the “Pending” state until it is their turn, and they directly go to the “Running” state also. The same issue occurs here with the long start times.

Best Regards,
André

måndag 5 juni 2023 kl. 10:47:51 UTC+2 skrev michae...@sohonet.com:

for debugging purposes, is it possible to just use quay.io/ansible/awx-ee:latest (the default ee) to run your jobs? This will help rule out the custom EE registry you have set up

AWX Team

Hi AWX Team,

I’m seeing the same results with the default AWX-EE. I’ve captured logs from the awx-task container in kubectl when testing with default EE, see the attached txt file (job 521).
I also noticed when starting the job that awx-dispatch pushed only one CPU core to the max, for the whole wait until the last 10 seconds.

Best regards,
André

onsdag 7 juni 2023 kl. 20:55:53 UTC+2 skrev AWX Project:

(attachments)

logs_awx-task_defaultee.txt (11.8 KB)

thanks for providing logs!

2023-06-08 12:24:14,254 INFO [13548cf736a94d578debd84485cae396] awx.analytics.job_lifecycle job-521 pre run
2023-06-08 12:24:16,112 INFO [13548cf736a94d578debd84485cae396] awx.analytics.job_lifecycle job-521 preparing playbook
2023-06-08 12:24:16,213 INFO [13548cf736a94d578debd84485cae396] awx.analytics.job_lifecycle job-521 running playbook

then a minute later

2023-06-08 12:25:30,054 INFO [13548cf736a94d578debd84485cae396] awx.analytics.job_lifecycle job-521 work unit id received
2023-06-08 12:25:30,084 INFO [13548cf736a94d578debd84485cae396] awx.analytics.job_lifecycle job-521 work unit id assigned

between those log statements, AWX is building up the private data dir. In this period of time it copies over project files, playbook variables, and inventory hosts file over to a temporary directory (located on /tmp). It then zips up this directory and submits it to receptor, and then receptor spins up the automation job pod and starts running the playbook.

Generally speaking, a copy of the project files are already stored on that node in the project folder (which can optionally be mounted in with persistent projects). So unless the control node is brand new, of the project persistence dir is being cleaned up for some reason, then the controller node should already have these files. Copying them over shouldn’t take too much time.

There are a couple of reasons why this part could be slow

  1. You are specifying an scm_branch/tag override for your job template or inventory source.

  2. The project is massive in size and takes a while to copy over.

What is the delay between “running playbook” log and “work unit id received” for the Demo Job Template? Run it a few times in a row and then report the times of the last run.

AWX Team

Hi again,

Well, that explanation did the trick. My projects folder was 1.2 gigs because of images files that were uploaded to network devices with the playbook. Not the prettiest solution and that created the problem…
After I removed the image files I tried again and now it took 2-3 seconds to start the job.

Thank you for all your help!

Kind regards,
André

fredag 9 juni 2023 kl. 00:21:56 UTC+2 skrev AWX Project:

you’re welcome!

AWX Team