Error Locating Unit: `K5Sp0VXg` - Unknown Work Unit in AWX/Ansible Tower

Hi Ansible Community,

I’m encountering an issue in my AWX/Ansible Tower environment and would appreciate any insights or guidance on how to resolve it.

Error Details

The following error appears in the logs:

ERROR 2025/02/06 15:12:00 Error locating unit: K5Sp0VXg
ERROR 2025/02/06 15:12:00 unknown work unit K5Sp0VXg

This error occurs when the system is unable to locate a specific work unit (K5Sp0VXg). It seems to be related to task management, but I’m unsure of the root cause.
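
In case it helps with diagnosis, this is roughly how I list the work units receptor currently knows about on the control plane. The namespace, deployment, container name, and socket path below match my deployment and may need adjusting for others:

    # list the work units receptor is tracking (run in the awx-ee container of the task pod)
    kubectl -n awx exec deploy/awx-task -c awx-ee -- \
        receptorctl --socket /var/run/receptor/receptor.sock work list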

Environment Details

  • AWX Version: 22.7
  • Deployment Method: AKS (Azure Kubernetes Service)
  • Database: PostgreSQL
  • Logs: No additional errors in the task or web pod logs.

Steps Taken So Far

  1. Checked the status of AWX task pods – all are running without issues.
  2. Searched the database for the work unit K5Sp0VXg (see also the follow-up query after this list):
    SELECT * FROM main_unifiedjob WHERE uuid = 'K5Sp0VXg';
    
    The query returned no results, indicating the work unit is missing.
  3. Verified task synchronization – the task was submitted via the AWX API, but it seems it wasn’t recorded in the database.
  4. Restarted AWX task pods to clear any transient issues.
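
As a follow-up to step 2: since, as far as I understand, the receptor work unit ID is stored in the work_unit_id column rather than in uuid, I also plan to match against that column (please correct me if that assumption is wrong for 22.7):

    -- same check as step 2, but against the column that (I believe) holds the receptor work unit ID
    SELECT id, status, started, finished, work_unit_id
    FROM main_unifiedjob
    WHERE work_unit_id = 'K5Sp0VXg';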

Questions

  1. I have two worker nodes and this is happening on only one of them. Why would only one node be affected?
  2. What could cause a work unit to go missing in the database?
  3. Are there known issues with task synchronization in AWX/Ansible Tower?
  4. How can I prevent this issue from recurring?
  5. Is there a way to recover or recreate the missing work unit without disrupting the system?

Additional Context

  • This issue occurs intermittently, and I’ve noticed similar errors for other work units (e.g., wMvEP6LC, 9tzJOrvg).
  • The system is configured to automatically clean up completed tasks after 30 days (equivalent command shown below).
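
For completeness, if I understand the built-in cleanup correctly, the scheduled "Cleanup Job Details" system job is roughly equivalent to running the following management command (a sketch of our retention setting, not an exact reproduction of the schedule):

    # remove job records older than 30 days (what our scheduled cleanup effectively does)
    awx-manage cleanup_jobs --days=30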

Any help or suggestions would be greatly appreciated!

Thanks in advance,

Regards,
Manish Singh

Any input on the above request, please?

Looking for already-finished work units?

We see the same thing; there also seems to be a related issue: receptor#758.

Even though jobs run without error, receptor on the Execution Node strangely logs `Error locating unit` / `unknown work unit` for every work unit after the respective job has finished (see the example below).

What could be the reason for this?

Is it possible receptor tries a `receptorctl work release` after jobs finish and fails because the work unit directory has already been removed by something like `ansible-runner ... --delete` / `podman run --rm`?
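
For reference, releasing a work unit by hand looks like this; the unit ID is a placeholder, and this is how I would try to reproduce the suspected double cleanup, not a confirmed reproduction:

    # manually release (clean up) a finished work unit on the Execution Node
    receptorctl --socket /var/run/receptor/receptor.sock work release <work-unit-id>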

Our setup

  • AWX 24.6.1 (Openshift, PostgreSQL) + Execution Nodes
  • Same receptor version on awx-task instances and Execution Nodes
    • receptor on the ENs is installed from the GitHub release, hence the v prefix in v1.5.3 (which triggers the version-mismatch warning below):
    # receptorctl --socket /var/run/receptor/receptor.sock version
    Warning: receptorctl and receptor are different versions, they may not be compatible
    receptorctl  1.5.3
    receptor     v1.5.3
    
  • Custom EE image with the following ENTRYPOINT/CMD (inspect check shown after this list):
    ENTRYPOINT ["/opt/builder/bin/entrypoint", "dumb-init"]
    CMD ["ansible-runner", "worker", "--private-data-dir=/runner"]
    
    

awx-task instance: Log of job

2025-03-25 16:49:12,331 INFO     [769d7fc23c3d40bcbfb3174366dc93fa] awx.analytics.job_lifecycle job-221782 pre run {"type": "job", "task_id": 221782, "state": "pre_run", "work_unit_id": null, "task_name": "awx-tools/sleep"}
2025-03-25 16:49:13,315 INFO     [769d7fc23c3d40bcbfb3174366dc93fa] awx.analytics.job_lifecycle job-221782 preparing playbook {"type": "job", "task_id": 221782, "state": "preparing_playbook", "work_unit_id": null, "task_name": "awx-tools/sleep"}
2025-03-25 16:49:13,433 INFO     [769d7fc23c3d40bcbfb3174366dc93fa] awx.analytics.job_lifecycle job-221782 running playbook {"type": "job", "task_id": 221782, "state": "running_playbook", "work_unit_id": null, "task_name": "awx-tools/sleep"}
2025-03-25 16:49:13,985 INFO     [769d7fc23c3d40bcbfb3174366dc93fa] awx.analytics.job_lifecycle job-221782 work unit id received {"type": "job", "task_id": 221782, "state": "work_unit_id_received", "work_unit_id": "awxtask5bf78fc7b74s29djGTsAhAv", "task_name": "awx-tools/sleep"}
2025-03-25 16:49:14,074 INFO     [769d7fc23c3d40bcbfb3174366dc93fa] awx.analytics.job_lifecycle job-221782 work unit id assigned {"type": "job", "task_id": 221782, "state": "work_unit_id_assigned", "work_unit_id": "awxtask5bf78fc7b74s29djGTsAhAv", "task_name": "awx-tools/sleep"}
2025-03-25 16:49:38,693 INFO     [769d7fc23c3d40bcbfb3174366dc93fa] awx.main.commands.run_callback_receiver Starting EOF event processing for Job 221782
2025-03-25 16:49:38,744 INFO     [769d7fc23c3d40bcbfb3174366dc93fa] awx.analytics.job_lifecycle job-221782 post run {"type": "job", "task_id": 221782, "state": "post_run", "work_unit_id": "awxtask5bf78fc7b74s29djGTsAhAv", "task_name": "awx-tools/sleep"}
2025-03-25 16:49:38,984 INFO     [769d7fc23c3d40bcbfb3174366dc93fa] awx.analytics.job_lifecycle job-221782 finalize run {"type": "job", "task_id": 221782, "state": "finalize_run", "work_unit_id": "awxtask5bf78fc7b74s29djGTsAhAv", "task_name": "awx-tools/sleep"}
2025-03-25 16:49:39,467 INFO     [-] awx.analytics.job_lifecycle job-221782 stats wrapup finished {"type": "job", "task_id": 221782, "state": "stats_wrapup_finished", "work_unit_id": "awxtask5bf78fc7b74s29djGTsAhAv", "task_name": "awx-tools/sleep"}

Execution Node: work unit

On the Execution Node, work units show up as expected when listing them before/while they are being processed:

# receptorctl --socket /var/run/receptor/receptor.sock work list
Warning: receptorctl and receptor are different versions, they may not be compatible
{
    "awxtask5bf78fc7b74s29djGTsAhAv": {
        "Detail": "Running: PID 845686",
        "ExtraData": {
            "Params": "worker --private-data-dir=/opt/awx/awx_tmp/awx_221782_54ztx1uq --delete",
            "Pid": 845678
        },
        "State": 1,
        "StateName": "Running",
        "StdoutSize": 11094,
        "WorkType": "ansible-runner"
    }
}

Execution Node: receptor.log

On an Execution Node, every work unit seems to be logged as ‘unknown’ after it has been processed, like so:

# tail /var/log/receptor/receptor.log
ERROR 2025/03/25 17:48:38 : unknown work unit awxtask5bf78fc7b74s29djhpWRQXR
ERROR 2025/03/25 17:48:53 Error locating unit: awxtask5bf78fc7b74s29d3nz2ZCmX
ERROR 2025/03/25 17:48:53 : unknown work unit awxtask5bf78fc7b74s29d3nz2ZCmX
ERROR 2025/03/25 17:49:11 Error locating unit: awxtask5bf78fc7b74s29d3nz2ZCmX
ERROR 2025/03/25 17:49:11 : unknown work unit awxtask5bf78fc7b74s29d3nz2ZCmX
ERROR 2025/03/25 17:49:38 Error locating unit: awxtask5bf78fc7b74s29djGTsAhAv
ERROR 2025/03/25 17:49:38 : unknown work unit awxtask5bf78fc7b74s29djGTsAhAv

AWX /api/v2/jobs/221782/

  ...
  "failed": false,
  "started": "2025-03-25T16:49:12.043306Z",
  "finished": "2025-03-25T16:49:38.813620Z", # UTC == receptor log 17:49
  "elapsed": 26.77,
  "job_args": "[\"podman\", \"run\", \"--rm\", \"--tty\", \"--interactive\", \"--workdir\", \"/runner/project\", \"-v\", \"/opt/awx/awx_tmp/awx_221782_54ztx1uq/:/runner/:Z\", \"--env-file\", \"/opt/awx/awx_tmp/awx_221782_54ztx1uq/artifacts/221782/env.list\", \"--quiet\", \"--name\", \"ansible_runner_221782\", \"--user=root\", \"--log-level=info\", \"--mount=type=bind,src=/home/awx/mounts/10-awx-ssh.conf,dst=/etc/ssh/ssh_config.d/10-awx-ssh.conf,relabel=shared,ro=true\", \"--network=slirp4netns:enable_ipv6=true\", \"--userns=keep-id:uid=1001,gid=0\", \"--user=runner\", \"--cap-drop=ALL\", \"--pull=missing\", \"image-registry...\", \"ansible-playbook\", \"-u\", \"root\", \"--diff\", \"-l\", \"localhost\", \"-i\", \"/runner/inventory\", \"-e\", \"@/runner/env/extravars\", \"sleep.yml\"]",
    "job_cwd": "/runner/project",

@2and3makes23 @manish_singh

I was able to fix this.
My executor was running behind a firewall, and podman was not able to fetch the image from the quay.io registry.
Either launch your container using an image that is already available in your environment, or make sure your executor is able to reach the quay.io repos.
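
A quick way to confirm the registry is reachable from the executor is to pull the EE image manually; the image below is just an example, use whatever EE your jobs reference:

    # verify the executor can reach the registry and pull the EE image
    podman pull quay.io/ansible/awx-ee:latest
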
This issue can be closed.

This must be a different issue than the one you had, @golakiyaalice. What you describe (no access to the image) does not match the scenario described here (jobs are executed successfully using the respective images).