Controller can no longer create automation nodes

A multi-attach error usually means you have the PVC mounted in multiple containers, but the storageClass that created the PVC does not support RWX. I would check to see if any other pods are mounting the same Postgres pvc. As a first step.

AWX Team

Thank you! Will do.

Kevin

I’m almost sure the PVC thing is a red herring. First, the GUI is working fine. I have all my data, all my templates, all my inventories, and the job records exist. Second, I have this yaml for the PVC:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
annotations:
pv.kubernetes.io/bind-completed: "yes"
pv.kubernetes.io/bound-by-controller: "yes"
volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/vsphere-volume
creationTimestamp: "2023-01-16T20:17:46Z"
finalizers:
- kubernetes.io/pvc-protection
labels:
app.kubernetes.io/component: database
app.kubernetes.io/instance: postgres-13-it-mass-awx
app.kubernetes.io/managed-by: awx-operator
app.kubernetes.io/name: postgres-13

This smells like something is wrong in the working pod. The controller can tell it to spin up a new automation pod, but it cannot do it. I don’t believe this is likely a database error.

So, here is the log of a failure trying to update a github project and then a success from an identical system. What I see is the EOF event happens before the post run event on the failure. That’s really thin gruel for analysis and probably means nothing. I’m told to look at stdout for details. Any insights into how I might do that? Anywhere else I might be able to look for clues?

2023-07-07 13:51:02,277 INFO [6592890e3fcd4ba39b4605a69fde8808] awx.analytics.job_lifecycle projectupdate-139415 controller node chosen
2023-07-07 13:51:02,278 INFO [6592890e3fcd4ba39b4605a69fde8808] awx.analytics.job_lifecycle projectupdate-139415 execution node chosen
2023-07-07 13:51:02,328 INFO [6592890e3fcd4ba39b4605a69fde8808] awx.analytics.job_lifecycle projectupdate-139415 waiting
2023-07-07 13:51:02,456 INFO [6592890e3fcd4ba39b4605a69fde8808] awx.analytics.job_lifecycle projectupdate-139415 pre run
2023-07-07 13:51:02,461 INFO [6592890e3fcd4ba39b4605a69fde8808] awx.analytics.job_lifecycle projectupdate-139415 preparing playbook
2023-07-07 13:51:02,498 INFO [6592890e3fcd4ba39b4605a69fde8808] awx.analytics.job_lifecycle projectupdate-139415 running playbook
2023-07-07 13:51:02,531 INFO [6592890e3fcd4ba39b4605a69fde8808] awx.analytics.job_lifecycle projectupdate-139415 work unit id received
2023-07-07 13:51:02,546 INFO [6592890e3fcd4ba39b4605a69fde8808] awx.analytics.job_lifecycle projectupdate-139415 work unit id assigned
2023-07-07 13:51:04,849 INFO [-] awx.analytics.job_lifecycle projectupdate-139415 stats wrapup finished
2023-07-07 13:51:05,056 INFO [6592890e3fcd4ba39b4605a69fde8808] awx.main.commands.run_callback_receiver Starting EOF event processing for Job 139415
2023-07-07 13:51:05,058 INFO [6592890e3fcd4ba39b4605a69fde8808] awx.analytics.job_lifecycle projectupdate-139415 post run
2023-07-07 13:51:05,146 INFO [6592890e3fcd4ba39b4605a69fde8808] awx.analytics.job_lifecycle projectupdate-139415 finalize run
2023-07-07 13:51:05,158 WARNING [6592890e3fcd4ba39b4605a69fde8808] awx.main.dispatch project_update 139415 (failed) encountered an error (rc=None), please see task stdout for details.

success:

2023-07-07 13:52:25,843 INFO [927983d6cb734c91babf825790086fb3] awx.analytics.job_lifecycle projectupdate-1366 controller node chosen
2023-07-07 13:52:25,843 INFO [927983d6cb734c91babf825790086fb3] awx.analytics.job_lifecycle projectupdate-1366 execution node chosen
2023-07-07 13:52:25,901 INFO [927983d6cb734c91babf825790086fb3] awx.analytics.job_lifecycle projectupdate-1366 waiting
2023-07-07 13:52:26,327 INFO [927983d6cb734c91babf825790086fb3] awx.analytics.job_lifecycle projectupdate-1366 pre run
2023-07-07 13:52:26,354 INFO [927983d6cb734c91babf825790086fb3] awx.analytics.job_lifecycle projectupdate-1366 preparing playbook
2023-07-07 13:52:26,393 INFO [927983d6cb734c91babf825790086fb3] awx.analytics.job_lifecycle projectupdate-1366 running playbook
2023-07-07 13:52:26,414 INFO [927983d6cb734c91babf825790086fb3] awx.analytics.job_lifecycle projectupdate-1366 work unit id received
2023-07-07 13:52:26,430 INFO [927983d6cb734c91babf825790086fb3] awx.analytics.job_lifecycle projectupdate-1366 work unit id assigned
2023-07-07 13:52:52,571 INFO [-] awx.analytics.job_lifecycle projectupdate-1366 stats wrapup finished
2023-07-07 13:52:52,763 INFO [927983d6cb734c91babf825790086fb3] awx.analytics.job_lifecycle projectupdate-1366 post run
2023-07-07 13:52:52,763 INFO [927983d6cb734c91babf825790086fb3] awx.main.commands.run_callback_receiver Starting EOF event processing for Job 1366
2023-07-07 13:52:53,503 INFO [927983d6cb734c91babf825790086fb3] awx.analytics.job_lifecycle projectupdate-1366 finalize run

Another instance of the same event on the failing system with DEBUG turned on. I see nothing helpful.

2023-07-07 14:14:41,880 DEBUG [2c324d9e6a9c4277acd4b0a9fb9ef45a] awx.main.dispatch task 05abd5d7-869c-4501-9d39-74b36f900a93 starting awx.main.scheduler.tasks.task_manager(*[])
2023-07-07 14:14:41,883 DEBUG [2c324d9e6a9c4277acd4b0a9fb9ef45a] awx.main.dispatch scaling down worker pid:13924
2023-07-07 14:14:41,884 DEBUG [2c324d9e6a9c4277acd4b0a9fb9ef45a] awx.main.dispatch scaling down worker pid:13930
2023-07-07 14:14:41,885 DEBUG [2c324d9e6a9c4277acd4b0a9fb9ef45a] awx.main.dispatch scaling down worker pid:13932
2023-07-07 14:14:41,885 DEBUG [56b5ae63bf2446ad9a23f8cd27f487ec] awx.main.dispatch worker exiting gracefully pid:13924
2023-07-07 14:14:41,886 DEBUG [56b5ae63bf2446ad9a23f8cd27f487ec] awx.main.dispatch worker exiting gracefully pid:13930
2023-07-07 14:14:41,887 DEBUG [2c324d9e6a9c4277acd4b0a9fb9ef45a] awx.main.dispatch task 959118df-2b92-4340-a4e7-dd1550ad1356 starting awx.main.scheduler.tasks.dependency_manager(*[])
2023-07-07 14:14:41,887 DEBUG [2c324d9e6a9c4277acd4b0a9fb9ef45a] awx.main.dispatch task ab9c6c4c-48b9-431c-830a-47c35b22b495 starting awx.main.analytics.analytics_tasks.send_subsystem_metrics(*[])
2023-07-07 14:14:41,889 DEBUG [56b5ae63bf2446ad9a23f8cd27f487ec] awx.main.dispatch worker exiting gracefully pid:13932
2023-07-07 14:14:42,869 DEBUG [9d34897d960f4a909943fc691cc5de35] awx.main.dispatch task 29851216-1182-48ee-a77f-24699201625a starting awx.main.scheduler.tasks.task_manager(*[])
2023-07-07 14:14:42,911 INFO [9d34897d960f4a909943fc691cc5de35] awx.analytics.job_lifecycle projectupdate-139419 controller node chosen
2023-07-07 14:14:42,911 INFO [9d34897d960f4a909943fc691cc5de35] awx.analytics.job_lifecycle projectupdate-139419 execution node chosen
2023-07-07 14:14:42,965 INFO [9d34897d960f4a909943fc691cc5de35] awx.analytics.job_lifecycle projectupdate-139419 waiting
2023-07-07 14:14:42,977 DEBUG [9d34897d960f4a909943fc691cc5de35] awx.main.dispatch task edd2d16b-f1a3-47c9-9b24-5859d5724961 starting awx.main.tasks.jobs.RunProjectUpdate(*[139419])
2023-07-07 14:14:43,173 INFO [9d34897d960f4a909943fc691cc5de35] awx.analytics.job_lifecycle projectupdate-139419 pre run
2023-07-07 14:14:43,205 INFO [9d34897d960f4a909943fc691cc5de35] awx.analytics.job_lifecycle projectupdate-139419 preparing playbook
2023-07-07 14:14:43,242 INFO [9d34897d960f4a909943fc691cc5de35] awx.analytics.job_lifecycle projectupdate-139419 running playbook
2023-07-07 14:14:43,264 INFO [9d34897d960f4a909943fc691cc5de35] awx.analytics.job_lifecycle projectupdate-139419 work unit id received
2023-07-07 14:14:43,277 INFO [9d34897d960f4a909943fc691cc5de35] awx.analytics.job_lifecycle projectupdate-139419 work unit id assigned
2023-07-07 14:14:44,786 DEBUG [-] awx.main.commands.run_callback_receiver ProjectUpdateEvent.objects.bulk_create(1)
2023-07-07 14:14:44,786 DEBUG [-] awx.main.commands.run_callback_receiver ProjectUpdateEvent.objects.bulk_create(1)
2023-07-07 14:14:44,787 DEBUG [-] awx.main.commands.run_callback_receiver ProjectUpdateEvent.objects.bulk_create(1)
2023-07-07 14:14:44,794 DEBUG [-] awx.main.commands.run_callback_receiver ProjectUpdateEvent.objects.bulk_create(1)
2023-07-07 14:14:46,040 DEBUG [-] awx.main.commands.run_callback_receiver ProjectUpdateEvent.objects.bulk_create(1)
2023-07-07 14:14:46,043 DEBUG [-] awx.main.commands.run_callback_receiver ProjectUpdateEvent.objects.bulk_create(1)
2023-07-07 14:14:46,093 INFO [-] awx.analytics.job_lifecycle projectupdate-139419 stats wrapup finished
2023-07-07 14:14:46,290 INFO [9d34897d960f4a909943fc691cc5de35] awx.main.commands.run_callback_receiver Starting EOF event processing for Job 139419
2023-07-07 14:14:46,291 INFO [9d34897d960f4a909943fc691cc5de35] awx.analytics.job_lifecycle projectupdate-139419 post run
2023-07-07 14:14:46,368 INFO [9d34897d960f4a909943fc691cc5de35] awx.analytics.job_lifecycle projectupdate-139419 finalize run
2023-07-07 14:14:46,374 WARNING [9d34897d960f4a909943fc691cc5de35] awx.main.dispatch project_update 139419 (failed) encountered an error (rc=None), please see task stdout for details.
2023-07-07 14:14:46,374 DEBUG [9d34897d960f4a909943fc691cc5de35] awx.main.dispatch task edd2d16b-f1a3-47c9-9b24-5859d5724961 starting awx.main.tasks.system.handle_work_error(*[])
2023-07-07 14:14:46,382 DEBUG [9d34897d960f4a909943fc691cc5de35] awx.main.dispatch task edd2d16b-f1a3-47c9-9b24-5859d5724961 starting awx.main.tasks.system.handle_work_success(*[])

OK. This may have been due to an expired Project credential. The error does not look like I expect, but who am I to say. I reinstalled the yaml for the whole project then got a credential error. I fixed the credential and things are much happier now. So, I don’t know why reinstalling the project got me to my error, but I’ll take it.