AWX web CrashLoopBackOff - password authentication failed for awx

I had a working/running setup previously. I was running into some issues and needed to troubleshoot. I accidentally deleted my namespace. (Yeah, big oops!) I worked through getting my storage correct and all pods are loading/running now except awx web.

I am using awx-operator version 2.11.0, which runs AWX 23.7. I reviewed my logs and noted that they say password authentication failed for awx. I presume this is for the PostgreSQL DB. Can anyone confirm this?

I also validated that a secret file is present and contains both a username and password. I noted that a DB is present on both of my worker nodes in the cluster. Is that correct?

Here is a snippet of my logs:
[root@gsil-kube01 ~]# kubectl logs awx-web-ffc587896-4d684 -n awx

...
Traceback (most recent call last):
  File "/usr/bin/awx-manage", line 8, in <module>
    sys.exit(manage())
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/__init__.py", line 159, in manage
    if (connection.pg_version // 10000) < 12:
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/utils/connection.py", line 15, in __getattr__
    return getattr(self._connections[self._alias], item)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/utils/functional.py", line 57, in __get__
    res = instance.__dict__[self.name] = self.func(instance)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/postgresql/base.py", line 436, in pg_version
    with self.temporary_connection():
  File "/usr/lib64/python3.9/contextlib.py", line 119, in __enter__
    return next(self.gen)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 705, in temporary_connection
    with self.cursor() as cursor:
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/utils/asyncio.py", line 26, in inner
    return func(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 330, in cursor
    return self._cursor()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 306, in _cursor
    self.ensure_connection()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/utils/asyncio.py", line 26, in inner
    return func(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 289, in ensure_connection
    self.connect()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/utils.py", line 91, in __exit__
    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 289, in ensure_connection
    self.connect()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/utils/asyncio.py", line 26, in inner
    return func(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 270, in connect
    self.connection = self.get_new_connection(conn_params)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/utils/asyncio.py", line 26, in inner
    return func(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/postgresql/base.py", line 275, in get_new_connection
    connection = self.Database.connect(**conn_params)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/psycopg/connection.py", line 728, in connect
    raise ex.with_traceback(None)
django.db.utils.OperationalError: connection failed: password authentication failed for user "awx"
2024-05-09 19:48:13,372 WARN exited: awx-cache-clear (exit status 1; not expected)
2024-05-09 19:48:13,372 WARN exited: awx-cache-clear (exit status 1; not expected)
2024-05-09 19:48:13,469 INFO gave up: awx-cache-clear entered FATAL state, too many start retries too quickly
2024-05-09 19:48:13,469 INFO gave up: awx-cache-clear entered FATAL state, too many start retries too quickly
2024-05-09 19:48:13,469 WARN exited: ws-heartbeat (exit status 1; not expected)
2024-05-09 19:48:13,469 WARN exited: ws-heartbeat (exit status 1; not expected)
2024-05-09 19:48:14,471 INFO gave up: ws-heartbeat entered FATAL state, too many start retries too quickly
2024-05-09 19:48:14,471 INFO gave up: ws-heartbeat entered FATAL state, too many start retries too quickly
Processing Event: ver:3.0 server:supervisor serial:0 pool:superwatcher poolserial:0 eventname:PROCESS_STATE_FATAL len:72
2024-05-09 19:48:14,471 WARN received SIGQUIT indicating exit request
2024-05-09 19:48:14,471 WARN received SIGQUIT indicating exit request
2024-05-09 19:48:14,471 INFO waiting for superwatcher, nginx, uwsgi, daphne to die
2024-05-09 19:48:14,471 INFO waiting for superwatcher, nginx, uwsgi, daphne to die
...brutally killing workers...
2024-05-09 19:48:14,541 INFO stopped: nginx (exit status 0)
2024-05-09 19:48:14,541 INFO stopped: nginx (exit status 0)
2024-05-09 19:48:14,680 WARNING  [-] awx.conf.settings Database settings are not available, using defaults. error: connection failed: password authentication failed for user "awx"
2024-05-09 19:48:14,680 WARNING  Database settings are not available, using defaults. error: connection failed: password authentication failed for user "awx"
...

Here is my setup:

[root@gsil-kube01 ~]# cat storage.yaml 
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    storageClass.kubernetes.io/is-default-class: "true"
  name: local-storage
  namespace: awx
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: Immediate
#volumeBindingMode: WaitForFirstConsumer

---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: postgres-pv
  namespace: awx
spec:
  capacity:
    storage: 2Gi
  volumeMode: Filesystem
  accessModes:
  - ReadWriteMany
  persistentVolumeReclaimPolicy: Delete
  storageClassName: local-storage
  local:
    path: /var/lib/postgresql/data
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - gsil-kube01.idm.gsil.org
          - gsil-kube02.idm.gsil.org
          - gsil-kube03.idm.gsil.org

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-13-awx-postgres-13-0
  namespace: awx
spec:
  storageClassName: local-storage
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 2Gi

[root@gsil-kube01 ~]# cat awxvalues.yaml 
AWX:
  # enable use of awx-deploy template
  enabled: true
  name: awx
  spec:
    service_type: NodePort
    nodeport_port: 30080    
    admin_user: admin
    hostname: awx.idm.gsil.org
    image: gsil-docker1.idm.gsil.org:5001/quay.io/ansible/awx
    image_version: 23.7.0
    init_container_image: gsil-docker1.idm.gsil.org:5001/quay.io/ansible/awx-ee 
    init_container_image_version: latest
    ee_images:
    - name: AWX EE
      image: gsil-docker1.idm.gsil.org:5001/quay.io/ansible/awx-ee:latest    
    postgres_image: gsil-docker1.idm.gsil.org:5001/postgres
    postgres_image_version: "13"
    control_plane_ee_image: gsil-docker1.idm.gsil.org:5001/quay.io/ansible/awx-ee:latest
    redis_image: gsil-docker1.idm.gsil.org:5001/redis
    redis_image_version: "7"
    ldap_cacert_secret: awx-custom-certs
    ldap_password_secret: awx-ldap-password
    bundle_cacert_secret: awx-custom-certs
    extra_settings:
    - <LDAP_STUFF_HERE>....

customVolumes:
  postgres:
    enabled: true
    hostPath: /var/lib/postgresql/data
    size: 2Gi
    storageClassName: local-storage
  projects:
    enabled: true
    hostPath: /opt/projects/data
    size: 5Gi

[root@gsil-kube01 ~]# kubectl get sc,pv,pvc -n awx
NAME                                        PROVISIONER                    RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
storageclass.storage.k8s.io/local-storage   kubernetes.io/no-provisioner   Delete          Immediate           false                  70m

NAME                           CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                               STORAGECLASS    REASON   AGE
persistentvolume/postgres-pv   2Gi        RWX            Delete           Bound    awx/postgres-13-awx-postgres-13-0   local-storage            70m

NAME                                                  STATUS   VOLUME        CAPACITY   ACCESS MODES   STORAGECLASS    AGE
persistentvolumeclaim/postgres-13-awx-postgres-13-0   Bound    postgres-pv   2Gi        RWX            local-storage   70m

[root@gsil-kube01 ~]# kubectl get secret -n awx
NAME                             TYPE                 DATA   AGE
awx-admin-password               Opaque               1      19m
awx-app-credentials              Opaque               3      18m
awx-broadcast-websocket          Opaque               1      19m
awx-custom-certs                 Opaque               1      24h
awx-ldap-password                Opaque               1      22m
awx-postgres-configuration       Opaque               6      18m
awx-receptor-ca                  kubernetes.io/tls    2      18m
awx-receptor-work-signing        Opaque               2      18m
awx-secret-key                   Opaque               1      19m
redhat-operators-pull-secret     Opaque               1      19m
sh.helm.release.v1.gsil-awx.v1   helm.sh/release.v1   1      19m



[root@gsil-kube01 ~]# kubectl edit secret -n awx awx-postgres-configuration
# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: v1
data:
  database: YXd4
  host: YXd4LXBvc3RncmVzLTEz
  password: NU9QdnBOWDJhcTR3SjFQQTJxYkRMUDVqaEpDN3dmcE4=
  port: NTQzMg==
  type: bWFuYWdlZA==
  username: YXd4
kind: Secret
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: '{"apiVersion":"v1","kind":"Secret","metadata":{"labels":{"app.kubernetes.io/component":"awx","app.kubernetes.io/managed-by":"awx-operator","app.kubernetes.io/operator-version":"2.11.0","app.kubernetes.io/part-of":"awx"},"name":"awx-postgres-configuration","namespace":"awx"},"stringData":{"database":"awx","host":"awx-postgres-13","password":"5OPvpNX2aq4wJ1PA2qbDLP5jhJC7wfpN","port":"5432","type":"managed","username":"awx"}}'
  creationTimestamp: "2024-05-10T11:43:41Z"
  labels:
    app.kubernetes.io/component: awx
    app.kubernetes.io/managed-by: awx-operator
    app.kubernetes.io/operator-version: 2.11.0
    app.kubernetes.io/part-of: awx
  name: awx-postgres-configuration
  namespace: awx
  ownerReferences:
  - apiVersion: awx.ansible.com/v1beta1
    kind: AWX
    name: awx
    uid: 79af9d9a-61de-42d5-8d38-6c2e39dd44d6
  resourceVersion: "17960391"
  uid: c584c1e6-9f9b-4e18-8673-c9b3f1877ce0
type: Opaque
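
For reference, the data values in a Secret are only base64-encoded, so they can be read back by hand. A minimal sketch that decodes the non-password fields shown above; the helper name `decode` is made up for illustration and the values are copied from this secret:

```shell
#!/usr/bin/env bash
# Decode one base64-encoded Secret value.
# `decode` is a hypothetical helper name, not part of kubectl.
decode() {
  printf '%s' "$1" | base64 -d
}

# Values copied from the awx-postgres-configuration secret above
# (the password is left out on purpose).
for pair in database=YXd4 host=YXd4LXBvc3RncmVzLTEz port=NTQzMg== type=bWFuYWdlZA== username=YXd4; do
  key=${pair%%=*}    # text before the first '='
  value=${pair#*=}   # text after the first '=' (keeps the base64 padding)
  printf '%s: %s\n' "$key" "$(decode "$value")"
done
```

This prints database awx, host awx-postgres-13, port 5432, type managed and username awx, which matches the last-applied-configuration annotation.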

[root@gsil-kube02 pgdata]# ls -lah
total 64K
drwx------. 19 systemd-coredump root  4.0K May 10 11:43 .
drwxr-xr-x.  3 root             root    20 Mar 13 17:50 ..
drwx------.  6 systemd-coredump input   54 Mar 13 17:50 base
drwx------.  2 systemd-coredump input 4.0K May 10 11:44 global
drwx------.  2 systemd-coredump input    6 Mar 13 17:50 pg_commit_ts
drwx------.  2 systemd-coredump input    6 Mar 13 17:50 pg_dynshmem
-rw-------.  1 systemd-coredump input 4.8K Mar 13 17:50 pg_hba.conf
-rw-------.  1 systemd-coredump input 1.6K Mar 13 17:50 pg_ident.conf
drwx------.  4 systemd-coredump input   68 May 10 11:48 pg_logical
drwx------.  4 systemd-coredump input   36 Mar 13 17:50 pg_multixact
drwx------.  2 systemd-coredump input    6 Mar 13 17:50 pg_notify
drwx------.  2 systemd-coredump input    6 Mar 13 17:50 pg_replslot
drwx------.  2 systemd-coredump input    6 Mar 13 17:50 pg_serial
drwx------.  2 systemd-coredump input    6 Mar 13 17:50 pg_snapshots
drwx------.  2 systemd-coredump input    6 May 10 11:43 pg_stat
drwx------.  2 systemd-coredump input   84 May 10 11:58 pg_stat_tmp
drwx------.  2 systemd-coredump input   18 Mar 13 17:50 pg_subtrans
drwx------.  2 systemd-coredump input    6 Mar 13 17:50 pg_tblspc
drwx------.  2 systemd-coredump input    6 Mar 13 17:50 pg_twophase
-rw-------.  1 systemd-coredump input    3 Mar 13 17:50 PG_VERSION
drwx------.  3 systemd-coredump input   92 May  9 17:09 pg_wal
drwx------.  2 systemd-coredump input   18 Mar 13 17:50 pg_xact
-rw-------.  1 systemd-coredump input   88 Mar 13 17:50 postgresql.auto.conf
-rw-------.  1 systemd-coredump input  28K Mar 13 17:50 postgresql.conf
-rw-------.  1 systemd-coredump input   36 May 10 11:43 postmaster.opts
-rw-------.  1 systemd-coredump input  101 May 10 11:43 postmaster.pid
[root@gsil-kube02 pgdata]# pwd
/var/lib/postgresql/data/data/pgdata

[root@gsil-kube03 pgdata]# ls -lah
total 60K
drwx------. 19 systemd-coredump root  4.0K May 10 11:41 .
drwxr-xr-x.  3 root             root    20 Feb 22 18:50 ..
drwx------.  6 systemd-coredump input   54 Mar  1 19:02 base
drwx------.  2 systemd-coredump input 4.0K May  9 19:47 global
drwx------.  2 systemd-coredump input    6 Mar  1 19:02 pg_commit_ts
drwx------.  2 systemd-coredump input    6 Mar  1 19:02 pg_dynshmem
-rw-------.  1 systemd-coredump input 4.8K Mar  1 19:02 pg_hba.conf
-rw-------.  1 systemd-coredump input 1.6K Mar  1 19:02 pg_ident.conf
drwx------.  4 systemd-coredump input   68 May 10 11:41 pg_logical
drwx------.  4 systemd-coredump input   36 Mar  1 19:02 pg_multixact
drwx------.  2 systemd-coredump input    6 May  9 19:46 pg_notify
drwx------.  2 systemd-coredump input    6 Mar  1 19:02 pg_replslot
drwx------.  2 systemd-coredump input    6 Mar  1 19:02 pg_serial
drwx------.  2 systemd-coredump input    6 Mar  1 19:02 pg_snapshots
drwx------.  2 systemd-coredump input   84 May 10 11:41 pg_stat
drwx------.  2 systemd-coredump input    6 May 10 11:41 pg_stat_tmp
drwx------.  2 systemd-coredump input   18 May  3 13:29 pg_subtrans
drwx------.  2 systemd-coredump input    6 Mar  1 19:02 pg_tblspc
drwx------.  2 systemd-coredump input    6 Mar  1 19:02 pg_twophase
-rw-------.  1 systemd-coredump input    3 Mar  1 19:02 PG_VERSION
drwx------.  3 systemd-coredump input   92 May  6 10:35 pg_wal
drwx------.  2 systemd-coredump input   18 Mar  1 19:02 pg_xact
-rw-------.  1 systemd-coredump input   88 Mar  1 19:02 postgresql.auto.conf
-rw-------.  1 systemd-coredump input  28K Mar  1 19:02 postgresql.conf
-rw-------.  1 systemd-coredump input   36 May  9 19:46 postmaster.opts
[root@gsil-kube03 pgdata]# pwd
/var/lib/postgresql/data/data/pgdata

What suggestions can anyone make to troubleshoot this issue? Thanks!

Using a hostPath-based PV in a multi-node cluster is not recommended at all: hostPath mounts a local directory on whichever node the pod runs on, and there is no mechanism to synchronize files between nodes.

PSQL probably ran on gsil-kube03 until now, and it now seems to be running on gsil-kube02. If so, the PSQL that is running now is a completely new database and probably does not contain any data.

First, you should bind the PSQL pod to gsil-kube03. The simple way to achieve this is postgres_selector. Add a new label to gsil-kube03:

kubectl label nodes gsil-kube03.idm.gsil.org nodefor=psql

Then specify postgres_selector in your awxvalues.yaml and deploy it.

AWX:
  ...
  enabled: true
  name: awx
  spec:
    ...
    postgres_selector: |
      nodefor: psql

Your PSQL should then run on gsil-kube03 and, hopefully, start using the old /var/lib/postgresql/data/data/pgdata on that node.
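
To confirm that the label and the selector took effect, something like the following could help. This is only a sketch: the helper name `check_psql_placement` is made up, and the namespace and pod name are taken from the output earlier in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: show which nodes carry the label and where the PSQL pod landed.
# Arguments (both optional): namespace, pod name.
check_psql_placement() {
  # Nodes matching the selector; an empty list means the label is missing or misspelled.
  kubectl get nodes -l nodefor=psql -o name
  # The NODE column shows where the pod was scheduled.
  kubectl -n "${1:-awx}" get pod "${2:-awx-postgres-13-0}" -o wide
}

# Usage: check_psql_placement awx awx-postgres-13-0
```

If the first command prints nothing, the scheduler has no node to place the pod on and it will stay Pending.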

However, there is an additional concern with the removal of the namespace: awx-secret-key, awx-receptor-ca and awx-receptor-work-signing have also been removed and re-generated.
I think the following issues will probably arise later:

  • All saved Credentials will no longer work. There is no way to recover them; you have to re-create all of them.
  • All execution nodes are expired. If you have created any external execution nodes or hop nodes, you have to re-generate the install bundle and re-install them (before re-installing the bundle, remove the /etc/receptor directory on each remote node, since the install bundle does not overwrite existing cert files).

Thanks, that makes sense. I wasn’t confident which version was current. Thanks for helping to sort that out.

Not a big deal on the saved credentials. We only had one credential set up before I made my mistake. That will be easy to recreate.

I am not really clear on the execution nodes (or whether I even have that set up, for that matter). Are you referring to the execution environment specified in awxvalues.yaml?

If so, I think I understand that I should remove the /etc/receptor directory so that the install can create new files/folders. Correct?

Oh, please note I don’t have any proof. I just looked at the timestamps of the files on gsil-kube02 and gsil-kube03, and I guess the older files are the correct ones.

This is good news :smiley:

Nope, execution nodes are different from execution environments.
If you didn’t add any Instances or Instance Groups in AWX, you can ignore my concerns. Refer to the docs to learn what instances are: "8. Managing Capacity With Instances" in the Ansible AWX community documentation.

OK, I tried the code you suggested. The Postgres pod is not starting. kubectl describe shows that 3 nodes didn't match pod's node affinity/selector.

Ok, I didn’t have that set up. Thanks for clarifying. Nothing for me to do here.

Ultimately, if I had to wipe the DB and recreate it, I wouldn’t be set back very much. I was just starting to use the system. Better to learn and get mistakes out of the way earlier rather than later.

Could you please provide:

kubectl get node --show-labels
kubectl -n awx describe pod awx-postgres-13-0

OK, I made a typo when applying the label to my node, and I corrected that. I labeled kube03 first and observed what happens in the logs when the awx-web pod starts.

It shows the awx password failure, the same as earlier.

I then brought down my helm install, removed the label from kube03, and applied the label to kube02. I observed the logs again. This also fails on the awx password.

So it seems like it makes no difference which node I try to run the DB from.
I am guessing the easiest course of action is to wipe the DB folder path and start fresh? Again, I have lost very little at this point but it was worth exploring to see if I could save what had already been set up. Thoughts?

Hmm, I forgot to confirm this, but does the awx-postgres-configuration secret contain the correct OLD password?
If it was re-generated after the namespace was wiped, it may contain a newly generated random password that does not match the existing old data files on kube03.
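
One rough way to check this is to compare the secret's creation time with the dates of the data files on the node. A sketch only: the helper name `secret_created_at` is made up for illustration, and the namespace and secret name are the ones shown earlier in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print when a secret was created.
# Arguments (both optional): namespace, secret name.
secret_created_at() {
  kubectl -n "${1:-awx}" get secret "${2:-awx-postgres-configuration}" \
    -o jsonpath='{.metadata.creationTimestamp}{"\n"}'
}

# Usage: secret_created_at awx awx-postgres-configuration
```

In the output posted above, the secret shows creationTimestamp 2024-05-10T11:43:41Z while most files under pgdata on kube03 date from March, which would suggest the password was re-generated.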

I’m not sure. I don’t recall having set a password. It also looks like the secret gets generated and defined when the helm chart gets applied.

Maybe I could try to connect to the DB as the user awx and try to guess a few passwords that may have been set? If I find it, I could modify awx-postgres-configuration secret.

Indeed, this is technically possible, but the easiest way is to reset the password in PSQL:

# Reveal password
$ kubectl -n awx get secret awx-postgres-configuration -o json | jq -r '.data.password' | base64 -d; echo
5OPv****************wfpN

# Reset password
$ kubectl -n awx exec -it awx-postgres-13-0 -- bash
bash-5.1$ psql
postgres=# ALTER USER awx WITH PASSWORD 'YOUR_PASSWORD_HERE';
postgres=# \q
bash-5.1$ exit

I got this far in the procedure:
kubectl -n awx exec -it awx-postgres-13-0 -- bash
bash-5.1$ psql

Then I get a message:

psql: error: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: FATAL: role "root" does not exist

But I noted this CRIT message from the first log file line:

Supervisor is running as root. Privileges were not dropped because no user is specified in the config file. If you intend to run as root, you can set user=root in the config file to avoid this message.

Are these two things related?

I’m AFK now, but try psql -U awx or psql -U awx -d awx.
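
Some background: without -U, psql uses the current OS user name (root in that shell) as the role name, which is why it reports that role "root" does not exist. A small sketch of a non-interactive variant; the wrapper name `psql_as_awx` is made up, and the namespace and pod name are taken from this thread:

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run one SQL statement as the awx role inside the PSQL pod.
# Argument: the SQL statement to run.
psql_as_awx() {
  kubectl -n awx exec awx-postgres-13-0 -- psql -U awx -d awx -c "$1"
}

# Usage: psql_as_awx 'SELECT current_user;'
```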

I will be back tomorrow :smile:

Will do. I will be back on 5/13.

Okay, psql -U awx works for me.

$ kubectl -n awx exec -it awx-postgres-13-0 -- bash
root@awx-postgres-13-0:/# psql -U awx
psql (13.15 (Debian 13.15-1.pgdg120+1))
Type "help" for help.

awx=# ALTER USER awx WITH PASSWORD '5OPv****************wfpN';
ALTER ROLE
awx=# \q
root@awx-postgres-13-0:/# exit

I am all set. My web pod is running as expected now. THANK YOU for the assistance! :smiley:
