AWX won’t connect to external PG (connection failed: Success).
Hi,
Context:
- my AWX cluster (23.9.0) is deployed on a Rancher cluster and I’m targeting a standalone PG
- Rancher workers and PG are running on RHEL 8
- the installation works fine on a dev environment, but fails after moving to a prod env.
My awx pod (task or web) shows the following stack trace when I run the awx-manage utility:
bash-5.1$ awx-manage
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 289, in ensure_connection
    self.connect()
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/utils/asyncio.py", line 26, in inner
    return func(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 270, in connect
    self.connection = self.get_new_connection(conn_params)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/utils/asyncio.py", line 26, in inner
    return func(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/postgresql/base.py", line 275, in get_new_connection
    connection = self.Database.connect(**conn_params)
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/psycopg/connection.py", line 728, in connect
    raise ex.with_traceback(None)
psycopg.OperationalError: connection failed: Success
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/usr/bin/awx-manage", line 8, in <module>
    sys.exit(manage())
  File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/__init__.py", line 159, in manage
[ ... ]
From there, I tried to rule out all the common pitfalls (firewall, SELinux, certificates, etc.), but there is still something I cannot figure out.
I triple-checked /etc/tower/conf.d/credentials.py and everything looks fine.
Basic network checks are OK:
bash-5.1$ pg_isready -h <mypg> -p 8533
<mypg>:8533 - accepting connections
I tried a direct connection to my DB with the PG client inside the pod and everything is fine as well:
bash-5.1$ psql -h <mypg> -p 8533 -d awx -U awx
Password for user awx:
psql (13.11, server 12.11)
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, bits: 256, compression: off)
Type "help" for help.
awx=>
The connection is established; I can create a test table.
And I can see the connection in my PG logs:
2024-02-28 15:01:42 CET [62781]: [2-1] [0] user=[awx], db=[awx], client=[<ip>], appli=[[unknown]], sessionid=[65df3cc6.f53d]LOG: connection authorized: user=awx database=awx application_name=psql SSL enabled (protocol=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384, bits=256, compression=off)
Conclusion: there’s no problem with nftables, pg_hba or network connectivity in general.
From that, I deduced there might be a problem inside Django.
So I dumped the connection settings, and they match the content of credentials.py exactly:
print(connection._connections.settings)
{'default': {'ATOMIC_REQUESTS': True, 'ENGINE': 'awx.main.db.profiled_pg', 'NAME': 'awx', 'USER': 'awx', 'PASSWORD': '<pwd>', 'HOST': '<mypg>', 'PORT': '8533', 'OPTIONS': {'sslmode': 'verify-ca', 'sslrootcert': '/etc/pki/tls/certs/ca-bundle.crt', 'application_name': 'awx-3768--task-69c5585c5-542ql'}, 'DEBUG': True, 'AUTOCOMMIT': True, 'CONN_MAX_AGE': 0, 'CONN_HEALTH_CHECKS': False, 'TIME_ZONE': None, 'TEST': {'CHARSET': None, 'COLLATION': None, 'MIGRATE': True, 'MIRROR': None, 'NAME': None}}}
(I tried to add a debug flag, but it did not help)
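One caveat with the psql test above: as typed, it does not pass sslmode=verify-ca or sslrootcert, so it may not exercise the same TLS path as Django does. Below is a minimal sketch, assuming Python 3 is available in the pod, that flattens the settings dict into a libpq connection string usable with both psql and psycopg. build_conninfo and try_connect are hypothetical helper names (not part of AWX), and the password is deliberately kept out of the string.

```python
def build_conninfo(db):
    """Flatten a Django DATABASES entry into a libpq keyword/value string.

    The password is intentionally omitted so the string can be pasted
    into a terminal or a forum post.
    """
    params = {
        "host": db["HOST"],
        "port": db["PORT"],
        "dbname": db["NAME"],
        "user": db["USER"],
    }
    # OPTIONS carries the libpq extras (sslmode, sslrootcert, application_name).
    params.update(db.get("OPTIONS", {}))

    def quote(value):
        # libpq quoting: wrap in single quotes, backslash-escape \ and '.
        text = str(value).replace("\\", "\\\\").replace("'", "\\'")
        return f"'{text}'"

    return " ".join(f"{key}={quote(value)}" for key, value in params.items())


def try_connect(conninfo, password):
    """Open one connection with psycopg 3, the driver shown in the traceback."""
    import psycopg  # imported lazily so build_conninfo works without it

    with psycopg.connect(conninfo, password=password, connect_timeout=5) as conn:
        return conn.execute("SELECT version()").fetchone()[0]
```

From awx-manage shell_plus, something like build_conninfo(connection.settings_dict) should give a string that can be passed as the single argument to psql (psql "<string>") or to try_connect. If psql fails with that string too, the problem is probably in libpq or TLS verification rather than in Django.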
What’s interesting is that when I get the error message “connection failed: Success”, there is absolutely no activity in the PG log: the connection is never established.
On my dev env, by contrast, where everything works fine, I can see log entries such as:
2024-02-28 15:09:23 CET [43641]: [3-1] [0] user=[awx], db=[awx], client=[<ip>], appli=[awx-84-dispatcher_worker-task-8fb75bd4f-2lzzn], sessionid=[65df3e8a.aa79]LOG: disconnection: session time: 0:00:08.979 user=awx database=awx host=<ip> port=41836
The application name clearly identifies the pod.
So here I am, stuck with this failed connection and unsure where to look next. It must be something specific to my machines, but any debugging pointers would be helpful.
If anyone has some tips, I would gladly read them.
Thanks.
P.S.: one of the main differences between my dev and prod environments is that SELinux is in enforcing mode on prod, but switching to permissive mode did not help. The other is that IPv6 is enabled on prod, which might be related to my problem.
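To test the IPv6 suspicion, here is a minimal sketch using only the Python standard library: it lists every address the resolver returns for the PG host and attempts a plain TCP connect to each one. probe_addresses is a hypothetical helper name, and the check says nothing about TLS or authentication; it only shows whether an IPv6 address is returned and whether it is reachable from the pod.

```python
import socket
import sys


def probe_addresses(host, port, timeout=3.0):
    """Try a plain TCP connect to every address the resolver returns for host."""
    results = []
    for family, socktype, proto, _canon, sockaddr in socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM
    ):
        label = "IPv6" if family == socket.AF_INET6 else "IPv4"
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
            status = "ok"
        except OSError as exc:
            # Covers refused, unreachable, timeout and unsupported family.
            status = f"failed: {exc}"
        results.append((label, sockaddr[0], status))
    return results


if __name__ == "__main__" and len(sys.argv) == 3:
    for label, addr, status in probe_addresses(sys.argv[1], int(sys.argv[2])):
        print(f"{label:4} {addr:40} {status}")
```

Saved under any name (probe.py here is arbitrary) and run inside the pod as python3 probe.py <mypg> 8533, it prints one line per resolved address. An IPv6 entry that fails while the IPv4 one succeeds would point toward the address family rather than Django.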