External PostgreSQL connectivity problem

AWX won’t connect to external PG (connection failed: Success).

Hi,

Context :

  • my AWX cluster (23.9.0) is deployed on a Rancher cluster and I’m targeting a standalone PG
  • Rancher workers and PG are running on RHEL 8
  • the installation works in a dev environment, but fails when moving to a prod environment.

My AWX pod (task or web) shows the following stack trace when I run the awx-manage utility:

    bash-5.1$ awx-manage
    Traceback (most recent call last):
    File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 289, in ensure_connection
        self.connect()
    File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/utils/asyncio.py", line 26, in inner
        return func(*args, **kwargs)
    File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/base/base.py", line 270, in connect
        self.connection = self.get_new_connection(conn_params)
    File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/utils/asyncio.py", line 26, in inner
        return func(*args, **kwargs)
    File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/django/db/backends/postgresql/base.py", line 275, in get_new_connection
        connection = self.Database.connect(**conn_params)
    File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/psycopg/connection.py", line 728, in connect
        raise ex.with_traceback(None)
    psycopg.OperationalError: connection failed: Success

    The above exception was the direct cause of the following exception:

    Traceback (most recent call last):
    File "/usr/bin/awx-manage", line 8, in <module>
        sys.exit(manage())
    File "/var/lib/awx/venv/awx/lib64/python3.9/site-packages/awx/__init__.py", line 159, in manage
    [ ... ]

I then tried to rule out all the common pitfalls (firewall, SELinux, certificates, etc.), but there is still something I cannot figure out.

I triple checked the /etc/tower/conf.d/credentials.py and everything is fine.

Basic network checks are OK :

    bash-5.1$ pg_isready -h <mypg> -p 8533
    <mypg>:8533 - accepting connections

I tried a direct connection to my db with the PG client inside the POD and everything is fine as well :

    bash-5.1$ psql -h <mypg> -p 8533 -d awx -U awx
    Password for user awx: 
    psql (13.11, server 12.11)
    SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, bits: 256, compression: off)
    Type "help" for help.

    awx=> 

The connection is established ; I can create a test table.
And I can see the connection in my PG logs :

    2024-02-28 15:01:42 CET [62781]: [2-1] [0] user=[awx], db=[awx], client=[<ip>>], appli=[[unknown]], sessionid=[65df3cc6.f53d]LOG:  connection authorized: user=awx database=awx application_name=psql SSL enabled (protocol=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384, bits=256, compression=off)

Conclusion : there’s no problem with NFtables, pg_hba or any network connectivity.

From that, I deduced there might be a problem inside Django.
I thus debugged the connection settings and it’s the very same content as what’s inside the credentials.py :

    print(connection._connections.settings)

    {'default': {'ATOMIC_REQUESTS': True, 'ENGINE': 'awx.main.db.profiled_pg', 'NAME': 'awx', 'USER': 'awx', 'PASSWORD': '<pwd>', 'HOST': '<mypg>>', 'PORT': '8533', 'OPTIONS': {'sslmode': 'verify-ca', 'sslrootcert': '/etc/pki/tls/certs/ca-bundle.crt', 'application_name': 'awx-3768--task-69c5585c5-542ql'}, 'DEBUG': True, 'AUTOCOMMIT': True, 'CONN_MAX_AGE': 0, 'CONN_HEALTH_CHECKS': False, 'TIME_ZONE': None, 'TEST': {'CHARSET': None, 'COLLATION': None, 'MIGRATE': True, 'MIRROR': None, 'NAME': None}}}

(I tried to add a debug flag, but it did not help)
What’s interesting is that when I get the error message “connection failed: Success”, there is absolutely no activity in the PG log: the connection is never established.

Whereas, on my dev env, where everything is working fine, I can see logs such as :

    2024-02-28 15:09:23 CET [43641]: [3-1] [0] user=[awx], db=[awx], client=[<ip>], appli=[awx-84-dispatcher_worker-task-8fb75bd4f-2lzzn], sessionid=[65df3e8a.aa79]LOG:  disconnection: session time: 0:00:08.979 user=awx database=awx host=<ip> port=41836

The client agent is clearly defined as a pod.

Here I am, stuck with that failed connection and I don’t know where to go to get rid of that. It must be something related to my machines, but if I could get some debug pointers, it would be helpful.
If anyone has some tips, I would gladly read them.

Thanks.

P.S : one of the main differences between my dev env and my prod env is that SELinux is in enforcing mode, but I tried permissive mode without any success. And the other one is that IPv6 is activated on my prod env. It might be related to my problem.

  1. Are you running your network tests from inside an AWX pod?
  2. Have you verified you have a complete IPv6 connection from AWX to the PostgreSQL box?

That error output is hilariously unhelpful (task failed successfully!), and I have no experience with a setup similar to yours. My only guess is that IPv6 is failing and your tests with the pg* tools gracefully succeed by falling back to IPv4 (or maybe they prefer IPv4 over IPv6), whereas Django’s Python stack has no IPv6-to-IPv4 fallback and fails.
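To test that guess from inside the pod, here is a minimal sketch that lists every address the resolver returns for the database host and tries a plain TCP connect to each one. It only uses the Python standard library; the function name `probe` is made up for this example, and a successful TCP connect says nothing about TLS or authentication.

```python
import socket

def probe(host, port, timeout=3.0):
    """Try a plain TCP connect to every address the resolver returns.

    Returns a list of (family, address, status) tuples, in resolver order.
    """
    report = []
    for family, _type, proto, _canon, sockaddr in socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM
    ):
        name = "IPv6" if family == socket.AF_INET6 else "IPv4"
        try:
            # Socket creation is inside the try block so that a host with
            # IPv6 disabled is reported as a failure instead of crashing.
            with socket.socket(family, socket.SOCK_STREAM, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
            report.append((name, sockaddr[0], "ok"))
        except OSError as exc:
            report.append((name, sockaddr[0], f"failed: {exc}"))
    return report

# Example call (replace the placeholder with your own host):
# for row in probe("<mypg>", 8533):
#     print(row)
```

If the IPv6 rows fail while the IPv4 rows succeed, the problem is most likely on the network or PostgreSQL listener side rather than in the AWX settings.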

Thanks for your reply. Yes, the tests were made within the AWX task pod.
The more I think about it, the more I suspect an IPv6 issue. What put me on that track was a ping that got its reply from an IPv6 address.

The theory about Django not being able to handle IPv6 addresses sounds (sadly) very possible.

I’ll explore further with some TCP dumps tomorrow !

A couple of things then: did you enable the host(ssl) lines in pg_hba.conf for IPv6, and does listen_addresses in postgresql.conf include the IPv6 address?

Aside from double-checking the pg settings, you could also try toggling the IPv6 listener setting in the AWX CRD just to see if that changes things.
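For the listen_addresses question, a rough sketch of the check in Python might look like this. It assumes the documented PostgreSQL semantics (`*` means all interfaces, `0.0.0.0` all IPv4, `::` all IPv6); the helper name `listens_on` and the sample addresses are made up for illustration, and hostname entries such as `localhost` are deliberately not resolved.

```python
import ipaddress

def listens_on(listen_addresses, target_ip):
    """Return True if a listen_addresses value covers the given server IP.

    Hostname entries are skipped (not resolved), so this can under-report.
    """
    target = ipaddress.ip_address(target_ip)
    for entry in (part.strip() for part in listen_addresses.split(",")):
        if entry == "*":
            return True  # all available interfaces
        if entry == "0.0.0.0" and target.version == 4:
            return True  # all IPv4 addresses
        if entry == "::" and target.version == 6:
            return True  # all IPv6 addresses
        try:
            if ipaddress.ip_address(entry) == target:
                return True
        except ValueError:
            continue  # a hostname or empty entry; ignored in this sketch
    return False

# Documentation-range addresses, used here only as placeholders.
print(listens_on("0.0.0.0", "2001:db8::10"))                # IPv4-only listener
print(listens_on("0.0.0.0, 2001:db8::10", "2001:db8::10"))  # IPv6 address added
```

Run it with the value from your postgresql.conf and the IPv6 address that the AWX pod resolves for the database host.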

Thanks for the clues : you were right. I was misled by what was shown in the netstat command result.
I read “::/1” as any IPv6 address whereas it’s only the local interface (fe80).
The correct line would have been “::/0”.

On top of that, I got a response over UDP on the port, which led me down the wrong path.

Since I added the IPv6 address to listen_addresses, everything is now fine.

Yet, I still don’t get why Django would use the IPv6 address, and psql the IPv4.
But that’s not that important.

Thanks again !

Glad I could help! I was pretty confident it was a networking configuration problem, but wasn’t sure where. It also helped that I had recently set up a postgresql vm for a DBA over the last week. Ran into similar gotchas just on IPv4 since it was a first time for me lol.

Hey guys,

I am deploying AWX in an environment that has no support for IPv6. I set the flag ipv6_disable to true in my AWX CRD but am seeing the same error as OP in this post. Do I have to alter the Django configuration to avoid IPv6 usage? Even if I reset the pg_hba.conf values the connection would still fail over that protocol.

  • Max

Hi Max,

If you don’t have any IPv6 activated on your machines, then don’t worry about it.
I would strongly suggest temporarily deactivating all the security mechanisms that could interfere with your connection to PG (firewall on both machines, and add a wide range in your pg_hba.conf).

Then run the pg_isready command within your pod, in the AWX task container.
It checks whether the server accepts connections on the given host and port, although unlike awx-manage it does not authenticate.
If it succeeds, re-enable your security mechanisms one by one to determine which one is causing you trouble.

If it fails, it might be network related.

Hope this helps !
