AWX takes 15+ minutes to reconnect after DB failover (multi-region PostgreSQL cluster with pgpool)

Setup:

  • AWX deployed in Kubernetes using awx-operator (in Region1 and Region2).
  • Both AWX deployments use a shared PostgreSQL cluster deployed across both regions.
  • Database cluster details:
    • DB1 (Region1)
    • DB2 (Region1)
    • DB3 (Region2)
    • PGAF (Postgres Auto Failover) in Region2
  • DBs are configured so that:
    • Only one DB is Primary (read-write) at any time.
    • The other two are Standby (read-only) and in sync.
    • A cron job runs every minute on all DB nodes to promote a new primary during failover/switchover.
  • AWX connects to the database through pgpool on port 51902.

Observed behavior:

  • During failover from DB2 → DB1:
    • New DB becomes Primary within ~59s.
    • AWX successfully reconnects in ~1 minute. ✅
  • During failover from DB1 → DB2:
    • New DB becomes Primary within ~59s.
    • AWX, however, takes 15 minutes or more to detect and reconnect to the new primary. ❌
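
To put comparable numbers on the two failover directions, the reconnect time can be measured from outside AWX. The following is a minimal sketch, not something taken from the setup above: `time_until_writable` and `pgpool_probe` are hypothetical helpers, the host name, database name, and user are placeholders, and it assumes `psycopg2` is installed on the machine running the probe (the password is expected to come from `PGPASSWORD` or `~/.pgpass`).

```python
import time
from typing import Callable, Optional


def time_until_writable(
    probe: Callable[[], bool],
    timeout_s: float = 1800.0,
    interval_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll probe() until it returns True.

    Returns the elapsed seconds, or None if timeout_s passes first.
    A probe that raises is treated the same as one that returns False.
    """
    start = clock()
    while True:
        try:
            if probe():
                return clock() - start
        except Exception:
            pass  # failed connection attempt: not writable yet
        if clock() - start >= timeout_s:
            return None
        sleep(interval_s)


def pgpool_probe() -> bool:
    """Return True if a fresh connection through pgpool lands on a primary."""
    import psycopg2  # assumption: available where the probe runs

    conn = psycopg2.connect(
        host="pgpool.example.internal",  # placeholder host name
        port=51902,                      # pgpool port from the setup above
        dbname="awx",                    # placeholder database name
        user="awx",                      # placeholder user
        connect_timeout=10,
    )
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT pg_is_in_recovery()")
            return cur.fetchone()[0] is False
    finally:
        conn.close()
```

Starting `time_until_writable(pgpool_probe)` right when the failover is triggered gives the time until a new connection through pgpool reaches a writable primary. If that number is about a minute in both directions while AWX still takes 15 minutes, the delay would seem to sit in connections that were already open rather than in pgpool's view of the new primary.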

Troubleshooting / Attempts so far:

  • Specified all three DB IPs in the AWX connection string → no improvement.
  • Set PGCONNECT_TIMEOUT=10 → no improvement.
  • Manually restarted the AWX deployment pods (rollout restart) → issue persists.
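
For reference, this is roughly what the multi-host attempt looks like as a libpq connection string, with a few further libpq parameters that have not been tried yet. It is a sketch only: `build_dsn` is a hypothetical helper, the IPs, port, database name, and user are placeholders, and whether AWX passes these options through to libpq unchanged is an assumption. Note that `connect_timeout` (the same setting as `PGCONNECT_TIMEOUT`) only limits how long a new connection attempt may take; detecting that an already established connection has died is governed by the `keepalives*` and `tcp_user_timeout` parameters.

```python
def build_dsn(hosts, port, dbname, user, **options):
    """Build a libpq keyword/value connection string with several candidate hosts."""
    parts = {
        "host": ",".join(hosts),  # libpq tries the hosts in the order given
        "port": str(port),        # a single port applies to every host
        "dbname": dbname,
        "user": user,
    }
    parts.update({key: str(value) for key, value in options.items()})
    return " ".join(f"{key}={value}" for key, value in parts.items())


dsn = build_dsn(
    ["10.0.1.11", "10.0.1.12", "10.0.2.13"],  # placeholder IPs for DB1, DB2, DB3
    5432,                                     # placeholder port for direct DB access
    "awx",                                    # placeholder database name
    "awx",                                    # placeholder user
    connect_timeout=10,                  # seconds; same as PGCONNECT_TIMEOUT=10
    target_session_attrs="read-write",   # skip hosts that are read-only standbys
    keepalives=1,                        # enable TCP keepalives on the client side
    keepalives_idle=30,                  # seconds of idle time before the first probe
    keepalives_interval=10,              # seconds between probes
    keepalives_count=3,                  # unanswered probes before the connection is dropped
    tcp_user_timeout=30000,              # milliseconds; needs libpq 12 or newer
)
print(dsn)
```

The timeout values are illustrative, not tuned recommendations.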

Problem:
Failover detection is inconsistent. AWX reconnects quickly in one direction (DB2 → DB1) but takes ~15 minutes in the other direction (DB1 → DB2).

Ask:
Has anyone seen similar behavior with AWX and PostgreSQL failover (with pgpool/PGAF)?

  • Why might AWX detect failover faster in one direction but not the other?
  • Are there recommended AWX/Postgres/pgpool settings to improve failover detection and reconnection times?