Setup:
- AWX deployed in Kubernetes using awx-operator (in Region1 and Region2).
- Both AWX deployments use a shared PostgreSQL cluster deployed across both regions.
- Database cluster details:
  - DB1 (Region1)
  - DB2 (Region1)
  - DB3 (Region2)
  - PGAF (Postgres Auto Failover) in Region2
- DBs are configured so that:
  - Only one DB is Primary (read-write) at any time.
  - The other two are Standby (read-only) and in sync.
- A cron job runs every minute on all DB nodes to promote a new primary during failover/switchover.
- AWX connects to the database through pgpool on port 51902.
Observed behavior:
- During failover from DB2 → DB1:
  - New DB becomes Primary within ~59s.
  - AWX successfully reconnects in ~1 minute.
- During failover from DB1 → DB2:
  - New DB becomes Primary within ~59s.
  - But AWX takes ~15 minutes or more to detect and reconnect to the new primary.
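
To put comparable numbers on the two directions, the reconnect time could be measured with a small polling loop like the sketch below. It is only an illustration: `measure_outage` and its parameters are hypothetical names, and `probe` stands for any zero-argument callable that returns True once a read-write query through pgpool succeeds (for example, a short write against a scratch table). How the probe connects is left to the caller.

```python
import time


def measure_outage(probe, interval=1.0, max_wait=1800.0,
                   clock=time.monotonic, sleep=time.sleep):
    """Poll probe() until it succeeds and return the seconds elapsed.

    probe    -- hypothetical zero-argument callable; True means the
                database accepted a read-write query.
    interval -- seconds to wait between attempts.
    max_wait -- give up and return None after this many seconds.
    clock / sleep are injectable so the loop can be tested without waiting.
    """
    start = clock()
    while True:
        try:
            ok = bool(probe())
        except Exception:
            # Any connection or query error counts as "still down".
            ok = False
        elapsed = clock() - start
        if ok:
            return elapsed
        if elapsed >= max_wait:
            return None
        sleep(interval)
```

Starting this right when the failover is triggered, once per direction, would show whether the ~1 minute versus ~15 minute gap is also visible from a plain client outside AWX, or only from AWX itself.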
Troubleshooting / Attempts so far:
- Specified all three DB IPs in AWX connection string → no improvement.
- Set `PGCONNECT_TIMEOUT=10` → no improvement.
- Manually restarted AWX deployment pods (`rollout restart`) → issue still persists.
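
For context, a multi-host connection string of the kind described in the first attempt might be built as in the sketch below. The host addresses are placeholders, the database name and the helper `build_dsn` are hypothetical, and the direct database port is an assumption (only the pgpool port, 51902, is known from the setup). `target_session_attrs` and the `keepalives*` options are standard libpq parameters; they are shown as candidates to test, not as settings known to fix this issue.

```python
def build_dsn(hosts, port, dbname, connect_timeout=10, keepalive=True):
    """Build a libpq keyword/value connection string for several hosts.

    hosts  -- list of host names or IPs; libpq tries them in order.
    port   -- single port applied to every host (assumed identical here).
    dbname -- database name (hypothetical in the usage example).
    """
    params = {
        "host": ",".join(hosts),
        "port": port,
        "dbname": dbname,
        # Same effect as PGCONNECT_TIMEOUT: limits connection setup per host only.
        "connect_timeout": connect_timeout,
        # Skip hosts that are read-only standbys.
        "target_session_attrs": "read-write",
    }
    if keepalive:
        # TCP keepalives let an already-open connection notice a dead peer;
        # the values are illustrative, not tuned.
        params.update({
            "keepalives": 1,
            "keepalives_idle": 30,
            "keepalives_interval": 10,
            "keepalives_count": 3,
        })
    return " ".join(f"{key}={value}" for key, value in params.items())


# Placeholder addresses from the documentation range, not the real DB IPs.
dsn = build_dsn(["192.0.2.11", "192.0.2.12", "192.0.2.13"], 5432, "awx")
print(dsn)
```

Note that `connect_timeout` only bounds the setup of new connections. It does not affect connections that were already established before the failover, which may be one reason `PGCONNECT_TIMEOUT=10` changed nothing.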
Problem:
Failover detection is inconsistent. AWX reconnects quickly in one direction (DB2 → DB1) but takes ~15 minutes in the other direction (DB1 → DB2).
Ask:
Has anyone seen similar behavior with AWX and PostgreSQL failover (with pgpool/PGAF)?
- Why might AWX detect failover faster in one direction but not the other?
- Are there recommended AWX/Postgres/pgpool settings to improve failover detection and reconnection times?