I’m not a developer, but I set up our company’s internal RedHat Ansible Tower systems with the assistance of RedHat Architecture/Consulting.
NOTE: The design below was built with a lot of input during a RedHat consulting engagement, but it is explicitly NOT a configuration that RedHat/Ansible is ready to generally support. We’re still working through this with RedHat and making sure the Architecture/Consulting team and the Support team are on the same page.
We have a number of well-connected datacenters across the world (two in North America, one in Canada, and one in Europe), and our priority was disaster recovery over performance across the sites.
We started with two Postgres databases with automatic replication, one in each of the North American DCs (they are many states apart), and a load-balanced IP is set up with a company-wide DNS entry pointing to a “shared” IP address. That IP address is configured to point ONLY to the single Postgres DB that is active, and there is a short cutover time when the system switches from the DB in Site-A to Site-B. The cutover is not automated at this time - we want to ensure the state of the DB and the Ansible Tower systems is managed during the transition. Behind the scenes, the DB replication is handled by our DB team using their “normal” replication process; it is the same well-vetted replication process we use with many other products. The benefit is that my team treats the DB as a ‘service’ and we just maintain the Tower application and servers. (I’m not a Postgres expert, just a user, so you’re welcome to ask questions, but I’ll have to pass them along to my DB team for answers.)
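For a sense of what we check before trusting the shared endpoint, here is a minimal, hypothetical sketch of an Ansible play that pings Postgres through the load-balanced DNS entry; the group name, DNS name, and credentials are placeholders (not our real values), and it assumes the community.postgresql collection plus psycopg2 are available on the node running the check.

```yaml
---
# Hypothetical sketch - DNS name, group name, and credentials are placeholders.
# Requires the community.postgresql collection and psycopg2 on the checking node.
- name: Confirm the shared DB DNS name points at a live Postgres instance
  hosts: tower_nodes
  gather_facts: false
  tasks:
    - name: Ping Postgres through the load-balanced DNS entry
      community.postgresql.postgresql_ping:
        login_host: towerdb.company.com   # placeholder for the company-wide DB DNS name
        login_user: tower
        login_password: "{{ tower_db_password }}"
        db: tower
      register: db_state

    - name: Fail fast if the endpoint is not serving connections
      ansible.builtin.fail:
        msg: "Postgres is not reachable via the shared DNS entry"
      when: not db_state.is_available
```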
The primary Tower cluster (three nodes in Site-A) is a “normal” Tower cluster sharing the same RabbitMQ information and pointing to the same Postgres instance (the DR load-balanced IP address mentioned above).
There is a load-balanced IP and DNS name for each cluster (tower-A.site.company.com points to the load-balanced IP for the Site-A cluster, tower-B.site.company.com points to the Site-B cluster, etc.). There is a health check within the load balancer configuration that monitors the health of each server in the cluster, so we can cleanly reboot systems within the cluster without impacting end users. Failover of nodes within the cluster is automatic, so server/service outages are handled automatically.
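I can’t share our exact load balancer probe, but it is essentially the same kind of check as this hypothetical sketch against Tower’s /api/v2/ping/ API endpoint on each node; the group name and certificate handling are placeholders.

```yaml
---
# Hypothetical sketch: the same sort of per-node health probe the load balancer runs.
# Group name is a placeholder; adjust validate_certs for your certificate setup.
- name: Check each Tower node the way the load balancer does
  hosts: tower_site_a
  gather_facts: false
  tasks:
    - name: Expect HTTP 200 from Tower's ping endpoint on this node
      ansible.builtin.uri:
        url: "https://{{ inventory_hostname }}/api/v2/ping/"
        validate_certs: false
        status_code: 200
      delegate_to: localhost
```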
Finally, there is a load-balanced IP and DNS name for all the clusters (tower.company.com) that is made up of the per-cluster load-balanced DNS entries (tower-A.site, tower-B.site, tower-C.site, etc.). Like the Postgres DB load-balanced IP/DNS, this too is a manual failover at this time.
The secondary and tertiary Tower clusters in Site-B and Site-C are also “normal” Tower clusters; each shares its own RabbitMQ information within the cluster, which is DIFFERENT from the RabbitMQ config at the other sites. Each cluster points to the same DR load-balanced IP for its Postgres DB connection.
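To make the per-site split concrete, here is a rough, hypothetical sketch of what one site’s Tower setup inventory could look like (Tower 3.x installer style); hostnames, credentials, and the shared DB DNS name are placeholders, and the exact variable names should be checked against the installer version you run. The point is that the RabbitMQ values are unique to each site’s cluster, while pg_host is the same load-balanced DB DNS name everywhere.

```ini
# Hypothetical Site-B installer inventory - placeholder values only.
# Site-A's inventory looks the same except for its own node names and RabbitMQ settings;
# pg_host is the shared, load-balanced DB DNS name at every site.
[tower]
towerb-node1.site.company.com
towerb-node2.site.company.com
towerb-node3.site.company.com

[all:vars]
admin_password='REDACTED'
pg_host='towerdb.company.com'     ; the DR load-balanced Postgres DNS name
pg_port='5432'
pg_database='tower'
pg_username='tower'
pg_password='REDACTED'
rabbitmq_username=tower           ; RabbitMQ values are unique to this site's cluster
rabbitmq_password='REDACTED'
rabbitmq_cookie=siteb_cookie
```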
Only one cluster is active at any time - the others are powered up but have all of their Tower services disabled. We’ve automated the failover with a simple Ansible playbook (of course!) that performs these basic steps (a rough sketch of the playbook follows the list):
1 - On ALL Tower nodes, stop the Ansible Tower services
2 - On the nodes to become active, validate that they can contact the Postgres DB via the load balanced IP (using DNS)
3 - Start the Tower services only on the nodes in the site to be active.
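For reference, here is a rough, simplified sketch of what that failover playbook could look like; the group names (tower_all, tower_active_site), the shared DB DNS name, and the use of the ansible-tower-service control script are illustrative assumptions rather than our exact implementation.

```yaml
---
# Rough sketch only - group names, the DB DNS name, and credentials are placeholders.
- name: 1 - Stop Tower services on ALL Tower nodes
  hosts: tower_all
  become: true
  tasks:
    - name: Stop Ansible Tower services
      ansible.builtin.command: ansible-tower-service stop

- name: 2 - Validate DB reachability from the nodes that will become active
  hosts: tower_active_site
  become: true
  tasks:
    - name: Confirm the load-balanced Postgres DNS name answers on 5432
      ansible.builtin.wait_for:
        host: towerdb.company.com   # placeholder for the shared DB DNS entry
        port: 5432
        timeout: 30

- name: 3 - Start Tower services only on the active site
  hosts: tower_active_site
  become: true
  tasks:
    - name: Start Ansible Tower services
      ansible.builtin.command: ansible-tower-service start
```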
In our testing, the failover playbook works perfectly - it permits us to quickly bounce the Tower services if necessary (a recent DNS outage caused issues), and it ensures that the operations team does not have multiple Tower clusters running at the same time.
As mentioned above, this configuration is not specifically supported by the general RedHat/Ansible support team, but we’re working with them to have it accepted as an enterprise-class DR option.
Over time we plan to automate the failover and recovery of the Tower services both on-server and between sites, so we can perform maintenance and reduce developer impact from outages. Of course, this will require RedHat/Ansible support to approve and support the configuration - or we’ll have to refactor our configuration into something supportable and adjust our DR/failover/automation expectations accordingly.