AWX HA

Hello All,

Apologies if this should go under a different mailing list.

Wanted to know when setting up AWX, is it possible to:

  1. Specify the path where AWX will install too?

  2. Point to an external RabbitMQ Cluster? We have our own and it would be much simpler if we can just connect to one.

  3. Point to an external PostgreSQL Cluster? Same reason as above.

Cheers,
TK

Since AWX is installed as a container, #1 isn't super relevant.

  1. No

  2. Yes, see: https://github.com/ansible/awx/blob/devel/installer/inventory#L63-L69
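For anyone landing here later: the variable names at that link also appear further down this thread. An external-RabbitMQ (and external-Postgres) inventory would look roughly like this; the hostnames and credentials below are placeholders, not values from the linked file:

```
# installer/inventory (fragment) -- point AWX at external services
pg_hostname=external-postgres.example.com
pg_username=awx
pg_password=awxpass
pg_database=awx
pg_port=5432

rabbitmq_host=external-rabbitmq.example.com
rabbitmq_port=5672
rabbitmq_vhost=tower
rabbitmq_username=tower
rabbitmq_password=towerpass
```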

Hey Matthew,

So here’s some background. We’re attempting to set up AWX with HA. Instead of the internally hosted single RabbitMQ and PostgreSQL Docker images, we want to connect AWX to the highly available PostgreSQL and RabbitMQ clusters we already run, in containers or otherwise. That way we only have to take care of the Docker/AWX GUI container, and not the single RabbitMQ and single PostgreSQL containers, should issues arise.

Having said that, I would only need to cater to the AWX container and any number of instances of it on separate hosts for redundancy.

As for keeping everything in Docker containers: how can I achieve HA with 3 redundant AWX instances if everything stays in containers? Assume OpenShift and Kubernetes are not available and I’m using VMs with Docker (on KVM or VMware, for example). From the available documentation pages, I’m still having trouble visualizing how the 3 RabbitMQ instances and 3 PostgreSQL instances would be clustered and talk to each other in the event 1-2 nodes go offline. For example, what replication would the 3 PostgreSQL instances and the 3 RabbitMQ instances use if everything is kept in Docker containers?
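On the "1-2 nodes go offline" question, the general rule (for any quorum-based cluster, nothing AWX-specific) is majority quorum: a cluster of n voting members can lose floor((n-1)/2) of them and still form a majority. A quick sketch:

```python
def quorum_size(n: int) -> int:
    """Smallest number of members that constitutes a majority of n."""
    return n // 2 + 1

def tolerated_failures(n: int) -> int:
    """How many members can fail while a majority remains reachable."""
    return (n - 1) // 2

# A 3-node cluster needs 2 members for quorum and survives 1 failure;
# losing 2 of 3 nodes leaves the survivor without a majority.
for n in (1, 3, 5):
    print(f"{n} nodes: quorum={quorum_size(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

So with 3 RabbitMQ or PostgreSQL instances you can ride out one node going offline, but not two.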

This all feeds into the supportability realm of things.

Cheers,

TK

Yeah, you are pretty far outside of our supported deployment scenario. You’ll need to sort out the Postgres and RabbitMQ bits on your own; that’s a deep rabbit hole that I solved in a particular way for the Docker deployment and deal with in a different way on K8s/OpenShift. If those platforms aren’t available to you, then you’ll need to roll your own solution. Note that each AWX worker needs to be added to the cluster individually, and they’ll need their own queues/exchanges to negotiate work between each other. That only works when all AWX cluster members know about each other and know how to route messages to each other individually, to say nothing of actually configuring RabbitMQ’s own clustering. You’ve got a good bit of work to do here.

I’ll also caution that any random AWX upgrade is probably going to break you in strange ways that we won’t be able to help with. If you’re okay doing the work to make AWX work and then maintaining that across upgrades then good luck.

PostgreSQL high availability is another matter altogether, and I’ll point out that the AWX installer does not set this up (it only configures a single Postgres instance, or requires that you provide connectivity to an external Postgres that you maintain)… it’s not really a big deal to set up streaming replication on your own. There are other off-the-shelf solutions for automatic failover and problem detection, but that’s out of scope for this mailing list.
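For readers who haven’t done it before, the streaming replication mentioned above boils down to something like the following on PostgreSQL 10. The hostnames, user, and network range are placeholders, and this is only a minimal sketch of a primary/standby pair, not a hardened HA setup (no failover automation):

```
# postgresql.conf on the primary
wal_level = replica
max_wal_senders = 5

# pg_hba.conf on the primary: allow the standby's replication user to connect
host  replication  replicator  10.0.0.0/24  md5

# On the standby: clone the primary and write a recovery.conf automatically
pg_basebackup -h primary-host -U replicator -D /var/lib/pgsql/10/data -R
# The -R flag generates a recovery.conf containing standby_mode = 'on'
# and a primary_conninfo pointing back at the primary.
```

Automatic failover on top of this is where tools like the Patroni setup mentioned later in this thread come in.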

Hey Matthew,

Ok, noted all that.

I’ve got most of all these bits except the AWX piece actually. Here’s what it looks like so far.

RabbitMQ Cluster is ready and tested. PostgreSQL 10 with Patroni is also up and running.

The two above clusters were relatively easy. The single instance of AWX I have running connected fine to the PostgreSQL 10 Cluster via the VIP I’ve setup. It created tables without issues. What remains:

  1. Point AWX to the RabbitMQ Cluster.
  2. Confirm that subsequent AWX containers/instances can use the same PostgreSQL DB.

Now you’ve mentioned “Note that each AWX worker needs to be added to the cluster individually and they’ll need their own queues/exchanges to negotiate work between each other.” By ‘their own’ do you mean a shared queue/exchange account on the RabbitMQ instance / cluster?

Where does AWX keep the RabbitMQ connection details after install? I’d like to just change that to point to the RabbitMQ cluster and see how that works.

My future upgrade path would be to install net-new instances of AWX from the latest git branch, migrate the DB to another on the same cluster, and start things up again. I’ll usually stick a VIP via HAProxy/Keepalived on top of the AWX instances and set a friendly name via IPA over that.
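As a rough illustration of that VIP layer (the hostnames, ports, and backend list here are placeholders, and this is a minimal sketch rather than a tuned config):

```
# /etc/haproxy/haproxy.cfg (fragment)
frontend awx_front
    bind *:80
    default_backend awx_back

backend awx_back
    balance roundrobin
    option httpchk GET /
    server awx1 awx-1.example.com:80 check
    server awx2 awx-2.example.com:80 check
    server awx3 awx-3.example.com:80 check
```

Keepalived then floats the VIP between the HAProxy nodes themselves, so the load balancer isn’t a single point of failure either.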

Cheers,
Tom

PS: FWIW, here’s the cluster setups I’ve used: http://mdevsys.com/wp/postgres-sql-ha-cluster-quick-start-guide/ , http://mdevsys.com/wp/install-rabbitmq-in-high-availability/

I notice the AWX task container already points to the AWX RabbitMQ Docker instance, like so:

RABBITMQ_ENV_RABBITMQ_DEFAULT_PASS=guest
RABBITMQ_ENV_RABBITMQ_DEFAULT_USER=guest
RABBITMQ_ENV_RABBITMQ_DEFAULT_VHOST=awx
RABBITMQ_ENV_RABBITMQ_ERLANG_COOKIE=cookiemonster
RABBITMQ_ENV_RABBITMQ_GITHUB_TAG=v3.7.4
RABBITMQ_ENV_RABBITMQ_GPG_KEY=0A9AF2115F4687BD29803A206B73A36E6026DFCA
RABBITMQ_ENV_RABBITMQ_HOME=/opt/rabbitmq
RABBITMQ_ENV_RABBITMQ_LOGS=-
RABBITMQ_ENV_RABBITMQ_SASL_LOGS=-
RABBITMQ_ENV_RABBITMQ_VERSION=3.7.4
RABBITMQ_HOST=rabbitmq
RABBITMQ_NAME=/awx_task/rabbitmq
RABBITMQ_PASSWORD=guest
RABBITMQ_PORT=5672
RABBITMQ_PORT_15671_TCP=tcp://172.17.0.2:15671
RABBITMQ_PORT_15671_TCP_ADDR=172.17.0.2
RABBITMQ_PORT_15671_TCP_PORT=15671
RABBITMQ_PORT_15671_TCP_PROTO=tcp
RABBITMQ_PORT_15672_TCP=tcp://172.17.0.2:15672
RABBITMQ_PORT_15672_TCP_ADDR=172.17.0.2
RABBITMQ_PORT_15672_TCP_PORT=15672
RABBITMQ_PORT_15672_TCP_PROTO=tcp
RABBITMQ_PORT_25672_TCP=tcp://172.17.0.2:25672
RABBITMQ_PORT_25672_TCP_ADDR=172.17.0.2
RABBITMQ_PORT_25672_TCP_PORT=25672
RABBITMQ_PORT_25672_TCP_PROTO=tcp
RABBITMQ_PORT_4369_TCP=tcp://172.17.0.2:4369
RABBITMQ_PORT_4369_TCP_ADDR=172.17.0.2
RABBITMQ_PORT_4369_TCP_PORT=4369
RABBITMQ_PORT_4369_TCP_PROTO=tcp
RABBITMQ_PORT_5671_TCP=tcp://172.17.0.2:5671
RABBITMQ_PORT_5671_TCP_ADDR=172.17.0.2
RABBITMQ_PORT_5671_TCP_PORT=5671
RABBITMQ_PORT_5671_TCP_PROTO=tcp
RABBITMQ_PORT_5672_TCP=tcp://172.17.0.2:5672
RABBITMQ_PORT_5672_TCP_ADDR=172.17.0.2
RABBITMQ_PORT_5672_TCP_PORT=5672
RABBITMQ_PORT_5672_TCP_PROTO=tcp
RABBITMQ_USER=guest
RABBITMQ_VHOST=awx

And the host is set using the ‘rabbitmq’ variable:

…/tools/docker-compose.yml: RABBITMQ_HOST: rabbitmq

Seems like changing this to my external host might do the trick? :wink: Haven’t had the time to look at where rabbitmq is set from.
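For what it’s worth, those RABBITMQ_* variables ultimately get assembled into a Celery-style AMQP broker URL of the form amqp://user:password@host:port/vhost. A small sketch of that assembly (the function name and values here are mine, not AWX’s; note the vhost segment must be percent-encoded, which matters for the default "/" vhost):

```python
from urllib.parse import quote

def amqp_url(user: str, password: str, host: str, port: int, vhost: str) -> str:
    """Build a Celery-style AMQP broker URL; credentials and vhost are percent-encoded."""
    return (f"amqp://{quote(user, safe='')}:{quote(password, safe='')}"
            f"@{host}:{port}/{quote(vhost, safe='')}")

# Pointing at an external cluster instead of the linked 'rabbitmq' container:
print(amqp_url("tower", "password", "rmq-c01", 5672, "tower"))
# → amqp://tower:password@rmq-c01:5672/tower
```

So swapping RABBITMQ_HOST (and the credentials/vhost to match your cluster) is exactly the knob that changes where this URL points.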

Cheers,
TK

Managed to disable the Docker RabbitMQ and point it to my RabbitMQ Cluster by making these changes:

  1. Delete all previous containers (i.e. clean up the environment).
  2. Remove all images.
  3. Stop Docker.
  4. Set the rabbitmq_host parameter in the installer inventory (in my case, rmq-c01).
  5. Execute: ansible-playbook -i inventory install.yml
  6. Start the GUI: http://awx-m01.nix.mds.xyz/

NOTE: I did not clear the DB. I wanted to see if it would pick up the earlier one the installer created. Need to test further.

```
[root@awx-m01 installer]# cat inventory | grep -v "#"
localhost ansible_connection=local ansible_python_interpreter="/usr/bin/env python"

[all:vars]

dockerhub_base=ansible

awx_task_hostname=awx
awx_web_hostname=awxweb
postgres_data_dir=/tmp/pgdocker
host_port=80

docker_compose_dir=/var/lib/awx

pg_hostname=psql-c01
pg_username=awx
pg_password=awxpass
pg_database=awx
pg_port=5432

rabbitmq_host=rmq-c01
rabbitmq_port=5672
rabbitmq_vhost=tower
rabbitmq_username=tower
rabbitmq_password='password'
rabbitmq_cookie=rabbitmqcookie

admin_user=admin
admin_password=password

create_preload_data=True

secret_key=awxsecret
[root@awx-m01 installer]#
```

```
[root@awx-m01 awx]# git status
On branch devel
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)
```

I tested the queue connectivity by running a receiver on AWX WEB and a sender on AWX TASK. Results:

```
[root@awx ~]# ./r-rmq.py
[*] Waiting for messages. To exit press CTRL+C
Received 'Blah! Blah! World!'
Received 'Blah! Blah! World!'
Received 'Blah! Blah! World!'
Received 'Blah! Blah! World!'
Received 'Blah! Blah! World!'
```

```
[root@awxweb awx]# ./s-mq.py
Sent 'Blah! Blah! World!'
[root@awxweb awx]# ./s-mq.py
Sent 'Blah! Blah! World!'
[root@awxweb awx]# ./s-mq.py
Sent 'Blah! Blah! World!'
[root@awxweb awx]#
```

Still need more testing to be sure that it’s 100%.

Hi Tom,

I am trying to get clustering working with Docker as well, just that I don’t need an external RabbitMQ; I am trying to get the rabbitmq containers on each node into the same cluster. I managed to get awx_web and awx_task into the cluster. I am wondering how you installed it. In my docker-compose file I do not see that many variables; I see just:

```
RABBITMQ_USER: guest
RABBITMQ_PASSWORD: guest
RABBITMQ_HOST: rabbitmq
RABBITMQ_PORT: 5672
RABBITMQ_VHOST: awx
```

Hi Tom,

I’m trying to replicate what you did as a test: just connecting 2 containers to a single RabbitMQ host (I’ll set up a clustered RabbitMQ for production later). When spinning up the AWX web container I get an error:

```
fatal: [localhost]: FAILED! => {"changed": false, "msg": "Error creating container: 500 Server Error: Internal Server Error ("{"message":"Could not get container for rabbitmq"}")"}
```

I believe I modified all of the files as specified above, do you have any idea what might be causing this error?

I will soon be writing up a playbook for building HA on the latest AWX version, since there are so many threads here related to this and most of the information is scattered around. Here is everything we need for HA on the latest AWX version (v3.0.1) in Docker on Linux:

  • Make the DB centralized, either as a Docker container, a PaaS service, or an external instance.
  • Build a RabbitMQ cluster across all nodes and get rid of the rabbitmq container deployment.
  • Update the necessary user permissions and policies on the RabbitMQ cluster.
  • Build a custom image with the appropriate changes for the awx_web and awx_task containers, run it with Docker, and point each to the cluster member running on its respective node. There are no changes to the memcached container.

PS: We don’t have to make any changes for the Celery worker in the latest AWX, since it is able to pick up and execute jobs on the nodes we define under an instance group.

Referring to the latest Ansible Tower installation roles helps in understanding how all of this works in the Enterprise version but not in AWX. It is a hidden gem, and we need to explore it on our own.

These links help if someone wishes to explore further:

https://docs.ansible.com/ansible-tower/latest/html/administration/clustering.html#job-runtime-behavior

https://releases.ansible.com/ansible-tower/setup/?extIdCarryOver=true&sc_cid=701f2000001OH7YAAW

https://groups.google.com/forum/m/#!topic/awx-project/-YABJ1hA3XI

Stay tuned.