Multiple instances for AWX installation (HA)

How can I install a highly available version of AWX? I want to test a failover scenario where the current installation server is down but I can still access another instance.

Any tips?

I have been working on an OpenShift/Kubernetes-based scalable system under a branch called scalable_clusters on the AWX GitHub repo.

I’m afraid traditional redundancy is not a focus at the moment.

Is it possible to install two versions and point them to the same DB instance?

Bruno Casano (brunito@gmail.com) said:

Is it possible to install two versions and point them to the same DB
instance?

Same database *server*? Yes. Same AWX database *on* the server? Absolutely not.

Bill

Thanks Bill!

Is it stable and can be tested now?

We are running AWX in a clustered HA environment, but this required some manual adjustments to the installation roles. You also need to create a RabbitMQ cluster. To do this, we disabled the RabbitMQ containers in the installation and set RabbitMQ up beforehand on all the nodes. Once the RabbitMQ cluster was running, we changed the RabbitMQ connection details in the roles.

The following files were changed:

image_build/files/launch_awx_task.sh

awx-manage provision_instance --hostname=$CLUSTER_NODE

awx-manage register_queue --queuename=tower --hostnames=$CLUSTER_NODE

image_build/files/settings.py

CLUSTER_HOST_ID = os.getenv("CLUSTER_NODE", "awx")

image_build/files/supervisor_task.conf
command = /var/lib/awx/venv/awx/bin/celery worker -A awx -l ERROR --autoscale=50,4 -Ofair -Q tower_scheduler,tower_broadcast_all,tower,%(ENV_CLUSTER_NODE)s -n celery@%(ENV_CLUSTER_NODE)s

local_docker/tasks/main.yml
comment out every RabbitMQ container reference

- name: Activate AWX Web Container
...

env:
      CLUSTER_NODE: "{{ cluster_node | default('localhost') }}"
      ...
      RABBITMQ_USER: "awx"
      RABBITMQ_PASSWORD: "<password>"
      RABBITMQ_HOST: "{{ cluster_node | default('localhost')}}"
      RABBITMQ_PORT: "5672"
      RABBITMQ_VHOST: "awx"

The same was changed for the AWX Task container.

An HAProxy instance runs in front of the nodes with round-robin load balancing.
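To illustrate how the single CLUSTER_NODE value ties these changes together, here is a minimal Python sketch. It only mimics the `os.getenv` fallback from settings.py and the `%(ENV_CLUSTER_NODE)s` expansion from supervisor_task.conf; it is not AWX code. The function names and the example hostname are hypothetical.

```python
import os


def cluster_host_id(environ=os.environ):
    """Mirror the settings.py change: fall back to "awx" when CLUSTER_NODE is unset."""
    return environ.get("CLUSTER_NODE", "awx")


def celery_worker_args(cluster_node):
    """Build worker arguments the way supervisor would expand %(ENV_CLUSTER_NODE)s."""
    # Each node listens on the shared queues plus one queue named after itself.
    queues = ["tower_scheduler", "tower_broadcast_all", "tower", cluster_node]
    return [
        "celery", "worker", "-A", "awx", "-l", "ERROR",
        "--autoscale=50,4", "-Ofair",
        "-Q", ",".join(queues),
        "-n", "celery@{}".format(cluster_node),
    ]


if __name__ == "__main__":
    node = "awx-node1.example.com"  # hypothetical hostname
    print(cluster_host_id({"CLUSTER_NODE": node}))
    print(" ".join(celery_worker_args(node)))
```

If CLUSTER_NODE falls back to a default on every node, all workers would end up with the same name and the same per-node queue, so the value presumably needs to be unique per cluster member.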

Hi Philipp, I’m currently looking into your solution of separating the RabbitMQ container out from the installer. How do you handle the Postgres DB? Do you have it installed separately as well, or within a container?

Phillipp, can you comment on where you specify the CLUSTER_NODE name? I’m getting errors connecting to localhost: amqp://awx:**@127.0.0.1:5672/awx

My HA solution for AWX may be extreme, but here is what I am doing.

  1. HAProxy load balancing RabbitMQ, Memcached and PostgreSQL
  2. RabbitMQ cluster with 3 nodes
  3. Memcached on 3 nodes, configured active/passive on HAProxy
  4. PostgreSQL HA (master/slave) using Patroni, active/passive with automatic failover configured on HAProxy
  5. AWX task/web Docker instances

I am using only a single instance each of the awx-task and awx-web containers, and everything seems to be working fine.

I would like to test multiple load-balanced awx-task/awx-web containers.

dnc92301:

Postgres is running on a separate node; we set the Postgres DB connection details in the installation inventory for the application nodes. On those nodes only awx-task, awx-web and memcached run in containers.
We added the CLUSTER_NODE variable to the installation inventory ourselves and set it to the hostname of the machine on which the playbook is executed. We copy the modified AWX installation directory to a new node, set the correct CLUSTER_NODE variable and run the installation playbook, which then sets up a new cluster member.
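As a rough illustration of that per-node step, the Python sketch below derives a value for the inventory variable from the local hostname. It assumes an INI-style inventory that accepts `name=value` assignments and assumes the lowercase name `cluster_node`, as used in the templates in this thread; the function name and the example hostname are hypothetical.

```python
import socket


def inventory_line(hostname=None):
    """Return an inventory assignment for this node's cluster_node variable."""
    # Fall back to the local fully qualified domain name when none is given.
    name = hostname or socket.getfqdn()
    return "cluster_node={}".format(name)


if __name__ == "__main__":
    # Run on the node being added, then paste the line into its inventory copy.
    print(inventory_line())
```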

Hi Phillipp

I still couldn’t get it to work. It would be great if we could work on resolving this separately. I have RabbitMQ installed separately, as well as Postgres. I’m getting a bunch of errors:

INFO success: awx-celeryd-beat entered RUNNING state, process has stayed up for > than 1 seconds
INFO exited: channels-worker (exit status 1; not expected)

I think the problem has to do with celeryd.

ps -ef | grep celery shows celery@localhost, where it should be mapped to celery@hostname (as on a working Tower installation).

Thanks again

The earlier issue was due to a misconfiguration on the external database server, which has been fixed.

I’ve got 2 AWX instances running but I’m getting connection refused. Consumer: cannot connect to amqp://awx:**@127.0.0.1:5672/awx [Errno 111] Connection refused.

Hi dnc92301,

Sorry for my late response, somehow the notification is not working properly. The issue is probably that the RABBITMQ_HOST environment variable inside the container is not set properly. At the moment it tries to connect to your local container network. As RabbitMQ has been moved out of the container context, you need to set the RABBITMQ_HOST environment variable to the host where RabbitMQ is running. We have set it to the FQDN of the RabbitMQ host.

In the local_docker role local_docker/tasks/main.yml you can set those environment variables like this:

env:
      CLUSTER_NODE: "{{ cluster_node | default('localhost') }}"
      ...
      RABBITMQ_USER: "awx"
      RABBITMQ_PASSWORD: "<password>"
      RABBITMQ_HOST: "{{ cluster_node | default('localhost')}}"
      RABBITMQ_PORT: "5672"
      RABBITMQ_VHOST: "awx"

You have to set them for both the awx_web and awx_task containers. If you have any further issues, let me know.
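To see why an unset variable leads to a connection attempt against the container itself, here is a small Python sketch. It assembles a broker URL from the RABBITMQ_* variables and checks whether the host and port are reachable. The defaults mirror the template above, not AWX's real settings code, and the function names are hypothetical.

```python
import socket


def broker_url(env):
    """Assemble an AMQP URL from RABBITMQ_* variables, masking the password."""
    return "amqp://{user}:**@{host}:{port}/{vhost}".format(
        user=env.get("RABBITMQ_USER", "awx"),
        # Assumed default: same fallback as the template's default('localhost').
        host=env.get("RABBITMQ_HOST", "localhost"),
        port=env.get("RABBITMQ_PORT", "5672"),
        vhost=env.get("RABBITMQ_VHOST", "awx"),
    )


def can_connect(host, port, timeout=3.0):
    """Return True if a plain TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


if __name__ == "__main__":
    import os
    print(broker_url(os.environ))
    host = os.environ.get("RABBITMQ_HOST", "localhost")
    print(can_connect(host, os.environ.get("RABBITMQ_PORT", "5672")))
```

Running this inside the awx_task container should show quickly whether the variable arrived and whether the RabbitMQ port is reachable from there.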

Thanks Phillipp,

I see two references to cluster_node within env. Do both need to be set, as in “cluster_node = hostname.fqdn”? And does this need to be set in the inventory file?

env:
      CLUSTER_NODE: "{{ cluster_node | default('localhost') }}"
      ...
      RABBITMQ_USER: "awx"
      RABBITMQ_PASSWORD: "<password>"
      RABBITMQ_HOST: "{{ cluster_node | default('localhost')}}"
      RABBITMQ_PORT: "5672"
      RABBITMQ_VHOST: "awx"

Yes, we have set this in the inventory file.

Hi Phillipp, can you provide the specific tag where you had HA working? Is it the latest 1.0.4.*, or a previous release?

Thanks again.

Here’s the error msg I’ve been getting -

[2018-03-08 03:17:34,613: ERROR/MainProcess] Unrecoverable error: AccessRefused(403, u"ACCESS_REFUSED - access to exchange 'celeryev' in vhost 'awx' refused for user 'awx'", (40, 10), 'Exchange.declare')

Hi dnc92301,

We are currently using release 1.0.2, but I think it should also work with a later release on a fresh installation. The error you got looks like an issue connecting to RabbitMQ. Have you set up the awx user in your RabbitMQ cluster?

We have set up the RabbitMQ Cluster with the following commands:

[root@host rabbitmq]# rabbitmqctl delete_user guest
[root@host rabbitmq]# rabbitmqctl add_vhost awx
[root@host rabbitmq]# rabbitmqctl add_user awx <password>
[root@host rabbitmq]# rabbitmqctl set_permissions -p awx awx ".*" ".*" ".*"
[root@host rabbitmq]# rabbitmqctl set_policy -p awx ha-all ".*" '{"ha-mode":"all","ha-sync-mode":"automatic"}'
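The three ".*" arguments to set_permissions are regular expressions for configure, write and read access on the vhost. Declaring an exchange such as celeryev requires configure permission, so a user whose configure pattern does not match it gets ACCESS_REFUSED. The Python sketch below is only a simplified model of that check: RabbitMQ uses its own regex engine, and the function name is hypothetical.

```python
import re


def permits(pattern, resource):
    """Approximate a RabbitMQ permission check for one resource name."""
    # RabbitMQ treats an empty pattern like '^$', which grants nothing.
    if pattern == "":
        pattern = "^$"
    return re.search(pattern, resource) is not None


if __name__ == "__main__":
    print(permits(".*", "celeryev"))      # pattern used in the commands above
    print(permits("", "celeryev"))        # a user with no configure permission
    print(permits("^tower", "celeryev"))  # a pattern restricted to tower* names
```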

Hi Phillipp ,

Yes, it looks like it’s working somewhat now after setting the proper permissions for the awx user. However, how do you ensure the cluster is configured correctly? Under instance groups, do you see both nodes within the cluster? Right now I only see 1 node. I’ve tried kicking off a job and then rebooting the server in the middle of the run. Does the job fail over automatically to the other node within the cluster?

Thanks