The image shows that RabbitMQ runs on each node and provides the cluster functionality. With AWX you then have Docker containers on each node running the awx_web, awx_task and memcache images. On each node, settings.py is configured so that rabbitmq_host points to that machine itself. Meaning: with two nodes, node1 and node2, settings.py on node1 has rabbitmq_host pointing to node1, and settings.py on node2 has it pointing to node2. The same goes for the celery worker. That is why we added this cluster_node inventory variable. We keep the installation directory on each separate node, set cluster_node to that machine's name and run the installation playbook on each node.
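To make that concrete, here is a minimal sketch of what the per-node installer inventory could look like. The host name and the exact variable layout are placeholders; cluster_node is the variable we added:

```
# installer inventory used on node1 (the installer is run separately on each node)
localhost ansible_connection=local

[all:vars]
cluster_node=node1
# the RabbitMQ-related variables are set so that rabbitmq_host in settings.py
# resolves to node1 itself; on node2 the same inventory uses cluster_node=node2
```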
The cluster works like this: when the capacity of one node is full, the next triggered job is started on the next node in the cluster with free capacity. A failover of a running job like in your example will not work, because the playbook run is driven by one worker process on one node. When you restart that node, the worker process is lost and the job with it.
The page I linked gives a good overview of what the cluster functionality of Ansible Tower/AWX provides and how it works. For our setup I basically re-engineered that setup in AWX.
It’s probably worth pointing out that clustered AWX is now supported on OpenShift and Kubernetes without needing to hand-roll your own solution.
I did take your recommendation and updated all the config files (settings.py and the others).
I’m looking at the Tower HA configuration. /api/v2/ping/ shows both nodes within the instance group. In AWX it lists awx within instance_group, and ps -ef | grep celery shows celery@awx (on both nodes). I think a working setup should show celery@<hostname> instead.
I think you are on the right path if the API already shows both nodes within the instance group.
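A quick way to double-check on each node (the host name below is just a placeholder):

```
# /api/v2/ping/ lists the registered instances and their instance groups
curl -s -k https://node1/api/v2/ping/ | python -m json.tool

# the worker should be named after the node, e.g. celery@node1,
# not celery@awx or celery@localhost
ps -ef | grep 'celery@'
```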
There is this file:
image_build/files/supervisor_task.conf
in the installation's image_build role. If you adjust the celery worker command there to:

command = /var/lib/awx/venv/awx/bin/celery worker -A awx -l DEBUG --autoscale=50,4 -Ofair -Q tower_scheduler,tower_broadcast_all,tower,%(ENV_CLUSTER_NODE)s -n celery@%(ENV_CLUSTER_NODE)s
then ps -ef | grep celery should also show the correct output. Please be aware of the ENV_ prefix in front of the environment variable; supervisord needs it to read environment variables.
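For reference, a trimmed-down sketch of the resulting [program:...] block; the program name and the surrounding options are illustrative, only the command line itself matters here:

```
[program:celeryd]
command = /var/lib/awx/venv/awx/bin/celery worker -A awx -l DEBUG --autoscale=50,4 -Ofair -Q tower_scheduler,tower_broadcast_all,tower,%(ENV_CLUSTER_NODE)s -n celery@%(ENV_CLUSTER_NODE)s
autostart = true
autorestart = true
stopwaitsecs = 5
```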
It seems the variable is not being passed correctly in the supervisor_task.conf file: ...tower_broadcast_all,tower,%(ENV_CLUSTER_NODE)s -n celery@%(ENV_CLUSTER_NODE)s.
celeryd actually starts up as ...,tower_broadcast_all,tower,awx -n celery@localhost.
In Tower it is ...tower_broadcast_all,tower,hostname -n celery@hostname.
Within the inventory I tried cluster_node with and without the FQDN, and also both cluster_node and CLUSTER_NODE.
This was set for the awx task and web containers, providing the cluster_node variable to the environment, so that supervisor_task.conf can read it inside the container.
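If you start the containers by hand instead of through the installer role, the equivalent is simply passing the variable into the container environment, for example (image names, tag and node name are only examples, and the `...` stands for the rest of your existing options such as ports, links and volumes):

```
docker run -d --name awx_task -e CLUSTER_NODE=node1 ... ansible/awx_task:latest
docker run -d --name awx_web  -e CLUSTER_NODE=node1 ... ansible/awx_web:latest
```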
Hi Phillipp - can you tell me what version/tag you used for your setup? I wanted to see if I can reproduce it. I had made exactly the changes you described: separating out RabbitMQ (2 nodes) and installing the containers (awx/task/memcache) with the modified configuration on the same hosts running RabbitMQ, with PostgreSQL on a different node. So in all a 3-server configuration. Thanks
I am having trouble finding the parts of the file you edited to accomplish a multiple-instance (HA) installation. The current version is 3.0.1.
Which command did you replace in supervisor_task.conf with "command = /var/lib/awx/venv/awx/bin/celery worker -A awx -l ERROR --autoscale=50,4 -Ofair -Q tower_scheduler,tower_broadcast_all,tower,%(ENV_CLUSTER_NODE)s -n celery@%(ENV_CLUSTER_NODE)s"?
In addition, I do not see any references to rabbitmq in main.yml or in set_image.yml (the only task file imported in main.yml) under local_docker/tasks/.
This particular change related to celery isn't needed for the latest AWX version 3.0.1, as it automatically picks up and schedules jobs under whatever instance group is defined there.
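If you want to double-check that on 3.0.1, listing the registered instances from inside the task container should show both nodes grouped by instance group; the exact output format can vary between versions:

```
# run inside the awx_task container
awx-manage list_instances
```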