Scaling AWX in a docker-compose/rancher installation

Hello,

As we were using Rancher to manage our dockerized services, we also used it to set up AWX. Fortunately there is an AWX template in the Rancher community catalog which, as far as I can tell, is mainly based on the docker-compose installation from the AWX repository. That worked OK for running AWX with a single instance.

Now, as our infrastructure grows, and with it the number of playbooks that have to be executed continuously, we would also like to scale our AWX deployment by adding new instances. The first thing to find out is what an instance actually is; I did not find anything about it in the docs. On a test Kubernetes cluster we saw that scaling up duplicates the whole pod containing awx_task, awx_web, memcached and rabbitmq, so I assumed that these 4 components form an instance. I then tried to replicate the same setup using Rancher. It worked insofar as I could see a new instance being registered in the instance groups menu. But when I actually try to execute a playbook on it, it seems to interact with the existing instance. A template assigned to a separate instance group containing only the new instance either does not start at all, or stays in pending state for a while and then fails with the error message “Task was marked as running in Tower but was not present in the job queue, so it has been marked as failed.”

The new instance we have set up has a separate memcached, separate awx_task and awx_web containers and a separate RabbitMQ. It only shares the database with the existing first instance.

So my questions are now:

  1. Is it correct that these 4 separate components (awx_task, awx_web, memcached and rabbitmq) are needed to form a new instance?
  2. How does the new instance have to be configured to work properly alongside the existing instance, without the two interfering with each other?

One of my assumptions was that it has something to do with the SYSTEM_UUID, but when scaling up the pod in Kubernetes I noticed that both instances have the UUID 00000000-0000-0000-0000-000000000000, so it must be something else?

Is anybody able to help here who maybe has a properly scaling deployment using Rancher?

Thanks,

Dirk

I had the same problem. AWX does not support scaling with Docker, only with Kubernetes or OpenShift.
An instance in the way you describe it is awx_task, awx_web and, I think, memcached too. But rabbitmq needs to be a cluster, and the DB needs to be the same one for all the instances.

I have implemented a solution for AWX HA (Instance Group) via a playbook. I am going to publish the code in my repo: https://github.com/sujiar37. It has been tested with the latest AWX, v4.0.0, and is working well.

This is what I implemented:

Here is the inventory we populate with Ansible to add ‘n’ hosts to the cluster. All we need to do is add the new host machine’s IP address under [awx_instance_group_agent] and then execute the playbook; after that, the new VM is automatically added to the cluster.

[awx_instance_group_master]
HostA

[awx_instance_group_agent]
HostB
HostC
HostD

Stay tuned and I will update here once it has been published.

Regards,
Sujith

Thanks, I’m looking forward to seeing more :-)

Hi,

Are the workers showing up, and can jobs actually be distributed onto them?

Yes, they are. You can find that information in my repository, or check this out: https://github.com/ansible/awx/issues/3627

Regards,
Sujith

Yeah I had read over your git page but I didn’t see any screenshots of attached workers.

The install seems like it was successful but I cannot tell because I am getting “A server error has occurred.” when I load the AWX page. However, I do not think this is an issue with your playbook because I think I got that when I ran the “official” playbook from the AWX github for 4.0.0. 3.0.1 is running fine.

I am testing on a newly spawned CentOS 7.6 vagrant cluster, 1 pg database, 1 master, and 2 “worker” nodes, all in the same subnet.

I can’t confirm whether this was successful or not because of the above error. It looks promising though. A couple of pieces of general feedback:

  1. You are assuming Red Hat for the install; change that to allow CentOS, AWS and Red Hat, e.g. listed in a dictionary. I just removed the when condition in the playbook that loads the roles.
  2. I am testing on Vagrant boxes loaded on my local laptop, so there is no DNS. I had to add the name ↔ IP mapping to /etc/hosts on each machine. It would be cool if the playbook could detect that and add the entries to /etc/hosts, or use IPs when name resolution doesn’t work. (Not required, just cool.)
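As a rough illustration of the /etc/hosts idea in point 2, here is a minimal Python sketch (not part of the playbook) that merges a name-to-IP mapping into hosts-file content in such a way that repeated runs do not create duplicate entries. The node names and addresses are made up for the example, and the function only builds the text; writing it to /etc/hosts on each machine is left out.

```python
def merge_hosts(existing: str, mapping: dict) -> str:
    """Return hosts-file content with one 'IP<TAB>name' line per mapping entry.

    Lines that already name a host from `mapping` are dropped and re-added
    with the current address; every other line is kept unchanged.
    """
    kept = []
    for line in existing.splitlines():
        # Ignore trailing comments; fields[0] is the address, the rest are names.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and any(name in mapping for name in fields[1:]):
            continue  # stale entry for a managed host
        kept.append(line)
    kept.extend(f"{ip}\t{name}" for name, ip in sorted(mapping.items()))
    return "\n".join(kept) + "\n"


# Hypothetical node names and addresses for a small Vagrant cluster.
NODES = {"awx-master": "192.168.50.10", "awx-worker1": "192.168.50.11"}

if __name__ == "__main__":
    print(merge_hosts("127.0.0.1 localhost\n", NODES), end="")
```

Because entries for managed hosts are replaced rather than appended, feeding the output back in as `existing` returns the same text.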

As soon as I figure out why I am getting an error loading the AWX login page I will let you know if it all worked. I really like what you did here and the effort you put into this, this looks awesome. THANK YOU. This will be a godsend in the future…just have to figure out that main page issue.

Thank you for your words and suggestions.

The server error usually happens when the task / web containers are not able to connect to the PG DB correctly. It is worth checking the container logs to see why it is occurring. Make sure that you have a dedicated PG node. This playbook has already been tested on Red Hat physical machines and is working as expected.
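Besides reading the container logs, a quick first check is whether the database port is reachable at all from where the containers run. The following is only a sketch using Python’s standard library; it tests plain TCP reachability and says nothing about credentials, permissions or the schema.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and failed name resolution.
        return False
```

For example, `can_connect("pg-node", 5432)` would test PostgreSQL’s default port on a host called pg-node; that host name is a placeholder, not something from the playbook.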

I would love contributions, and it is best to raise suggestions/issues in my repository rather than here; we can pick them up from there and make improvements. By the way, I am trying to set up CI in that repository with some priority, so that it helps people contribute and see whether the build succeeds for any PR.

Regards,
Sujith

I figured out why my installation isn’t working: the install script is not creating the schema in the external DB I have defined. It DOES have connectivity and can log in; I logged in manually from the hosts and from the containers. The awx user on the database has rights to create databases and schemas. I created the db awx_4_0_0 on the host as the ‘awx’ user, with the public schema owned by ‘awx’, and put those details into the inventory file.

I’ve often used external DBs with AWX so I’m not sure what the issue is. I blew away all of the containers and ran the playbook again, with the same result. I’ll keep trying and post back if I can figure out why this is failing. I don’t see the installation playbook throwing errors at any point.

I believe you were using Vagrant to test the AWX HA playbook. If you could share the following info, I can help a bit and verify what could be causing this (I will change the values accordingly for my own testing):

  • Hosts information (inventory/hosts)
  • Group variables (inventory/group_vars/all.yml)
  • Vagrantfile (config for the Vagrant test setup)

Recently I added support for enabling multiple web front ends, which is mentioned in my repository ( https://github.com/sujiar37/AWX-HA-InstanceGroup#points-to-remember ). I am quoting it here as well in case it went unnoticed:

“One cool feature is, you can always perform plug and play with the hosts by using these two awx_instance_group_web & awx_instance_group_task inventory groups. It is all about your desire how many web nodes and task nodes you would like to have since HA doesn’t require to run AWX web container in all nodes.”

Regards,
Sujith

Hi,

I appreciate the help. I think this is centered more around AWX 4 than around your playbook to cluster it. Default AWX 4 (release zip) won’t come up using either the default settings or an external DB. I set up a new 3.0.1 installation on a new CentOS 7 server pulled from Vagrant Cloud.

3.0.1 installed fine. I then deleted the docker containers, entered the DB information for the VM running pgsql and it wouldn’t work. Then with a newly spawned clean VM tried to install AWX 4 with the default settings (containerized pgsql db), with the exact same errors and results.

However, the DB doesn’t seem to be the root cause. I noticed that the docker-compose pip module is required for AWX 4, where it wasn’t required before. In addition, the /etc/hosts files inside the containers differ between 3.0.1 and 4.0.0.

These are the same errors I got when I ran your HA installer playbook, so I think you’re off the hook here :-).

errors:

```
127.0.0.1 | FAILED! => {
    "changed": false,
    "msg": "argument port is of type <type 'str'> and we were unable to convert to int: invalid literal for int() with base 10: ''"
}
Using /etc/ansible/ansible.cfg as config file
127.0.0.1 | FAILED! => {
    "changed": false,
    "elapsed": 300,
    "msg": "Timeout when waiting for :11211"
}
Using /etc/ansible/ansible.cfg as config file
127.0.0.1 | FAILED! => {
    "changed": false,
    "elapsed": 300,
    "msg": "Timeout when waiting for :5672"
}
PermissionError: [Errno 13] Permission denied: '/etc/tower/conf.d/credentials.py'
Traceback (most recent call last):
  File "/usr/bin/awx-manage", line 11, in <module>
    load_entry_point('awx==4.0.0.0', 'console_scripts', 'awx-manage')()
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/__init__.py", line 124, in manage
    prepare_env()
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/__init__.py", line 89, in prepare_env
    if not settings.DEBUG: # pragma: no cover
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/conf/__init__.py", line 56, in __getattr__
    self._setup(name)
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/conf/__init__.py", line 41, in _setup
    self._wrapped = Settings(settings_module)
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/conf/__init__.py", line 110, in __init__
    mod = importlib.import_module(self.SETTINGS_MODULE)
  File "/usr/lib64/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 678, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/awx/settings/production.py", line 85, in <module>
    include(settings_file, optional(settings_files), scope=locals())
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/split_settings/tools.py", line 101, in include
    with open(included_file, 'rb') as to_compile:
PermissionError: [Errno 13] Permission denied: '/etc/tower/conf.d/credentials.py'
2019-04-09 18:37:28,022 INFO exited: dispatcher (exit status 1; not expected)
2019-04-09 18:37:28,040 INFO gave up: dispatcher entered FATAL state, too many start retries too quickly
2019-04-09 18:37:28,041 INFO exited: callback-receiver (exit status 1; not expected)
2019-04-09 18:37:29,043 INFO gave up: callback-receiver entered FATAL state, too many start retries too quickly
```

I’ll do some more digging and post an issue on the official AWX GitHub page.
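One possible reading of the first three failures is that the host is empty in “Timeout when waiting for :11211” and “:5672”, and that a port variable was an empty string in the int() conversion error. Under that assumption, a small pre-flight check like the Python sketch below could flag such values before the installer starts waiting. The service names and the values in SETTINGS are hypothetical and only mirror the messages above; this is not how the AWX installer itself validates its inventory.

```python
def check_endpoint(name: str, host, port) -> list:
    """Return a list of human-readable problems for one host/port pair."""
    problems = []
    if not str(host or "").strip():
        problems.append(f"{name}: host is empty")
    try:
        number = int(port)
    except (TypeError, ValueError):
        problems.append(f"{name}: port {port!r} is not an integer")
    else:
        if not 1 <= number <= 65535:
            problems.append(f"{name}: port {number} is out of range")
    return problems


# Hypothetical values mirroring the failures: empty hosts, one empty port.
# 11211 and 5672 are the default memcached and RabbitMQ (AMQP) ports.
SETTINGS = {
    "unknown_service": ("", ""),
    "memcached": ("", 11211),
    "rabbitmq": ("", 5672),
}

if __name__ == "__main__":
    for name, (host, port) in SETTINGS.items():
        for problem in check_endpoint(name, host, port):
            print(problem)
```

The PermissionError on /etc/tower/conf.d/credentials.py is a separate matter that this check does not cover.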

Hi,

I have installed the AWX package on a Red Hat 7 server, but I am still unable to see the GUI / web interface through its IP address.

Could anyone help with this, please?

I applied a bit of customization on top of the official images to bring up HA in AWX 4.0.0; it is in the templates directory. The error you get is more related to the official images, since in 4.0.0 they mount all sensitive content into the containers. I personally don’t use that, and it isn’t covered in my playbook either. Here is the commit with the details: https://github.com/ansible/awx/commit/2b6cf971573185a46950c5a8fa3f9de14ede38ae

And this is what I am using for both web and task containers. I will see if I can make a demo video of an entire run of my playbook: bringing up HA and running a job in a distributed manner using the RabbitMQ cluster. I will update soon.


---
# Web nodes build both images: awx_web_ha and awx_task_ha.
- name: Build Docker Images in Primary Node
  docker_image:
    path: "/var/lib/awx/build_image"
    dockerfile: "{{ item.file }}"
    name: "{{ item.name }}"
    tag: "{{ item.version }}"
    force: yes
  with_items:
    - { file: Dockerfile, name: awx_web_ha, version: "{{ awx_web_tag }}" }
    - { file: Dockerfile.task, name: awx_task_ha, version: "{{ awx_task_tag }}" }
  when: inventory_hostname in groups['awx_instance_group_web']

# All other nodes only build the task image.
- name: Build Docker Images in Agent Nodes
  docker_image:
    path: "/var/lib/awx/build_image"
    dockerfile: "{{ item.file }}"
    name: "{{ item.name }}"
    tag: "{{ item.version }}"
    force: yes
  with_items:
    - { file: Dockerfile.task, name: awx_task_ha, version: "{{ awx_task_tag }}" }
  when: inventory_hostname not in groups['awx_instance_group_web']

Hi,

Could anyone help with the below?

I tried to install AWX as follows:

git clone https://github.com/ansible/awx.git
# ansible-playbook -i inventory install.yml

But when I put the IP address into the browser as the URL, nothing is displayed. Could you please help me troubleshoot this?

Thanks.

Regards,
Gaurav
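For a “nothing is displayed” situation like this, a first step is to find out whether anything answers on the web port at all. Below is a minimal Python sketch using only the standard library; the URL to pass in depends on the host and port the installation was configured with, which the post does not state.

```python
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 5.0) -> str:
    """Describe what happens when the URL is requested."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return f"HTTP {response.status}"
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 502.
        return f"HTTP {err.code}"
    except (urllib.error.URLError, OSError) as err:
        # Nothing answered: wrong port, firewall, or the container is down.
        return f"no response: {err}"
```

A “no response” result points towards the containers, port mapping or firewall (`docker ps` on the host shows whether the web container is running and which ports are published), whereas an HTTP error status means the web server is up and the container logs are the next place to look.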