We are trying to prove out/confirm that Ansible (via AWX/Tower) can replace our current drift management system. Technically it seems very likely, as the functionality is there. One of the hurdles we are struggling with is the scalability aspect and the time it takes. We see comments/posts where people “manage 1000’s of machines” (or 10’s of 1000’s of network devices). The simple word “manage” can mean different things to different people though, and the posts we have seen so far do not explain the details of what “manage” meant to them.
The question is: Is anyone using AWX/Ansible for drift management in the manner we are attempting to, or have we fallen off our rocker? If you are, a follow-up would be: what did it take to get it to work?
Where we are coming from:
The full policy is run every 15 minutes on every server
There are about 6000 servers
There are 170 “rules” in the current policy
The rules are sometimes very granular (ex: “maintain this configuration line/item”)
The rules may be broad (ex: “maintain these 10 configuration aspects which somehow are loosely coupled to SSH”)
Going forward with Ansible:
Frequency could be backed off to once an hour
The 6000 servers still all apply
Because the current rules include multiple items/aspects, we expect the number of “rules” to double
Some/many of the “rules” will likely require multiple tasks in Ansible
Not all rules apply to every server; there are many conditions (server OS, server OS major version, server tier, server location, etc.)
With all of this, I’d expect 400-500 tasks when we are finally done. We are starting with testing the concept using AWX; TBD whether the actual implementation would be AWX or Tower
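As a concrete sketch of how such conditional rules might look in Ansible (the group name, path, and values below are invented for illustration, not the actual policy), each condition maps onto a when clause or an inventory group:

```yaml
# Hypothetical rule: only enforced on RedHat-family servers at major
# version 7+ that belong to a made-up "tier_web" inventory group.
- name: Maintain PermitRootLogin setting in sshd_config
  lineinfile:
    path: /etc/ssh/sshd_config
    regexp: '^#?PermitRootLogin'
    line: 'PermitRootLogin no'
  when:
    - ansible_os_family == "RedHat"
    - ansible_distribution_major_version | int >= 7
    - "'tier_web' in group_names"
```

OS and version conditions rely on facts, while tier/location conditions are often cleaner as inventory group membership.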
The contention: it simply takes a while to complete a single task across all of the servers. We know that a high level of parallelism is going to be needed, but that doesn’t seem to be enough by itself.
Does Ansible, without AWX, meet your timing requirements? Tower is going to have the runtime of Ansible + some overhead (< 2x).
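For context on the plain-Ansible side, most of the parallelism tuning lives in ansible.cfg; a sketch with illustrative values (not recommendations):

```ini
# ansible.cfg - parallelism knobs relevant to wide runs
[defaults]
forks = 100            # default is only 5; wide runs need far more
gathering = explicit   # skip fact gathering unless a play asks for it

[ssh_connection]
pipelining = True      # fewer SSH round-trips per task
ssh_args = -o ControlMaster=auto -o ControlPersist=60s
```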
There is an exception, where AWX/Tower clustering can help you scale horizontally. The clustering feature + workflows can be used to split inventory across multiple jobs. You can then use workflows to easily run those jobs in parallel across multiple servers.
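One way to sketch that split is with Ansible’s index-range host patterns (e.g. all[0:1999], which are inclusive ranges). The helper below is made up for illustration; a submitting playbook or script could feed each pattern to a separate job as a limit:

```python
# Hypothetical helper: partition an inventory of host_count hosts into
# index-range host patterns that Ansible understands, one per shard.
def shard_patterns(host_count: int, shards: int) -> list[str]:
    size = -(-host_count // shards)  # ceiling division
    patterns = []
    for i in range(shards):
        start = i * size
        # Ansible pattern ranges are inclusive, hence the -1.
        end = min(start + size, host_count) - 1
        if start > end:
            break
        patterns.append(f"all[{start}:{end}]")
    return patterns

print(shard_patterns(6000, 3))
# → ['all[0:1999]', 'all[2000:3999]', 'all[4000:5999]']
```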
The splitting/sharding concept is one we talked about a little (a playbook uses the API to create the sub-inventories and submit jobs against them), but it is kludgy of course and I am not sure how reusable it would have been. This feature hits it on the head though. The forks + task methodology makes it hard without something like this (slow hosts stall the entire fork group; among 6K hosts, not all of them are super fast).
We saw a little talk about clustering AWX, and outside of OpenShift it didn’t look like it was supported/existed. We’ve run into a few gotchas when trying to do it, though some of the pointers in the one HA thread here look to be helpful. Is AWX clustering official, or only functional within OpenShift?
I don’t think Ansible by itself could do it either. Most of our testing has been within AWX itself (interface, reporting, history of job execution - these are soft requirements). I set up a simple test to get a feel for it:
Playbook: turns off gather_facts and contains 2 tasks:
task 1: run uname -n and store the output in a variable
task 2: display the variable
Sample size: 50 servers
Forks: 16
Ansible runtime*: 27 seconds
AWX runtime: 32 seconds (18.5% slower)
Ansible was run on a physical server with 40 threads; forks was still set to 16
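Reconstructed, the probe playbook described above would look something like this (a sketch, not the exact file):

```yaml
- hosts: all
  gather_facts: false
  tasks:
    - name: "task 1: capture the hostname"
      command: uname -n
      register: uname_out

    - name: "task 2: display the variable"
      debug:
        var: uname_out.stdout
```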
Extrapolating this to 6K servers comes to about 22.5 minutes for Ansible. That doesn’t leave much time for the other 400+ tasks which would probably exist. I’ll perform some more tests (more tasks, bigger server sample) over the next few days and see how they play out. Overall we are pretty new at Ansible and still learning it as well.
There should hopefully be a branch of Mitogen vastly better suited to larger runs publicly available soon. Due to the nature of the target, it’s difficult to find real scenarios where good results can be collected; there is a limit to the reliability of data generated by profiling against a test cluster of VMs under no load except for Ansible. If you are running Ansible in a very-many-target environment and are interested in performance, please feel free to drop me a message off-list.
I’m mentally struggling with how setting up a completely unrelated infrastructure to run AWX on provides clustering, when AWX itself does not cluster yet needs to be aware of the cluster members. Doesn’t the (AWX) cluster itself need to know about the other members of the cluster for workload scheduling, and don’t they all still need to use the same MQ instance(s)? It is not out of the question to set up an infrastructure to run the AWX infrastructure, but it is a pretty steep curve learning, setting up, and selling the concept of supporting two new infrastructures.
Mitogen looks and sounds interesting and promising also. It is something we could play with now to see the impact/benefit, and hope it would still pay off if/when the existing ruleset is rewritten.
Each AWX instance in Kubernetes is a pod, and each pod consists of the following containers: awx-web, awx-task, memcached, and rabbitmq. They can definitely share a backend Postgres database, but seeing as each pod is typically isolated and self-contained, I think they each have separate instances of rabbitmq and memcached. I can see multiple instances listed within the AWX UI when I increase the number of nodes, though, so I’m assuming there’s some sort of mechanism making a record in the database, and then maybe records in the database are used to broker who takes what job? I’m still a little fuzzy, but I will be exploring this further in the next week or so and I can let you know what I find out. One thing to note is that they all appear to be part of the first tower instance group. So there may be some built-in mechanism for splitting jobs amongst instance group members at the database level?
I need to talk to my AWX guru more about this next week. I think one difference may be the separate (not shared) rabbitmq (and memcached?) instances, as I don’t think he had those. I also seem to recall him mentioning that when he added additional instances, the capacity of the first dropped, eventually dropping to zero, which basically defeated the purpose (“capacity” is not the term he used, but I don’t recall what he called it; units of work maybe?). Interested in hearing more as you learn. Thanks for sharing.
I looked at Mitogen with someone who is more familiar with it than I am, and it looks promising. Nice work. We are hopeful that this will help bring us closer to what we need for drift management (and activities which impact the entire environment). While I didn’t see anything which would be of concern, it will be interesting to see how the various OSes we have are handled (I ran into the AIX SSH bug/pseudo-tty allocation issue when testing over the weekend).
Our AWX environment is currently sick, so I have not been able to perform proper side-by-side tests with it yet, but here are some preliminary numbers I gathered using straight Ansible:
(serial values are basically a random choice; the default strategy with default serial took 3 hours 15 minutes)
(not all 7K hosts exist/were accessible, but I didn’t count how many were not)
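For anyone reproducing these runs, the two play-level knobs being varied are roughly as follows (the batch size is arbitrary, per the note above):

```yaml
- hosts: all
  gather_facts: false
  serial: 500      # batch size; omitting serial runs all hosts through the play
  strategy: free   # default is linear: every host finishes a task before the next starts
  tasks:
    - name: sample drift check
      command: uname -n
      register: out
```

With the linear strategy, a handful of slow hosts gate every task for the whole fork group, which is consistent with the long default-strategy runtime above.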