New user here, trying to figure out the best way to convert our current server provisioning system to Ansible. Our system uses about 5 different attributes to provision each server, and we have about 1,000 servers. I’m wondering whether we could get by using Ansible’s built-in support for “groups” and variables in “group_vars”. That would certainly be the easiest way… I’m just not sure it would scale well at all.
I’m estimating about 100 different “groups” based on all combinations of these attributes: roughly 40 groups corresponding to playbooks (webserver, dbserver, appclient, and so on), 40 different “projects” (managing root passwords and access), and 8 different “locations” (managing things like NTP server settings).
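To make that concrete, each host would appear in one group per attribute, something like this (all names invented for illustration):

    # inventory sketch: web01 picks up vars from all three attribute groups
    [webserver]
    web01.example.com

    [project_billing]
    web01.example.com

    [site_nyc]
    web01.example.com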
Is anyone out there doing something like this? My worries are:
Scalability. Can Ansible handle this? What about 10k servers? The inventory script will contain roughly 100 different groups, totaling about 5,000 server entries (1k servers × 5 groups).
Maintainability. The group_vars directory will probably contain 100+ files. The all.yaml file itself will probably be hundreds of lines long.
Managing group conflicts. What happens when the “ntp_server” setting, which is supposed to live in a site-specific YAML file, gets put into one of the project-specific YAML files instead? According to the documentation, the file that sorts last alphabetically takes precedence (see the sketch below). That’s really not acceptable, but I don’t know another way to do it.
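To illustrate that worry with invented group names:

    # group_vars/site_nyc.yml  (where ntp_server is supposed to live)
    ntp_server: ntp1.nyc.example.com

    # group_vars/team_zeus.yml  (project file where it lands by mistake)
    ntp_server: ntp.lon.example.com

    # For a host in both groups, "team_zeus" sorts after "site_nyc",
    # so the project value silently wins and nothing warns anyone.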
Summary: looking for people with real-world Ansible experience who may be dealing with a similar setup.
I don’t have as many servers to manage as you do, but I do have multiple locations (which I express as different inventories) and use groups with child groups and 20 or so server types. Another difference between my situation and yours is that I started from scratch and didn’t really have anything to migrate into group_vars.
One trick I use to minimize repetition is to have some groups which are used across different inventories, and others which are used only inside a single inventory. While this sounds like it might be a pain, it’s actually very easy to maintain: the inventory-specific group vars can all have the same keys; it’s only the group name that has to differ (in one place in the inventory, and in the group_vars file or directory named for the group).
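As a rough sketch of that layout (paths and names invented, not my actual tree):

    inventories/
      nyc/
        hosts              # defines the inventory-specific group [nyc]
        group_vars/
          nyc.yml          # same keys as lon.yml, different values
      lon/
        hosts              # defines [lon]
        group_vars/
          lon.yml
    group_vars/
      webservers.yml       # shared group, present in every inventory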
Also, I’d suggest going through your vars to see which ones are actually specific to roles, and moving those to be role vars or role defaults. This can limit the ‘blast area’ of a specific var’s influence while still allowing overrides in playbooks or via -e or include_vars. You might find this reduces the size of your group_vars/all.
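For example, a setting only one role cares about can live in that role’s defaults, where it has the lowest precedence and anything can still override it (a hypothetical sketch):

    # roles/ntp/defaults/main.yml -- lowest precedence
    ntp_server: ntp.example.com

    # group_vars/site_nyc.yml -- overrides the role default for one site
    ntp_server: ntp1.nyc.example.com

    # or ad hoc from the command line, which beats both:
    #   ansible-playbook site.yml -e ntp_server=ntp2.example.com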
I don’t have great advice for maintainability (keep everything in source control, obviously) or for managing group conflicts, since there’s only a small group of us who modify the Ansible configuration. Having expressive (if sometimes long) names for vars can help, though. I have seen blog posts from others who have come up with naming conventions for vars, although I haven’t felt the need for this yet.
I certainly can’t imagine managing without group_vars. I know the number of places you can put vars can seem bewildering, but I haven’t found this to be a problem in practice.
We’re not at exactly this scale (yet), but we found that using include_vars with dynamically resolved file names (based on hosts’ groups and properties) works well inside roles.
Each role pulls in vars from multiple files based on things like inventory_hostname and group_names. Some roles load a ‘config’ file and then load more vars based on what was in that config.
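Roughly like this (a sketch only; the file layout and names are hypothetical):

    # roles/someservice/tasks/main.yml
    - name: load the most specific vars file that exists
      include_vars: "{{ item }}"
      with_first_found:
        - "vars/{{ inventory_hostname }}.yml"
        - "vars/{{ group_names | first }}.yml"
        - "vars/defaults.yml"

A second include_vars can then key off a value the first one loaded, which is the ‘config file loads more vars’ pattern.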
We are doing similar things at a not-too-different scale (~2,000 servers). We abuse groups heavily. I have the mantra that a fact should only be defined once: if I have a link between two machines, I don’t want the same link definition in both places. This means I lean heavily on two things: groups and include_vars. Also, everything we do is done in roles.
I should preface this by saying I spent quite a while figuring out what kind of data model I wanted for defining all the pieces of the data center. Don’t try this approach unless you like staring for long periods at whiteboards covered with nested Python data structures.
One of the harder things we do is define haproxy configs for all our load balancers (somewhere between 50 and 100 separate services). This ended up being too complex for a j2 template, so I wrote a Python action plugin. I have a group called haproxies, with a subgroup for each service and sub-sub-groups for the load balancers in a given data center. All the defaults are defined in haproxies; the information needed for a given service’s config lives in that service’s haproxies_svc_xxx group_vars. Then there are the haproxies_svc_xxx_yyy groups, where yyy is the data center. These just contain pointers to the groups of production servers that this data center’s haproxy config for the service should point to. Finally, I have groups for a service’s production servers, with subgroups for each set that will be separately referenced in the haproxy config.
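The nesting looks roughly like this (service and site names invented):

    # defaults for every load balancer
    [haproxies:children]
    haproxies_svc_web

    # per-service settings live in this group's group_vars
    [haproxies_svc_web:children]
    haproxies_svc_web_nyc

    # the actual load balancers in one data center
    [haproxies_svc_web_nyc]
    lb01.nyc.example.com

    # production servers the generated config points at
    [websvc_prod_nyc]
    web01.nyc.example.com
    web02.nyc.example.com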
I have ansible.cfg set to merge dictionaries, and then have one big haproxy dictionary. Each variable in it has a default key and, optionally, a key for each service. The template code looks for a service-specific key and, if there isn’t one, falls back to the default key. This gives a default-plus-override model while keeping a single source of truth.
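In outline (variable and service names invented; the real dictionary is much bigger):

    # ansible.cfg must contain:
    #   [defaults]
    #   hash_behaviour = merge

    # group_vars/haproxies.yml
    haproxy:
      client_timeout:
        default: 30s

    # group_vars/haproxies_svc_web.yml -- merged on top, not replaced
    haproxy:
      client_timeout:
        svc_web: 5s

    # in the j2 template, the service key wins if present, else the default:
    #   timeout client {{ haproxy.client_timeout[svc]
    #                     | default(haproxy.client_timeout['default']) }}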
We also use centrally managed vars files for things that don’t fit the hierarchy. All our BGP definitions (we use BGP in a number of ways) span multiple groups, so we keep a central bgp_vars file. That one is hand-edited, and all the other groups that need the information pull from it. We also go the other way: we take everything we learned from all the haproxy groups and generate a file with that information organized so other roles can use it easily in templates. That file is not a source of truth, just a convenient translation of other data.
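So any role that needs BGP data just loads the same file, something like this (path invented):

    - name: pull in the hand-edited central BGP definitions
      include_vars: "{{ playbook_dir }}/central_vars/bgp_vars.yml"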
A few tips on scaling groups. Be religious about prefixing group and role names with the related entity: never have loghost, always group_loghost or role_loghost. That way it is still easy to follow, but names never clash. This means role and group names need to differ; I do this by making roles singular and groups plural (haproxy is my role, haproxies is my parent group).
My experience is that scaling is a question of how many systems × how many operations you run on them. Doing lots of operations on a small set of machines works. Doing a small set of operations on a large set of machines works. Trying to configure an entire data center from bare install in a single playbook is probably a bad idea. Other than things like rolling passwords and keys, we have moved to a model of configuring all the instances of one machine type at a time. Since everything is in roles and inventory definitions, it’s just a matter of listing all the roles needed to set up that service.
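So a machine type’s playbook ends up being little more than this (role names hypothetical):

    - hosts: webservers
      roles:
        - common
        - ntp
        - webserver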
This is just one way to attack this problem. Hope that helps.