I’m still looking for extensive examples though.
Especially large inventories of at least several hundred hosts, dozens of groups, and a deeply nested hierarchy.
Or, alternatively, why and how you avoid that, since it becomes hard to manage.
More specifically, my focus is how you manage multiple sets of variables in such an inventory. In my experience, managing group_vars files becomes very cumbersome, and it’s impossible to keep a good overview.
What problems do you run into when doing inventory at a larger scale?
We don’t have quite those numbers, but I’m happy to share anyway:
We have hosts in 3 datacenters: local, Azure, and AWS. We’re transitioning to pure AWS. We use a custom REST service as the inventory source for both Azure- and AWS-based hosts, while “local” hosts are maintained in a good ol’ static hosts file (these servers never change, so it’s fine).
All servers, regardless of location, have three attributes: servicegroup, servicefamily and application. This produces a “3-level” hierarchy which our custom inventory services expose as Ansible groups/subgroups, and we also add these attributes as hostvars.
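To make that concrete, here is a hedged sketch of the group layout the inventory service effectively produces, written as a static YAML inventory; the group and host names (web/frontend/django) are made up for illustration, not our real values:

```yaml
# Illustrative only: our real groups come from a custom REST inventory service,
# but the effective structure looks roughly like this static YAML inventory.
all:
  children:
    servicegroup_web:            # level 1: servicegroup
      children:
        servicefamily_frontend:  # level 2: servicefamily
          children:
            application_django:  # level 3: application
              hosts:
                web01.example.com:
                  servicegroup: web        # the same attributes also set as hostvars
                  servicefamily: frontend
                  application: django
```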
We have a “global” entrypoint for all playbooks. Whichever server you want to configure, you use the same entrypoint playbook. This file is quite simple: it walks a config hierarchy (using include_vars) and from there selectively enables roles to execute on the servers that need them.
The config hierarchy is based on building and testing various paths where there may or may not be a file. If there is one, it’s included; if not, it’s skipped without error. So I might have:
1/prod.yml
1/dev.yml
2/db.yml
3/db_cache.yml
4/webserver_external_django.yml
This allows us to place defaults at the “top” of the hierarchy and override them further down. Inside the last file there might be a flag such as “do_run_role_webserver_external_django: true”.
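A minimal sketch of that “include if present, otherwise skip” layering, assuming hypothetical variable names (env, servicegroup, servicefamily, application) and a config/ directory next to the playbook, neither of which is our exact layout:

```yaml
# Later files override variables set by earlier ones; missing files are skipped.
# Paths and variable names are assumptions for illustration only.
- name: Load config layers that exist
  include_vars: "{{ item }}"
  loop:
    - "{{ playbook_dir }}/config/1/{{ env }}.yml"
    - "{{ playbook_dir }}/config/2/{{ servicegroup }}.yml"
    - "{{ playbook_dir }}/config/3/{{ servicefamily }}.yml"
    - "{{ playbook_dir }}/config/4/{{ application }}.yml"
  when: item is file
```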
After the config hierarchy is included, the rest of the “main” entrypoint playbook basically consists of “include_role … when” statements.
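For example (the role name here is real-ish but the exact task is illustrative, not copied from our playbook):

```yaml
# One of many "include role ... when" gates in the entrypoint playbook.
- name: Configure the external Django webserver where enabled
  include_role:
    name: webserver_external_django
  when: do_run_role_webserver_external_django | default(false) | bool
```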
This works well for us, but it has a few limitations:
Skipping roles actually takes a bit of time in Ansible. We’re easily looking at Ansible just sitting there for 2-3 minutes while blazing through “skipped” tasks.
Our setup makes it hard to coordinate “back and forth” between servers. Our config is built around always being able to “touch” a single server without touching any others, so it’s a deliberate trade-off, but the cost might be too high for others.
As far as the “config tree” goes, we’re looking at replacing it with something else. Things we’re considering:
Consul KV (we have a working PoC of this up and running)
AWS SSM Parameter Store (supports “pathed” params, so it should be easily pluggable; see the lookup sketch below)
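As a rough idea of what the SSM option could look like in a task, using the amazon.aws.aws_ssm lookup (the parameter path and region below are invented, not our real layout):

```yaml
# Hedged sketch: read a role flag from SSM Parameter Store instead of the config tree.
# The parameter path and region are made-up examples.
- name: Read role flag from SSM Parameter Store
  set_fact:
    do_run_role_webserver_external_django: "{{ lookup('amazon.aws.aws_ssm', '/prod/db_cache/webserver_external_django/enabled', region='eu-west-1') | bool }}"
```

Consul has an equivalent lookup plugin (community.general.consul_kv) that would slot into the same place.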
By the way, those REST services we use as inventory scripts for both Azure and AWS essentially turn instance tags into groups/subgroups. I guess that was the essence of your question (how to manage groups), and the answer is that we don’t: we use instance tags for that, and the groups/subgroups are built for us by the inventory service we wrote.
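For comparison (not what we actually run): the stock amazon.aws.aws_ec2 inventory plugin can do a similar tag-to-group mapping with keyed_groups, although it produces flat tag-based groups rather than the nested hierarchy our service builds. The tag names below mirror ours, but everything else is illustrative:

```yaml
# aws_ec2.yml -- illustrative config for the stock AWS inventory plugin,
# shown only as a built-in alternative to a custom tag-to-group service.
plugin: amazon.aws.aws_ec2
regions:
  - eu-west-1
keyed_groups:
  - key: tags.servicegroup
    prefix: servicegroup
  - key: tags.servicefamily
    prefix: servicefamily
  - key: tags.application
    prefix: application
compose:
  # also expose the tags as plain hostvars
  servicegroup: tags.servicegroup
  servicefamily: tags.servicefamily
  application: tags.application
```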
I typically set up my inventory (from most abstract to most specific) with account, environment, group, and name. Account is just the AWS account. Environment is dev/stage/prod, etc. Group is something like “elasticsearch” or “bastionhosts”. And name is the individual server. I have group_vars/host_vars set up for each of these levels to dynamically control behavior. Then I typically have entrypoint playbooks named after the group they run on, for example “bastionhosts.yml”. This is usually just a list of roles to run on one or more groups.
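A hedged sketch of what such an entrypoint might look like (the role names and group_vars file names in the comments are assumptions for illustration, not a prescribed layout):

```yaml
# bastionhosts.yml -- illustrative entrypoint playbook for the "bastionhosts" group.
# Variables come from group_vars at each level, e.g. group_vars/all.yml (account-wide),
# group_vars/prod.yml (environment), group_vars/bastionhosts.yml (group),
# and host_vars/<name>.yml for individual servers.
- hosts: bastionhosts
  become: true
  roles:
    - common   # hypothetical baseline role applied everywhere
    - bastion  # hypothetical role specific to this group
```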