modelling inventory variables

Hi list,

TL;DR: I’d like to know how people model their inventory data for a large set of hosts (500+ VMs) that are mostly given the same role, but with many varying application parameters, to the extent where a simple with_items list or even a with_nested list doesn’t suffice anymore.

I have been pondering the subject at hand for some time, and I’m hesitant about whether the way I started working with Ansible, and how it grew over time, is the best possible way. In particular, how to model the inventory variables, but obviously also how to implement and nest groups.

Rather than showing how I did it, let me explain some of the particulars of this environment, so I can ask the community “how would you do it?”

We’re mostly a Java shop, and have a very standardized, and sometimes particular setup:

  • 75% of all hosts (VMs) are Tomcat hosts (I’ll focus on just those from here);

  • every specific tomcat setup is deployed as two nodes (not a real cluster, but mostly stateless applications behind a loadbalancer);

  • every cluster typically has 1 application (1 deployed war with 1 context path in tomcat speak, basically providing http://node/app );

  • occasionally a node/cluster will have more than one such ‘application’ hosted. This can be on the same Tomcat instance (same TCP port 8080), but could also be living on another port (which calls for a separate IP/port combination or pool on the load balancer)

  • every application cluster is typically part of a larger application, which can consist of one to several application clusters

  • the big applications are part of a project, a project is part of an organisation

  • every application has an instance in each of three environments: development, testing and production (clustered in the same way everywhere)

  • the load balancer typically performs one, but sometimes more, health checks per application (a basic GET, checking for a string in the response), and will automatically mark a node as down if that fails

  • some applications can communicate with some other applications if need be, but only through the load balancer; this is also enforced by the network. So we need a configuration here that says ‘node A may communicate with node B’; we do that on the load balancer at the moment, and every such set needs a separate LB config;

  • every application is of course consumed in some way or another, and is defined on the load balancer (nodes and pools and virtual servers in F5 speak)

Yes, this means every Tomcat application lives, in total, on 6 instances (2 cluster nodes x 3 environments), hence 6 virtual machines.

A basic inventory would hence show as:

all inventory
  _ organisation 1
    _ project 1
      _ application 1
        _ dev
          _ node 1
          _ node 2
        _ test
          _ …
        _ prod
          _ …
      _ application 2
        _ …
    _ project 2
      _ …
  _ organisation 2
    _ …

Some other implemented groups are:

_ development
  _ organisation1-dev
    _ application1-dev
_ testing
_ production

or

- tomcat
  _ application1
  _ application2
- <some_other_server_role_besides_tomcat>
  _ application7
  _ application9
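(As an aside, a minimal, made-up INI sketch of that kind of nesting using child groups; all names below are illustrative, not our real ones:)

[application1-dev]
app1-dev-node1
app1-dev-node2

[application2-dev]
app2-dev-node1
app2-dev-node2

[project1-dev:children]
application1-dev
application2-dev

[organisation1-dev:children]
project1-dev

[development:children]
organisation1-dev

[tomcat:children]
application1-dev
application2-dev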

Our environment counts around 100 applications, hence 600 VMs at this moment, so keeping everything rigorously standard is very important.
Automating the load balancer from a per-application config has become a key issue.

So when looking beyond the purely per-group and per-node inventory, on a node we get the following data, which is important to configure things on the load balancer:

  • Within an application server:

node
  _ subapp1
    _ healthcheck1
    _ healthcheck2
  _ subapp2
  _ …

"* 75% of all hosts (vm’s) are tomcat hosts (I’ll focus on just those from here);

ok

  • every specific tomcat setup is deployed as two nodes (not a real cluster, but mostly stateless applications behind a loadbalancer);
  • every cluster typically has 1 application (1 deployed war with 1 context path in tomcat speak, basically providing http://node/app );

This sounds somewhat like serving multiple customers or different variations on a project from a shared infrastructure, e.g. AcmeCorp and BetaCorp? This seems to imply groups here to me so far.

  • occasionally a node/cluster will have more than one such ‘application’ hosted. This can be on the same Tomcat instance (same tcp port 8080), but could also be living on another port (which calls the need for a separate ip/port combination or pool on the load balancer)

This seems to imply each node/cluster has a playbook that defines what groups get what roles. If you want to generate those, that could be reasonable depending on use case.

  • every application cluster typically is part of a larger application which can vary from one to several application clusters
  • the big applications are part of a project, a project is part of an organisation

AWX is pretty useful for segregating things and permissions between organizations, if you’re talking about access control. Just throwing that out there.

  • every application has three instances in each environment: development, testing and production (clustered in the same way, everywhere)

This seems like you might want to maintain three separate inventories, so that “-i development” never risks managing production and there is no crossing of the streams (assuming people have seen Ghostbusters)

  • the loadbalancer performs typically one, but sometimes more, health checks per application (a basic GET, and checking a string in the response), and will automatically mark a node as down if that fails

  • some applications can communicate with some other applications if need be, but only by communicating through the loadbalancer; this is also enforced by the network; so we need a configuration here that says ‘node A may communicate with node B’; we do that on the load balancer at the moment, and every such set needs a separate LB config;

  • every application is of course consumed in some way or another, and is defined on the load balancer (nodes and pools and virtual servers in F5 speak)

Seems unrelated to the above bits (or at least not complicating it).

Summary of my suggestion:

  • groups per “customer”
  • separate inventories for QA/stage/prod
  • define role to server mapping in playbooks, which you might generate if inventory is a source of such knowledge
  • roles of course still written by hand

I have a similar setup, just a bit smaller, but I've moved the looping and
complexity into the configuration templates instead of the Ansible tasks.

I don't know if this helps you much, but I found that a bit of complexity
in the jinja templates goes a long way and executes much faster than
putting it into tasks.
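As a rough illustration of that idea (nothing here is from my actual setup; the variable and file names are invented): one template task renders the whole per-application config, and the template does the looping:

# a single template task instead of many looped tasks (illustrative)
- name: render all application definitions in one pass
  template: src=apps.conf.j2 dest=/etc/example/apps.conf

# apps.conf.j2 (illustrative)
{% for app in published_apps %}
[{{ app.name }}]
port = {{ app.port }}
{% for check in app.monitors %}
check_{{ check.name }} = GET {{ check.get_path }}
{% endfor %}
{% endfor %}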

We’re somewhat similar to you in size and complexity… we don’t have org and proj layers though, we’re flatter with app-type-env, mostly…

My 2c…

  • Use Ansible roles (of course)

  • Use the group_vars directory for vars, as opposed to passing the vars into the role directly; much easier to manage and track changes to envs. (also easy to parse for generating docs of what connects to what)

  • Databases, load balancers, firewalls get their own groups too, just like your app servers.

  • Deploying a new app means you need to link everything together by editing the correct group_vars files for the database, load balancer, app and firewall. Then run the playbooks in the right order. (Obviously there’s room for automation here)

  • Little-known feature: -i <directory> will cause Ansible to use all the files and scripts in the dir as the inventory (very useful!)

  • Lists of associative arrays in group_vars files are quite nice for managing accounts, ACLs and other things you need to keep on adding to.

HTH


* occasionally a node/cluster will have more than one such 'application'
hosted. This can be on the same Tomcat instance (same tcp port 8080), but
could also be living on another port (which calls the need for a separate
ip/port combination or pool on the load balancer)

This seems to imply each node/cluster has a playbook that defines what
groups get what roles. If you want to generate those, that could be
reasonable depending on use case.

In this case, there is just one playbook with one set of roles, to deploy
Tomcat and all.
The variations per application happen in inventory/group variables, as sketched below.
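A rough sketch of that pattern, with invented names; each application group just overrides the variables the shared tomcat role consumes:

# group_vars/application1-dev (invented names)
tomcat_http_port: 8080
app_context_path: app1
app_war_version: 1.4.2

# group_vars/application2-dev (invented names)
tomcat_http_port: 8081
app_context_path: app2
app_war_version: 0.9.0

The single tomcat role then only ever references {{ tomcat_http_port }}, {{ app_context_path }} and so on.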

* every application has three instances in each environment: development,
testing and production (clustered in the same way, everywhere)

This seems like you might want to maintain three separate inventories,
so that "-i development" never risks managing production and there is no
crossing of the streams (assuming people have seen Ghostbusters)

  ( /me puts on his proton pack )

Well, to defeat the marshmallow man, you need to cross them.

Avoiding accidentally running something on production instead of dev means you have to
remember to target the right inventory; here I have to remember to run with
the right --limit. Same issue, just a different CLI option.
Also, in some cases I need to run things on hosts in different
environments, so a total separation is not possible.

* the loadbalancer performs typically one, but sometimes more, health checks per application (a basic GET, and checking a string in the response), and will automatically mark a node as down if that fails
* some applications can communicate with some other applications if need be, but only by communicating through the loadbalancer; this is also enforced by the network; so we need a configuration here that says 'node A may communicate with node B'; we do that on the load balancer at the moment, and every such set needs a separate LB config;
* every application is of course consumed in some way or another, and is defined on the load balancer (nodes and pools and virtual servers in F5 speak)

Seems unrelated to the above bits (or at least not complicating it).

Well, actually, here is where it gets more complicated, and where I
struggle the most. The above was just to give a clear idea of the
environment.

Putting the config for this load balancer in the inventory, and choosing a
variable model to use with the tasks/modules I have, evolves into
something too deeply nested.

So far I have this model, and am able to configure up until pools and
monitors:

default_publishedapps:
  - name: "web"
    type: "{{ default_apptype }}"
    port: 8080
    lbport: "{{ default_lbport }}"
    monitortype: "{{ default_monitortype }}"
    quorum: 0
    monitors:
      - name: "{{ default_monitor_appname }}"
        type: http
        get_path: "{{ default_get_path }}"
        protocol: "{{ default_protocol }}"
        get_extra: "{{ default_get_extra }}"
        receive: "{{ default_receive_string }}"
        monitorname: web
#  - name: tcp
#    type: tcp
#    port: 1234
#    lbport: 601234
#    monitortype: "{{ default_monitortype }}"
#    quorum: 0
#    monitors:
#      - name: "tcp"
#        type: tcp_half_open
#        send: ""
#        receive: ""
This works with the subelements plugin.
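For anyone unfamiliar with it, a minimal sketch of how that iteration looks (the debug task here just stands in for the real F5/BIG-IP module calls):

- name: one iteration per (app, monitor) pair
  debug: msg="pool {{ item.0.name }} on LB port {{ item.0.lbport }} gets monitor {{ item.1.name }} ({{ item.1.type }})"
  with_subelements:
    - "{{ default_publishedapps }}"
    - monitors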

At this point, I need a way to say:

"App X needs to be defined in a load balancer virtual proxy, and be
accessible to node Z"

And then I need to define these proxies, and for this I need to loop through:

- settings from the former list of applications on a host
- settings from the latter list of which applications to define and make available to which other hosts
- and use network settings from those other hosts
- and all this *could* cross environments.

I didn't implement this yet (still needs work on the virtual proxy
module), but this is where I'm hesitant about how to move forward.
I feel I'd need some extras in Ansible to get to this in a clean way,
possibly:
- nesting lookup plugins
- having a way to create new lists doing things like:
  - set_fact: ....
    with_items: .....
  where the registered result is a list of all iterations.

But I might miss other solutions.
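For what it's worth, one pattern that already works is letting set_fact append to a list on every iteration (a registered variable on a looped task also ends up with a results list). A rough sketch, with invented variable names:

- name: build a flat list of LB proxy definitions (illustrative)
  set_fact:
    lb_proxies: "{{ lb_proxies | default([]) + [ {'app': item.0.name, 'lbport': item.0.lbport, 'monitor': item.1.name} ] }}"
  with_subelements:
    - "{{ default_publishedapps }}"
    - monitors

# afterwards lb_proxies holds one entry per (app, monitor) pair

This still doesn't address the cross-host and cross-environment lookups, though.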

Summary of my suggestion:

* groups per "customer"
* separate inventories for QA/stage/prod
* define role to server mapping in playbooks, which you might generate if
inventory is a source of such knowledge
* roles of course still written by hand

I think I may say this is basic usage for most things Ansible, and I'm
well aware of these practices already :)

Thanks!

Serge

Yes! I 100% agree. As templates can have quite some logic, one can leave a
big part of the complexity in them.

Alas, not everything can be configured with a text file. Here I work with a
proprietary load balancer with an API, and specific modules (the F5 BIG-IP
stuff).

Thanks,

Serge

- Use Ansible roles (of course)

Obviously :) But Ansible play-syntax-related things are not really an
issue here (except perhaps how far I can iterate through things)

- Use the group_vars directory for vars, as opposed to passing the vars
into the role directly, much easier to manage and track changes to envs.
(also easy to parse for generating docs of what connects to what)

As our environment is mostly one application type, everything must be
parametrized in inventory; I can't afford to hardcode things in playbooks
here. So, yes.

- Databases, load balancers, firewalls get their own groups too, just like your
app servers.

- Deploying a new app means you need to link everything together by editing
the correct group_vars files for the database, loadbal, app and firewall.
Then run the playbooks in the right order. (Obviously there's room for
automation here)

As of now, they are just delegated hosts, not really part of the inventory.
As I see it, the config of the load balancer depends on data from the nodes,
data that should be part of those nodes.
I don't really like the idea of having certain data about certain
applications, which is part of a node, be linked directly to a separate host.
But maybe that's part of the reason I complicate things? Not sure.

- Little known feature -i <directory> will cause ansible to use all the
files and scripts in the dir for the inventory (very useful!)

I already heavily split things up into different subdirectories :) That has
its drawbacks, however, but that's another story.

- Lists of associative arrays in group_vars files are quite nice for
managing accounts, ACLs and other things you need to keep on adding to.

Can you elaborate on what exactly you mean by this? By 'associative arrays'?

Thanks,

Serge

I would think so; the data is still logically part of your node, even if it’s split up between files so that it’s located where it’s being used.

We don’t actually split up our inventories, we just use one, and then always use --limit to control which hosts it gets applied to. Other than some base-OS-type playbooks, we have no use case where we’d run all our playbooks over all hosts, we only do very specific playbook runs.

e.g.

inventory/group_vars/tag_Role_my_db_cluster_01:

my_db_users:
  - db: database1
    login: app1
    pass: secret
    perms: rw
    …
  - db: database2
    login: app2
    pass: secret
    perms: ro

role/dbcluster/tasks/main.yml:

- dbmodule: database={{item.db}} name={{item.login}} password={{item.pass}} perms={{item.perms}} …
  with_items: my_db_users

Also, the above syntax for my_db_users scales nicely if you have long values and a lot of them per entry.

Hi,

For whoever is further interested in this discussion, allow me to link to my presentation at http://cfgmgmtcamp.eu/ on this topic:

https://speakerdeck.com/svg/modelling-infrastructure-with-ansible-inventory-data

Without the talk itself, this presentation is not fully informative, but I’m happy to further discuss it, or to receive private mail on it, if you feel the need.

Serge

Yeah, maybe start a new thread and fill in between the lines?

My understanding from private conversations was you had some possible ideas for upgrades.

I also think you are probably going to be interested in writing your own classifier, because you have some cross-cutting and modelling concerns that might not be generic. You also had some ideas around some small tweaks that could be made to the INI parser,
and some efficiency thoughts around vars_plugins (which are internals and not really intended to be user-serviceable), but I’m open to seeing if we can make that better (benchmarks might help?)

I was actually talking to a customer recently who had a similar set of concerns, but ultimately I think this gets into site-specifics very, very quickly, as they basically had a 5-dimensional problem going on and might end up breaking out Neo4j :)

We don't actually split up our inventories, we just use one, and then
always use --limit to control which hosts it gets applied to. Other than
some base os type playbooks, we have no use case where we'd run all our
playbooks over all hosts, we only do very specific playbook runs.

I'm generally (theoretically) cautious of what happens if you leave off
--limit in that case if the playbook regularly targets everything. Maybe
make a wrapper script?

Maybe it's not been a thing.

Yes, I have a wrapper script that queries the inventory and presents a
subset to be used as parameters to said script, which is then used by
developers, who can run certain things (tags) on certain groups
(development etc.).