Optimizing inventory parsing: vars_plugins need some love

Hi,

I have been diving into the Inventory the last couple of days, and for part of that I wrote a script to dump my - mostly dir/ini based - inventory into a json set that follows the api for dynamic inventory. This made me notice that dumping the inventory, based on dir/ini parsing, then adding some own parsing for the group/host_vars files and then offering that json result set back to ansible through a script that just cat’s the json as part of an inventory script, was way more efficient.

Running a simple ansible -m debug on an inventory subset containing around 600 hosts and 600 groups, standard ansible took around 50s to parse the inventory, whilst my script took around 5s.
The issue at hand is how the parsing in vars_plugins works. (Vars_plugins basically means parsing group_vars and host_vars - as I’m not aware of any other implementations.)

Whilst the groups_vars.py plugin has probably the most work to parse group_vars files, the way it is now implemented, the plugin gets called for every host in the inventory and will parse all files again and again (which makes me think of why the “_meta: hostvars” key was implemented in the script plugin api.)

Previously, I tried implementing caching of the group parsed data in that plugin, but as this gets called as inventory.get_variables(host) from runner(), which forks into several processes, this caching doesn’t do any real good.

Actually, what seems a bit weird now I look at it, is that the group_vars plugin makes it’s own logic on variable precedence and yield vars per host only.

Whilst one would expect that e.g. the data from group_vars/all.yml would end up in the vars parameter of Group(‘all’), it now gets directly worked into the vars section of every host, whilst other parts of the inventory (vars section in ini files and vars section in the json from scripts) remain attached to their respective groups, only to have its precedence calculated later in Host.get_variables()

So, there is definitely lots of improvements to make to this code, but not without some changes into core.

From the top of my head, I think a patch would need to:

  1. at least modify the vars_plugins to return the main group- and host vars, as they are set in the respective files, and let the precedence in the other code parts do it’s job as they already do for other inventory sources
    probably something like moving part of that logic from Inventory._get_variables(host) to Inventory._get_group_variables(group)
  2. But also move that logic totally out of runner code, out of those specific methods, and put it 100% into Inventory code, avoiding late parsing in Runner code.
  3. Eventually - if I dare to ask - I wonder to what degree host/group_vars parsing should remain in a plugin infrastructure, given it’s importance, it’s specifics, and also being the place where vault is implemented.
    Does anyone know of any other vars_plugin implementations that justify keeping vars_plugins? Not that I would recommend to abolish that of course, but i would seriously consider to implement host/group vars into core, and not as a plugin.

Thoughts?

Serge

Hi,

I have been diving into the Inventory the last couple of days, and for
part of that I wrote a script to dump my - mostly dir/ini based - inventory
into a json set that follows the api for dynamic inventory. This made me
notice that dumping the inventory, based on dir/ini parsing, then adding
some own parsing for the group/host_vars files and then offering that json
result set back to ansible through a script that just cat's the json as
part of an inventory script, was *way* more efficient.

This seems fine to load them in all of the inventory scripts, the reason
they are done as a plugin is not important to me -- however I do want to
make sure they are loaded for both Ini and Scripts in code that is part of
core, and not something each inventory script would need to do.

So, there is definitely lots of improvements to make to this code, but not
without some changes into core.

That's good.

I think I'm only interested in seeing upgrades to core.

Doing things on the side fragments everyone and makes future improvements
more difficult because there is less standard to rely upon.

From the top of my head, I think a patch would need to:

   1. at least modify the vars_plugins to return the main group- and host
   vars, as they are set in the respective files, and let the precedence in
   the other code parts do it's job as they already do for other inventory
   sources

I'd argue that we remove group_vars and host_vars from vars plugins
entirely.

   1. probably something like moving part of that logic from
   Inventory._get_variables(host) to Inventory._get_group_variables(group)

Sounds reasonable

   1. But also move that logic totally out of runner code, out of those
   specific methods, and put it 100% into Inventory code, avoiding late
   parsing in Runner code.

Agree!

   1. Eventually - if I dare to ask - I wonder to what degree
   host/group_vars parsing should remain in a plugin in

As per above, no problems ditching this as a plugin, but we should keep the
plugin loader for vars_plugins around (even if undocumented) to not break
people, though the dir would just ship empty or with a noop.py

   1. frastructure, given it's importance, it's specifics, and also being
   the place where vault is implemented.
   Does anyone know of any other vars_plugin implementations that justify
   keeping vars_plugins? Not that I would recommend to abolish that of course,
   but i would seriously consider to implement host/group vars into core, and
   not as a plugin.

Hard to say, but I'd rather not break people if we could help it. A
noop.py (basically an empty dir) would have extremely minimal cost.

​100% +1 from me.

I think we are on the same line for all of this. I'll put that on my todo
list :slight_smile:

  Serge​

As a follow-up here, I just filed

https://github.com/ansible/ansible/pull/6734
“vars_plugins refactor and move into core inventory”

This patch set does a major set of things, with a big performance benefit in the case of very large inventories (tested on one with 750 hosts)

  • Split out parsing of vars files from per host, triggered by runner, to per host and per group parsing, instead of (re)parsing all groups each time for every host. Introduce separate methods to parse a particular group or host and inject this into the proper Inventory objects, instead of parsing all vars for 1 host. This enhances performance.
  • Introduce a noop vars plugin, to prepare to move the group_vars plugin into core.
  • Move inventory.set_playbook_basedir from ansible-playbook to playbook constructor, to enable it to trigger inventory events.
  • Move group/host_vars parsing from vars_plugin code into core inventory. this unifies parsing of all variables from all sources into the Inventory constructor, instead of being called for various sources from various ansible modules at various times. Also move methods to read files from the plugin, to util (including the vault specific code.)
  • This patch until here has the downside to parse the whole inventory with all its vars at Inventory.init() time, which takes some more time due to the group/host_vars files loading. This was however especially a problem (as in noticeable extra time) in the case of a (very) large inventory with hash_behaviour=merge, which is optimized by replacing the deepcopy function, by manual parsing in utils.def merge_hash(a, b):

This also integrates the vault_password as a parameter of the Inventory, which makes a lot of sense :slight_smile:

Happy testing :slight_smile:

Serge

Also, for sake of completeness, https://github.com/ansible/ansible/pull/6718 was a follow up thing here, but given it stands on it’s own, I filed it separately.

Serge

Serge,
Do you have any thoughts on the following:

  • Given a group, determine the original hosts definition from inventory (so that if I define web[001:100] it doesn’t expand into 100 separate hosts when graphing!)
  • Determine the variable names that are defined for a particular group (even better if we can tell that a particular variable overrides that of a parent) - this would help to show what variables come from what groups.

I’ve had a look at the current ansible inventory methods and it seems that group_vars and hosts are expanded too early for the needs of the inventory grapher.

I’m not sure how widely useful such requirements would be, but if it’s easy to integrate into your changes, that’s great - otherwise I suspect variable graphing and expressing hosts as ranges rather than a potentially large bunch of hosts may have to wait until someone has some better ideas than mine (most of mine involve reimplementing bits of ansible.inventory inside ansible_inventory_grapher, which seems fragile to say the least!)

Will

​Hi Will,

Do you have any thoughts on the following:
* Given a group, determine the original hosts definition from inventory
(so that if I define web[001:100] it doesn't expand into 100 separate hosts
when graphing!)

​You mean, in such way that you could just graph one item for this set of
hosts? I'm afraid not. Also, this notation is specific for INI files, there
is no equivalent for scripts/json inventory, so I think it would be a bad
idea to code your scripts with such a particular option in mind.
What you could do, is only graph all hosts "if there aren't too many in a
group". Given a notation web[001:100], those 100 hosts will always be in
one same group.
When looking at graphing the hosts of a particular group, you could look at
Group.hosts, and if that list is bigm e.g. only graph the first item.

* Determine the variable names that are defined for a particular group
(even better if we can tell that a particular variable overrides that of a
parent) - this would help to show what variables come from what groups.

I've had a look at the current ansible inventory methods and it seems that
group_vars and hosts are expanded too early for the needs of the inventory
grapher.

​That is one of the things i address in my refactoring of the vars_plugins,
there will be separate methods to retrieve the vars for a particular host
​and a specifi group, while not retrieving inherited vars from parent groups
(whereas now, vars plugins calculate all inherited vars per host only)

I'm not sure how widely useful such requirements would be, but if it's easy

to integrate into your changes, that's great - otherwise I suspect variable
graphing and expressing hosts as ranges rather than a potentially large
bunch of hosts may have to wait until someone has some better ideas than
mine (most of mine involve reimplementing bits of ansible.inventory inside
ansible_inventory_grapher, which seems fragile to say the least!)

​Indeed, I think yo need to stick with the "public" methods and variables
of the objects.​

On a sidenote, there might be two methods to loop the inventory forest: top
down, starting from 'all' ans looking at childs, or, like you dit, looking
at hosts and their parents. just mentioning this, I'm not sure which
approach is the best.
I can just think of one advantage of the top down approach: only graph a
part of the inventory, limiting the tree to a certain depth. Just thinking
out loud.

When time permits, I might play around a it with your script and see what
options we might have.

  Serge

Due to the way things late-evaluate or override, you’re not going to have the final value of a lot of variables.

I think this is still quite reasonable.