With the update to 1.7, Ansible startup times went up to 25 seconds for
me. After some debugging I found poorly optimized functions in the
inventory parsing. With more than 1000 servers and 10,000 groups the
current code is not really usable.
Problem:
get_group(self, groupname) in lib/ansible/inventory/__init__.py is
O(#groups) per call and is called #groups times, so the total cost is
O(#groups^2).
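For reference, the lookup in question is a plain linear scan, roughly
like this (reconstructed from the description above, not copied from the
source):

    def get_group(self, groupname):
        # scans the whole group list on every call -> O(#groups)
        for group in self.groups:
            if group.name == groupname:
                return group
        return None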
Solution:
Do not iterate over the list every time; looking groups up in a
hashtable (a Python dict) is O(1) on average, which should reduce the
total time to roughly O(#groups).
Can someone who actually knows Python (and ideally knows what the
inventory code does) check whether I need to invalidate the cache in
more places? Or suggest a better way to implement this?
I'd feel better if someone could give feedback before I send a pull
request for this...
If I'm not the only one who feels Ansible could be a bit faster to
start, and others would look at patches, I'd appreciate feedback on
that too.
(There are about 120k getcwd/stat64/lstat64 system calls during startup
in my setup. 95% of those could easily be avoided with some caching.)
Changes to make inventory parsing faster is definitely awesome, but to the point of stats, we use an inventory plugin that provides a _meta key in the return json. This is where all the host/group facts are stored, and when ansible sees _meta it’ll avoid doing any stats to read groupvars or hostvars files. This was a HUGE win for us, and I recommend it to anybody that has a large set of hosts or groups.
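To make that concrete, a dynamic inventory script returning "_meta"
looks roughly like the minimal sketch below (group names, hosts and
variables are made up for illustration). With "_meta.hostvars" present,
Ansible picks up all host variables in one pass instead of calling the
script again with --host for every single host:

    #!/usr/bin/env python
    # Minimal dynamic inventory sketch that includes "_meta" so Ansible
    # gets all host variables from a single invocation.
    import json

    inventory = {
        "webservers": {
            "hosts": ["10.0.0.1", "10.0.0.2"],
            "vars": {"http_port": 80},
        },
        "_meta": {
            "hostvars": {
                "10.0.0.1": {"ansible_ssh_port": 2222},
                "10.0.0.2": {},
            },
        },
    }

    print(json.dumps(inventory))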
Yeah, with what Jesse says, it sounds like your inventory script is forgetting to return “_meta”, as otherwise I don’t know of any inventory speed issues in 1.7.X.
Thanks, I missed the _meta option so far. But my test setup still calls
stat() on everything despite _meta/hostvars. (1.7.2 and 1.8.)
It also uses the hostvars I passed to it. But it does not seem to work
for groupvars at all - do you use a plugin/patch for that?
My main problem is actually NOT the syscalls, it is get_group(). I
suspect that would still be called even if _meta worked as intended.
And with 20k groups and O(number_of_groups^2) there is a speed issue.
Somehow it was still reasonable in 1.6 (get_group() is unchanged, so
either it was not called as often or something else changed) and has
been slow since 1.7.
The patch https://github.com/hrld/ansible/commit/6f349c4d09c5003f970981b247e99ede12ad0d6c
does nothing about the stat syscalls but speeds up get_group() by using
a hashtable instead of a list to look up groups. (O(n) vs. O(1) per
lookup - with my n being ~20,000 that helps a lot...)
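The core idea is roughly the following (a simplified sketch, not the
patch itself; method and attribute names here are illustrative):

    class Inventory(object):

        def __init__(self):
            self.groups = []
            self._groups_by_name = None  # lazily built name -> Group cache

        def _invalidate_group_cache(self):
            # must be called from every place that adds or removes groups
            self._groups_by_name = None

        def add_group(self, group):
            self.groups.append(group)
            self._invalidate_group_cache()

        def get_group(self, groupname):
            # O(1) average dict lookup instead of scanning self.groups
            if self._groups_by_name is None:
                self._groups_by_name = dict((g.name, g) for g in self.groups)
            return self._groups_by_name.get(groupname)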
Assuming that most users will not have group vars and host vars for
more than 10% of their hosts and groups (or have less than 1000 of
each, so it would not matter), reading the relevant directories once
and keeping a hashtable of the existing filenames should make big
inventories with hostvars/groupvars in files much faster. Is there any
reason not to do that?
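A minimal sketch of that idea (paths and helper names are made up for
illustration):

    import os

    def list_vars_dir(path):
        # one listdir() per directory instead of four lstat()s per name
        try:
            return frozenset(os.listdir(path))
        except OSError:
            return frozenset()  # directory missing: nothing can match

    group_vars_files = list_vars_dir("inventory/group_vars")

    def find_vars_file(name, existing):
        for candidate in (name, name + ".yml", name + ".yaml", name + ".json"):
            if candidate in existing:
                return candidate
        return None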
lstat64 is called on $PATH_OF_INVENTORY/host_vars/$IP.* and
$PATH_OF_INVENTORY/group_vars/$GROUP.*, about 14k calls with
"host_vars" and 48k with "group_vars" in it.
lstat64("ANSIBLE/inventory/group_vars/mygroup", 0xbfc09f5c) = -1 ENOENT
(No such file or directory)
lstat64("ANSIBLE/inventory/group_vars/mygroup.yml", 0xbfc09f5c) = -1
ENOENT (No such file or directory)
lstat64("ANSIBLE/inventory/group_vars/mygroup.yaml", 0xbfc09f5c) = -1
ENOENT (No such file or directory)
lstat64("ANSIBLE/inventory/group_vars/mygroup.json", 0xbfc09f5c) = -1
ENOENT (No such file or directory)
lstat64("ANSIBLE/inventory/group_vars/all", {st_mode=S_IFREG|0664,
st_size=71, ...}) = 0 lstat64("ANSIBLE/inventory/group_vars/all.yml",
0xbfc09f5c) = -1 ENOENT (No such file or directory)
lstat64("ANSIBLE/inventory/group_vars/all.yaml", 0xbfc09f5c) = -1
ENOENT (No such file or directory)
lstat64("ANSIBLE/inventory/group_vars/all.json", 0xbfc09f5c) = -1
ENOENT (No such file or directory)
lstat64("ANSIBLE/inventory/host_vars/127.0.0.1", 0xbfc09f5c) = -1
ENOENT (No such file or directory)
lstat64("ANSIBLE/inventory/host_vars/127.0.0.1.yml", 0xbfc09f5c) = -1
ENOENT (No such file or directory)
lstat64("ANSIBLE/inventory/host_vars/127.0.0.1.yaml", 0xbfc09f5c) = -1
ENOENT (No such file or directory)
lstat64("ANSIBLE/inventory/host_vars/127.0.0.1.json", 0xbfc09f5c) = -1
ENOENT (No such file or directory)
If I understand you and the documentation correctly there should be no
attempts to access host_vars/* at all. Ansible is using the SSH port
from the _meta block, but something must still be wrong here?!
The major difference between 1.6 and 1.7 is:
- Until 1.6, group_vars/host_vars were only read for the hosts in the
run (that is, with a possible --limit applied), *and* they were only
read for new hosts that weren't touched yet at the start of each play.
For big environments this was less optimized, as it re-read every
group_vars file for every host that was a member of the respective
group.
- Starting with 1.7, the whole inventory and *all* host/group_vars are
read at the very start, when initializing the inventory, but each
group_vars file is now only read once.
For very large inventories - like you have - that first initialisation
tends to take a bit longer, but the total time of running the whole
play is shorter.
Basically, from 1.7.2 on, every host_vars and group_vars file should
only get read once, but they *all* are read once. Given your extremely
large numbers of hosts and groups, I can see this might be less
performant than it used to be.
But your issue, as I understand it, is not about reading lots of files,
but about iterating through the groups a host is a member of, by name,
to retrieve the group object, which is what your patch addresses.
Given that a host might be a member of a *very* large set of groups, I
can see your patch being a good optimization. I'd vote you propose a
pull request for it.
Correction: _meta was already in the original inventory. No surprise
that it did not speed things up. If Ansible will always try to read all
the files, there is probably nothing left that can be improved without
patching the inventory-parser code.
Hosts: 3500
Groups: 12000
Host entries in groups: 60000
lstat() calls that could be optimized away by reading the two directories
and caching the list of existing files:
(3500+12000)*4 = 62,000
So if nobody sees a good reason to leave it the way it is, I'll also
write a patch for that, which should get the runtime for a single
ansible-ping down from 25 seconds to about 10 seconds in our case.
(It is 13 seconds with the current patch.)
We designed our setup with an inventory of 3000+ hosts so that groups
are made dynamically, rather than pre-made. The reason is that every
permutation of a property (dmz/internal, prod/uat/int/dev/mgt,
virt/phys, os-version, various roles, app-owner, ...) quickly inflates
the number of groups one may need to have. And not all our playbooks
make actual use of these groups anyway.
So instead of having the inventory script create all the groups in
advance (a huge inventory) and having Ansible import all this data, we
only have it export the hostvars that can be used to build those
dynamic groups (via group_by), and do it dynamically, on an as-needed
basis.
This works well, although it was slow with v1.6 when running for all
3000+ hosts; it drastically improved in v1.7 to only a few seconds.
Here it is 3500 hosts, and startup slowed down from ~10 to 25 seconds
with 1.7. How many is "a few"? I'd like to compare after patching the
other two issues... those might also gain you another 2 seconds, as
they (partially) depend on the number of hosts.
I'll keep the hostvars+group_by solution in mind if we need more groups
in the future. Most of the current groups are useful in many places, so
after patching the current bottleneck it's probably not worth the extra
complexity to first create groups in all playbooks.
Although in the past we seldom ran playbooks on all systems, we now
use Ansible for reporting connectivity issues, and that runs on all
systems regularly to identify network/routing/firewall issues. This
is not a typical environment; we have 7x more VLANs than we have
Linux systems (don't ask).
BTW, the dynamic inventory script is not a real-time process, since it
takes time to get all the information out of vSphere, Satellite,
Infoblox and DNS. We create a cache every hour through cron, rather
than doing it on the fly.
All our inventory data comes from one database, so it is fast and the
cache gets created every minute. Even with a 2-3 second runtime, a
cache makes sense for everyone using Ansible in some interactive way.
Indeed, combine_vars seems to be responsible for this; it can take a long time (16s+ with 453 groups and 132 hosts). I think in most situations the vault feature will not be used, so by adding a test before combine_vars we can save a lot of time in most situations:
This patch probably does not do what you want.
With the patch, host_vars and group_vars files are never touched again.
So it affects not only _vault_password usage but at least everyone with
variables in files. (I did not test hostvars from _meta.)
There are a few things that might be helpful to know when using big
inventories.
1. The official solution seems to be to use Tower or dynamic (small)
inventories (see the cloud docs).
(Both are not an option for me for various reasons, but either there
are no users with many hosts and groups or they are mostly happy with
those.)
2. In theory you can let Ansible merge several inventories (e.g. if you
have two distinct sources of data). DON'T. It will cost a lot of extra
time. (Nobody here wanted to debug that; it was not important.)
3. It would be possible to reduce the number of lstat() calls (down to
less than 1%) with a small patch. After profiling an Ansible startup
sequence I found it was not my main problem, but you might have
different results for your setup.
If you send me the number of hosts and groups and the output of
python -m cProfile /usr/bin/ansible $IP -m ping
I'd like to compare where the time is lost and look into this again.
Maybe someone will even pull patches for this after v2 is released.
It was rebased and merged into 2.0, but if you need it before 2.0 is stable you can just apply it to 1.9.x yourself.
If you already have performance issues with 1.9 you probably want to wait and hope for some patches in 2.0 before you try to use it. Reading the inventory might be faster there now; everything else is much slower.
Testing with 2.0 is a good idea for anyone with a lot of playbooks and roles. You will find the typos in your playbooks that 1.x silently ignored, the options that should be used in the future (e.g. become, which can already be cleaned up now), and changes (e.g. escaping " with fewer backslashes) that you need to roll out the moment you move over to 2.0.
I'm not sure if anyone has already tried to create a list of changed behaviour from 1.x to 2.0. It might be a bit difficult before the developers decide what they consider a bug and which previous ways of doing things were never documented and do not need to be supported.
Cool, thanks for the patch. It didn’t result in the speed-up I was hoping for. I think the big thing for me is the stat per host to check for host vars files. Even with the “_meta” key in my dynamic inventory, the startup time is dominated by lstats for non-existent files. It seems to do 7 lstat calls per host in the inventory even if the parent directory for the constructed path doesn’t exist. I think guarding that call with a stat call on the parent directory, followed by an os.listdir, could get this down from 72,000 calls to ~20 in my case.
Most of the lstat() calls can be removed easily (just read the directory and check the resulting hash to see if you need to deal with the file), but you might be disappointed by the performance gains there. You can just comment out the whole block to see how much it saves; in my case it was under 20% of the performance gain from the get_group patch. System calls are not that expensive compared to what Ansible is doing elsewhere. (If you find a way to cause real I/O operations for each lstat, you can probably patch this while one Ansible instance is busy reading the inventory. Otherwise use cProfile to check what you really want to patch...)
Can someone explain why several JSON libraries are supported in Ansible? Using the fastest one that actually behaves in a sane way (and guarantees unicode strings as return values) seems a no-brainer to me.
Otherwise more tests for the different behaviours are needed. And installing one more Python library seems simple enough.
Immediate benefit: the other patch from the ticket saves another 20% of startup time in useless unicode-string checks.
If your inventory is small (less than a few thousand hosts and groups) you will not notice any real difference.