With the update to 1.7, Ansible startup times went up to 25 seconds for
me. After some debugging I found poorly optimized functions in the
inventory parsing. With more than 1000 servers and 10,000 groups the
current code is not really usable.
Problem:
get_group(self, groupname) in lib/ansible/inventory/__init__.py is
O(#groups) per call and is called #groups times, so the total cost is
O(#groups^2).
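For reference, the lookup in question is a plain linear scan, roughly
like this (reconstructed from the description above, not copied from the
source):

    def get_group(self, groupname):
        # scans the whole group list on every call -> O(#groups)
        for group in self.groups:
            if group.name == groupname:
                return group
        return None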
Solution:
Do not iterate over the list every time; looking groups up in a
hashtable (a Python dict) is O(1) on average, which should reduce the
total time to roughly O(#groups).
Can someone who actually knows Python (and ideally knows what the
inventory code does) check whether I need to invalidate the cache in
more places? Or suggest a better way to implement this?
I'd feel better if someone could give feedback before I send a pull
request for this...
If I'm not the only one who feels Ansible could be a bit faster to
start, and others would look at patches, I'd appreciate feedback on
that too.
(There are about 120k getcwd/stat64/lstat64 system calls during startup
in my setup. 95% of those could easily be avoided with some caching.)
Changes to make inventory parsing faster is definitely awesome, but to the point of stats, we use an inventory plugin that provides a _meta key in the return json. This is where all the host/group facts are stored, and when ansible sees _meta it’ll avoid doing any stats to read groupvars or hostvars files. This was a HUGE win for us, and I recommend it to anybody that has a large set of hosts or groups.
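To make that concrete, a dynamic inventory script returning "_meta"
looks roughly like the minimal sketch below (group names, hosts and
variables are made up for illustration). With "_meta.hostvars" present,
Ansible picks up all host variables in one pass instead of calling the
script again with --host for every single host:

    #!/usr/bin/env python
    # Minimal dynamic inventory sketch that includes "_meta" so Ansible
    # gets all host variables from a single invocation.
    import json

    inventory = {
        "webservers": {
            "hosts": ["10.0.0.1", "10.0.0.2"],
            "vars": {"http_port": 80},
        },
        "_meta": {
            "hostvars": {
                "10.0.0.1": {"ansible_ssh_port": 2222},
                "10.0.0.2": {},
            },
        },
    }

    print(json.dumps(inventory))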
Yeah, with what Jesse says, it sounds like your inventory script is forgetting to return “_meta”, as otherwise I don’t know of any inventory speed issues in 1.7.X.
Thanks, I missed the _meta option so far. But my test setup still calls
stat() on everything despite _meta/hostvars. (1.7.2 and 1.8.)
It also uses the hostvars I passed to it. But it does not seem to work
for groupvars at all - do you use a plugin/patch for that?
My main problem is actually NOT the syscalls, it is get_group(). I
suspect that would still be called even if _meta worked as intended.
And with 20k groups and O(number_of_groups^2) there is a speed issue.
Somehow it was still reasonable in 1.6 (get_group() is unchanged, so
either it was not called as often or something else changed) and has
been slow since 1.7.
The patch https://github.com/hrld/ansible/commit/6f349c4d09c5003f970981b247e99ede12ad0d6c
does nothing about the stat syscalls but speeds up get_group() by using
a hashtable instead of a list to look up groups. (O(n) vs. O(1) per
lookup - with my n being ~20,000 that helps a lot...)
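The core idea is roughly the following (a simplified sketch, not the
patch itself; method and attribute names here are illustrative):

    class Inventory(object):

        def __init__(self):
            self.groups = []
            self._groups_by_name = None  # lazily built name -> Group cache

        def _invalidate_group_cache(self):
            # must be called from every place that adds or removes groups
            self._groups_by_name = None

        def add_group(self, group):
            self.groups.append(group)
            self._invalidate_group_cache()

        def get_group(self, groupname):
            # O(1) average dict lookup instead of scanning self.groups
            if self._groups_by_name is None:
                self._groups_by_name = dict((g.name, g) for g in self.groups)
            return self._groups_by_name.get(groupname)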
Assuming that most users will not have group vars and host vars for
more than 10% of their hosts and groups (or have less than 1000 of
each, so it would not matter), reading the relevant directories once
and keeping a hashtable of the existing filenames should make big
inventories with hostvars/groupvars in files much faster. Is there any
reason not to do that?
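A minimal sketch of that idea (paths and helper names are made up for
illustration):

    import os

    def list_vars_dir(path):
        # one listdir() per directory instead of four lstat()s per name
        try:
            return frozenset(os.listdir(path))
        except OSError:
            return frozenset()  # directory missing: nothing can match

    group_vars_files = list_vars_dir("inventory/group_vars")

    def find_vars_file(name, existing):
        for candidate in (name, name + ".yml", name + ".yaml", name + ".json"):
            if candidate in existing:
                return candidate
        return None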
lstat64 is called on $PATH_OF_INVENTORY/host_vars/$IP.* and
$PATH_OF_INVENTORY/group_vars/$GROUP.*, about 14k calls with
"host_vars" and 48k with "group_vars" in it.
lstat64("ANSIBLE/inventory/group_vars/mygroup", 0xbfc09f5c) = -1 ENOENT
(No such file or directory)
lstat64("ANSIBLE/inventory/group_vars/mygroup.yml", 0xbfc09f5c) = -1
ENOENT (No such file or directory)
lstat64("ANSIBLE/inventory/group_vars/mygroup.yaml", 0xbfc09f5c) = -1
ENOENT (No such file or directory)
lstat64("ANSIBLE/inventory/group_vars/mygroup.json", 0xbfc09f5c) = -1
ENOENT (No such file or directory)
lstat64("ANSIBLE/inventory/group_vars/all", {st_mode=S_IFREG|0664,
st_size=71, ...}) = 0 lstat64("ANSIBLE/inventory/group_vars/all.yml",
0xbfc09f5c) = -1 ENOENT (No such file or directory)
lstat64("ANSIBLE/inventory/group_vars/all.yaml", 0xbfc09f5c) = -1
ENOENT (No such file or directory)
lstat64("ANSIBLE/inventory/group_vars/all.json", 0xbfc09f5c) = -1
ENOENT (No such file or directory)
lstat64("ANSIBLE/inventory/host_vars/127.0.0.1", 0xbfc09f5c) = -1
ENOENT (No such file or directory)
lstat64("ANSIBLE/inventory/host_vars/127.0.0.1.yml", 0xbfc09f5c) = -1
ENOENT (No such file or directory)
lstat64("ANSIBLE/inventory/host_vars/127.0.0.1.yaml", 0xbfc09f5c) = -1
ENOENT (No such file or directory)
lstat64("ANSIBLE/inventory/host_vars/127.0.0.1.json", 0xbfc09f5c) = -1
ENOENT (No such file or directory)
If I understand you and the documentation correctly there should be no
attempts to access host_vars/* at all. Ansible is using the SSH port
from the _meta block, but something must still be wrong here?!
The major difference between 1.6 and 1.7 is:
- Until 1.6, group_vars/host_vars were only read for the hosts in the
run (that is, with a possible --limit applied), *and* they were only
read for new hosts that weren't touched yet at the start of each play.
For big environments this was less optimized, as it re-read every
group_vars file for every host that was a member of the respective
group.
- Starting with 1.7, the whole inventory and *all* host/group_vars are
read at the very start, when initializing the inventory, but each
group_vars file is now only read once.
For very large inventories - like you have - that first initialisation
tends to take a bit longer, but the total time of running the whole
play is shorter.
Basically, from 1.7.2 on, every host_vars and group_vars file should
only get read once, but they *all* are read once. Given your extremely
large numbers of hosts and groups, I can see this might be less
performant than it used to be.
But your issue, as I understand it, is not about reading lots of files,
but about iterating through the groups a host is a member of, by name,
to retrieve the group object, which is what your patch addresses.
Given that a host might be a member of a *very* large set of groups, I
can see your patch being a good optimization. I'd vote you propose a
pull request for it.
Correction: _meta was already in the original inventory. No surprise
that it did not speed things up. If Ansible will always try to read all
the files, there is probably nothing left that can be improved without
patching the inventory-parser code.
Hosts: 3500
Groups: 12000
Host entries in groups: 60000
lstat() calls that could be optimized away by reading the two directories
and caching the list of existing files:
(3500+12000)*4 = 62,000
So if nobody sees a good reason to leave it the way it is, I'll also
write a patch for that, which should get the runtime for a single
ansible-ping down from 25 seconds to about 10 seconds in our case.
(It is 13 seconds with the current patch.)
We designed our setup with an inventory of 3000+ hosts so that groups
are made dynamically, rather than pre-made. The reason is that every
permutation of a property (dmz/internal, prod/uat/int/dev/mgt,
virt/phys, os-version, various roles, app-owner, ...) quickly inflates
the number of groups one may need to have. And not all our playbooks
make actual use of these groups anyway.
So instead of having the inventory script create all the groups in
advance (a huge inventory) and having Ansible import all this data, we
only have it export the hostvars that can be used to build those
dynamic groups (via group_by), and do it dynamically, on an as-needed
basis.
This works well, although it was slow with v1.6 when running for all
3000+ hosts; it drastically improved in v1.7 to only a few seconds.
Here it is 3500 hosts, and startup slowed down from ~10 to 25 seconds
with 1.7. How many is "a few"? I'd like to compare after patching the
other two issues... those might also gain you another 2 seconds, as
they (partially) depend on the number of hosts.
I'll keep the hostvars+group_by solution in mind if we need more groups
in the future. Most of the current groups are useful in many places, so
after patching the current bottleneck it's probably not worth the extra
complexity to first create groups in all playbooks.
Although in the past we seldom ran playbooks on all systems, we now
use Ansible for reporting connectivity issues, and that runs on all
systems regularly to identify network/routing/firewall issues. This
is not a typical environment; we have 7x more VLANs than we have
Linux systems (don't ask).
BTW, the dynamic inventory script is not a real-time process, since it
takes time to get all the information out of vSphere, Satellite,
Infoblox and DNS. We create a cache every hour through cron, rather
than doing it on the fly.
All our inventory data comes from one database, so it is fast and the
cache gets created every minute. Even with a 2-3 second runtime, a
cache makes sense for everyone using Ansible in some interactive way.
Indeed, combine_vars seems to be responsible for this; it can take a long time (16s+ with 453 groups and 132 hosts). I think in most situations the vault feature will not be used, so by adding a test before combine_vars we can save a lot of time in most situations:
This patch probably does not do what you want.
With the patch, host_vars and group_vars files are never touched again.
So it affects not only _vault_password usage but at least everyone with
variables in files. (I did not test hostvars from _meta.)
There are a few things that might be helpful to know when using big
inventories.
1. The official solution seems to be to use Tower or dynamic (small)
inventories (see the cloud docs).
(Both are not an option for me for various reasons, but either there
are no users with many hosts and groups or they are mostly happy with
those.)
2. In theory you can let Ansible merge several inventories (e.g. if you
have two distinct sources of data). DON'T. It will cost a lot of extra
time. (Nobody here wanted to debug that; it was not important.)
3. It would be possible to reduce the number of lstat() calls (down to
less than 1%) with a small patch. After profiling an Ansible startup
sequence I found it was not my main problem, but you might have
different results for your setup.
If you send me the number of hosts and groups and the output of
python -m cProfile /usr/bin/ansible $IP -m ping
I'd like to compare where the time is lost and look into this again.
Maybe someone will even pull patches for this after v2 is released.
It was rebased and merged into 2.0, but if you need it before 2.0 is stable you can just apply it to 1.9.x yourself.
If you already have performance issues with 1.9 you probably want to wait and hope for some patches in 2.0 before you try to use it. Reading the inventory might be faster there now; everything else is much slower.
Testing with 2.0 is a good idea for anyone with a lot of playbooks and roles. You will find the typos in your playbooks that 1.x silently ignored, the options that should be used in the future (e.g. become, which can already be cleaned up now), and changes (e.g. escaping " with fewer backslashes) that you need to roll out the moment you move over to 2.0.
I'm not sure if anyone has already tried to create a list of changed behaviour from 1.x to 2.0. It might be a bit difficult before the developers decide what they consider a bug and which previous ways of doing things were never documented and do not need to be supported.
Cool, thanks for the patch. It didn’t result in the speed-up I was hoping for. I think the big thing for me is the stat per host to check for host vars files. Even with the “_meta” key in my dynamic inventory, the startup time is dominated by lstats for non-existent files. It seems to do 7 lstat calls per host in the inventory even if the parent directory for the constructed path doesn’t exist. I think guarding that call with a stat call on the parent directory, followed by an os.listdir, could get this down from 72,000 calls to ~20 in my case.
Most of the lstat() calls can be removed easily (just read the directory and check the resulting hash to see if you need to deal with the file), but you might be disappointed by the performance gains there. You can just comment out the whole block to see how much it saves; in my case it was under 20% of the performance gain from the get_group patch. System calls are not that expensive compared to what Ansible is doing elsewhere. (If you find a way to cause real I/O operations for each lstat, you can probably patch this while one Ansible instance is busy reading the inventory. Otherwise use cProfile to check what you really want to patch...)
Can someone explain why several JSON libraries are supported in Ansible? Using the fastest one that actually behaves in a sane way (and guarantees unicode strings as return values) seems a no-brainer to me.
Otherwise more tests for the different behaviours are needed. And installing one more Python library seems simple enough.
Immediate benefit: the other patch from the ticket saves another 20% of startup time in useless unicode-string checks.
If your inventory is small (less than a few thousand hosts and groups) you will not notice any real difference.