Bad performance: In Ansible 2.1, playbooks take 5.6 times longer than on 1.9.x

Salvete all,

We have been clinging to 1.9.x ever since 2.0 came out, but we have been planning to move to 2.1 once it comes out.

We had already figured out that Ansible 2.x takes over 20 seconds just to start up and parse our dynamic JSON inventory (7 MB of text, almost 9000 groups, excluding ‘_meta’ hostvars), versus 5 seconds on 1.9. (Please see my PR #13957 on that.)
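For context, the inventory script emits the usual dynamic-inventory JSON layout; here is a heavily shrunken, hypothetical sketch (group and host names are made up, the real output has almost 9000 groups):

#!/usr/bin/env python
# Hypothetical, heavily shrunken stand-in for our dynamic inventory script;
# the real output is ~7 MB of JSON with almost 9000 groups.
import json

inventory = {
    "host.os=Ubuntu_14.04": {"hosts": ["web01", "web02"]},
    "foo_type=bar_baz": {"hosts": ["web01"]},
    "status=productive": {"hosts": ["web01", "web02", "db01"]},
    "_meta": {
        "hostvars": {
            "web01": {"rack": "a1"},
            "web02": {"rack": "b2"},
            "db01": {"rack": "c3"},
        }
    },
}

print(json.dumps(inventory))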

Now we’ve started running stable-2.1 with --check on some of our playbooks with the incompatible changes cleaned up (backslash escapes, etc).

Ansible branch stable-2.1 takes more than 5 times longer to complete than 1.9.4!

The test playbook rolls stuff out on about 150 hosts out of a couple thousand in the inventory. There are a lot of tasks per host, and it is already visually apparent that something is very different from before. Even skips tick by at a snail’s pace (up to 0.5 s per skip). The main ansible-playbook process holds 500 MB of resident memory.

On ansible 1.9.4 this completes in 23 minutes. On 2.1 in 2 hours, 9 minutes.

This is ansible.cfg:

[defaults]
forks = 55
force_color = 1
roles_path = roles
hostfile = inventory
#library = library
retry_files_enabled = False
[ssh_connection]
ssh_args = -o ControlMaster=auto -o ControlPersist=61s -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=yes -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey,keyboard-interactive
pipelining=True

Since people were suggesting it, I tried fact caching with redis as well as the free strategy. Neither did anything significant to the runtime (as expected, since the time is not spent in fact gathering; it is spent forking the main process, or something like that):

ansible 2.1.0.0 (stable-2.1 be28443943)

15125.40user 6640.76system 2:08:29elapsed 282%CPU (0avgtext+0avgdata 3262800maxresident)k
304inputs+1046376outputs (5major+613591630minor)pagefaults 0swaps

ansible 2.1.0.0 (stable-2.1 be28443943) – fact_caching = redis
14469.46user 6497.10system 2:04:38elapsed 280%CPU (0avgtext+0avgdata 3052820maxresident)k
728inputs+1028496outputs (18major+606686383minor)pagefaults 0swaps

ansible 2.1.0.0 (stable-2.1 be28443943) – strategy = free
16394.78user 8026.57system 2:19:29elapsed 291%CPU (0avgtext+0avgdata 1231688maxresident)k
80inputs+1084288outputs (0major+626770671minor)pagefaults 0swaps

We really want to run 2.x, but at this point we are considering maintaining 1.9.x for ourselves with backported modules. I wanted to bring this up on the mailing list first, to see if anyone on the dev team has a grip on what is going on with 2.x at all. What is happening that throttles playbook execution so much? Is this going to improve anytime soon? If I wanted to explore this further, what would be the appropriate tools to profile/benchmark/analyze the problem with our workload? Apparently this has not come up before, otherwise, I assume, 2.0 would not have been released like this. So I wonder what is different in our setup that slows Ansible 2.x to a crawl. I suspect it is again our large inventory, but I’d like some help to figure it out.

cheers,
Tobias

Do you have a lot of vaulted vars?
I installed cryptography from pip (after some messing around meeting cryptography’s dependencies) and that improved speed for me a lot (although I only have a few groups and fairly small inventory) - as mentioned here: http://docs.ansible.com/ansible/playbooks_vault.html#speeding-up-vault-operations

which python version are you using?

Do you have a lot of vaulted vars?

We do not use vault at all. For secret vars we use locally stored persistent facts and a FUSE mounted sshfs.

which python version are you using?

2.7.11

So, I’ve collected some metrics about our setup. I got starved for memory once, and the OOM killer struck, while trying to count our hostvars.

$ nice -n 19 ansible all -m debug -a var=hostvars | pv | wc -l
9.84GiB 0:56:07 [2.99MiB/s] [ <=> ]
291874566

And hostvars relevant to the test playbook only:

$ ansible "foo_type=bar_baz:&host.os=Ubuntu_14.04" -m debug -a var=hostvars | wc -l
11247516

Number of groups:

$ inventory/inventory.foo.pl | jq '. | length'
9089

Total number of hosts in the inventory (with ~140 hosts matching the play):

$ ansible all --list-hosts | wc -l
3843

Number of tasks in the test playbook

$ ansible-playbook playbooks/foo/bar/baz_quux.yml --list-tasks | grep " [a-Z]" | wc -l
561

And this takes 129 minutes vis-a-vis 23 minutes on 1.9.4.

Might be worth using the profile_tasks callback to see where the time is being spent. Is every task just taking 6 times longer, or is it specific tasks?

https://github.com/jlafon/ansible-profile (it’s built into v2.0+ but that page has details of how to set it up)

In the above example of just running debug, it’s not clear whether it’s the inventory processing that is slow, or the fact gathering, or both.

Once you have profile_tasks set up, using a playbook as simple as:

- hosts: "{{hosts}}"

with

time ansible-playbook playbook.yml -e hosts="foo_type=bar_baz:&host.os=Ubuntu_14.04"

will help see where time is spent.

Might be worth using the profile_tasks callback to see where the time is being spent. Is every task just taking 6 times longer, or is it specific tasks?

I don’t think that is going in the right direction, considering that everything is slower, most prominently even skipped tasks. It seems to be a core issue in the execution logic.

Here’s another test result surrounding size of inventory:

              | full inventory | minimal inventory | ratio
              | (9000 groups)  | (1 group + _meta) | (full/min)
--------------+----------------+-------------------+-----------
ansible 2.1   | 12m45s         | 4m59s             | 2.6x
ansible 1.9.4 | 3m34s          | 2m26s             | 1.5x

1 Play, 148 hosts, 22 tasks

Inventory size clearly affects performance strongly, but much more so on Ansible 2.x.

I’ve tried deactivating the dynamic logic that seems to be the main “feature” of 2.x, but adding the corresponding lines to ansible.cfg had no measurable effect.

Congratulations on releasing Ansible 2.1 today. To celebrate, I want to share this slightly amended xkcd.

I’ve looked at cProfile stats for the aforementioned test playbook.

The first line in the stats tells everything. In Ansible 1.9.4 there were 600k Python function calls to complete the playbook. In Ansible 2.1 there were 50 million.
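For reference, here is a minimal sketch of how such stats can be collected and compared; the paths and file names are placeholders, not the exact commands I ran:

# Collect a profile by running the CLI under cProfile, for example:
#   python -m cProfile -o ansible-2.1.prof /usr/bin/ansible-playbook playbook.yml --check
# then inspect the dump with pstats:
import pstats

stats = pstats.Stats("ansible-2.1.prof")
stats.sort_stats("cumulative").print_stats(20)  # top 20 entries by cumulative time
# The summary line at the top of the printed output reports the total number
# of function calls, which is the 600k vs 50 million figure quoted above.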

Hope this will be addressed at some point.

–Tobias

Hey, thanks for this.

Seeing this thread again has reminded me that I was curious about the number of groups you have in relation to the number of hosts. If I read the above right, you have nearly 9000 groups but only ‘a couple of thousand’ hosts, so hosts are members of multiple groups.

I’m interested to know what problems organising groups and hosts in this way solves - or perhaps it’s the only way to deal with whatever you are dealing with. Partly I’m just curious, as you are clearly operating at a scale that I’m nowhere near, and partly I want to see if the collective brains round here might have another way to organise things that could speed stuff up for you.

Jon

Hey, thanks for this.

Seeing this thread again has reminded me that I was curious about the number of groups you have in relation to the number of hosts. If I read the above right, you have nearly 9000 groups but only ‘a couple of thousand’ hosts, so hosts are members of multiple groups.

That’s right. Instead of simple membership in a group like “westcoast”, we encode a lot of meta-information in the inventory in a sort of key=value notation. I described some of this in my issue #13956.

Any particular host is member to quite a few groups. “host.os=FreeBSD” is a group name, as is “status=productive”. Then there are groups related to services, of which several belong to any particular host: “service.type=foo”, “service.country.name=bar_baz_quux”, and so on and so forth.

Lots of groups only contain one host, some contain thousands.
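Roughly speaking, the inventory script derives these group names from per-host metadata. A hypothetical sketch of the idea (the real script is a Perl program and far more involved):

# Hypothetical sketch of deriving key=value style group names from per-host
# metadata; not the actual inventory code.
def groups_for_host(meta):
    return ["%s=%s" % (key, value) for key, value in sorted(meta.items())]

print(groups_for_host({"host.os": "FreeBSD",
                       "status": "productive",
                       "service.type": "foo"}))
# -> ['host.os=FreeBSD', 'service.type=foo', 'status=productive']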

I’m interested to know what problems organising groups and hosts in this way solves - or perhaps it’s the only way to deal with whatever you are dealing with. Partly I’m just curious, as you are clearly operating at a scale that I’m nowhere near, and partly I want to see if the collective brains round here might have another way to organise things that could speed stuff up for you.

The thing is, even with the inventory stripped to the bare minimum, Ansible 2.x is CPU-bound much of the time here. It can’t be that most of the runtime is spent in time.sleep, waiting for tasks to complete on remote hosts, as it was in 1.9.x. In 2.x it is calling 85 times as many functions as before to do the same job.

More crazy profiling results.

Ansible devel, 143 matching hosts, some 15 tasks. Inventory:

$ inventory/inventory.xxx.pl | jq '._meta.hostvars | length'

3845 # hosts

$ inventory/inventory.xxx.pl | jq '._meta.hostvars[] | length' | numsum

88285 # sum of hostvars

$ inventory/inventory.xxx.pl | jq '. | length'

8994 # groups

$ kernprof -l -o ~/2.2.0.kern.out /home/towolf/src/local/ansible/bin/ansible-playbook o-test.yml --check

$ python -m line_profiler 2.2.0.kern.out
Timer unit: 1e-06 s

Total time: 485.37 s
File: /home/towolf/src/local/ansible/lib/ansible/vars/__init__.py
Function: _get_magic_variables at line 374

Line # Hits Time Per Hit % Time Line Contents
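For anyone who wants to reproduce this kind of measurement: kernprof -l only line-profiles functions marked with the profile decorator it injects into builtins, so the function of interest has to be decorated temporarily. Here is a minimal, self-contained stand-in (not Ansible code):

# demo_profile.py -- stand-in example for kernprof/line_profiler usage.
# `kernprof -l` injects a `profile` decorator into builtins; the fallback
# below lets the script also run without kernprof.
try:
    profile
except NameError:
    def profile(func):
        return func

@profile
def build_groups_mapping(groups):
    # Stand-in for the kind of per-host work measured above.
    return {name: [h for h in hosts] for name, hosts in groups.items()}

if __name__ == "__main__":
    demo = {"group%d" % i: ["host%d" % j for j in range(10)] for i in range(100)}
    for _ in range(1000):
        build_groups_mapping(demo)

Run it with "kernprof -l -o demo.kern.out demo_profile.py" and view the per-line timings with "python -m line_profiler demo.kern.out", analogous to the output above.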

Some work was recently started to look into performance issues around inventory that may provide improvements. See https://github.com/ansible/ansible/compare/fix_inv_vars_loading

That looks interesting and seems to expand on my earlier workaround for reducing excessive repetitive work (including hammering the file system) at the inventory-parsing stage. I’m going to try it on Monday.

However there seem to be many more places where stuff is packed and unpacked and merged over and over again needlessly.

First, when the inventory is read, the groups and their members are evaluated and every member host is packed into a Host object. Then later, for every host, get_vars() iterates over the global list of groups again to generate the ‘groups’ magic variable. So each group is unpacked, the Host objects are looked at, and their names are extracted in that list comprehension. And we do this over and over again, for a hash that should be global and unique!
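To illustrate the pattern, here is a simplified sketch (illustrative names, not the real Ansible internals): the group-to-hostnames mapping is rebuilt from the Host objects inside the per-host loop, when it could be built once and reused.

# Simplified sketch of the pattern described above; class and function names
# are illustrative, not the actual Ansible code.

class Host(object):
    def __init__(self, name):
        self.name = name

groups = {"group%d" % i: [Host("host%d" % j) for j in range(50)]
          for i in range(9000)}

# Slow pattern: the global mapping is recomputed for every single host.
def get_vars_slow(hostname):
    magic = {name: [h.name for h in members] for name, members in groups.items()}
    return {"groups": magic, "inventory_hostname": hostname}

# Faster: compute the mapping once and hand out the same object every time.
GROUPS_CACHE = {name: [h.name for h in members] for name, members in groups.items()}

def get_vars_fast(hostname):
    return {"groups": GROUPS_CACHE, "inventory_hostname": hostname}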

And on top of that, every hostvar in the inventory is run through templating, even though it is merely static key-value information coming straight from the inventory.

I’ve submitted a pull request for this issue. Please review: https://github.com/ansible/ansible/pull/16204

The line_profiler total inside that function went from 480 seconds to 15 seconds.

In the new profile map, VariableManager._get_magic_variables has shrunk into a corner.

Also, the per-call cost of posix.fork went from 3 ms to 12 ms, and it is executed 3 times more often. This means we now spend 33 seconds in posix.fork, compared to 3.5 seconds in 1.9.4.
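One plausible contributor to the higher per-fork cost is the much larger parent process (around 3 GB maxresident in the time output above). A rough, illustrative sketch of how the parent’s footprint shows up in fork time (numbers will vary by machine; this is not how Ansible forks its workers):

# Rough sketch: time os.fork() from a small parent, then again after
# allocating ballast, to see how the parent's memory footprint affects
# fork cost. Illustrative only.
import os
import time

def avg_fork_ms(iterations=50):
    total = 0.0
    for _ in range(iterations):
        start = time.time()
        pid = os.fork()
        if pid == 0:
            os._exit(0)          # child does nothing and exits
        os.waitpid(pid, 0)
        total += time.time() - start
    return total / iterations * 1000

print("small parent:   %.2f ms per fork" % avg_fork_ms())
ballast = ["x" * 1024 for _ in range(500000)]   # roughly 0.5 GB of objects
print("bloated parent: %.2f ms per fork" % avg_fork_ms())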

Next, I’m going to submit pull requests for cases where a set makes more sense than a list, i.e., repeated membership lookups with few additions and unique contents.
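A tiny illustration of the difference for that access pattern (hypothetical host names; absolute timings depend on the machine):

# Membership tests: a list scans linearly, a set hashes. With thousands of
# hosts checked over and over, the difference adds up.
import timeit

setup = "hosts = ['host%d' % i for i in range(4000)]; hosts_set = set(hosts)"
print(timeit.timeit("'host3999' in hosts", setup=setup, number=10000))      # O(n) per lookup
print(timeit.timeit("'host3999' in hosts_set", setup=setup, number=10000))  # O(1) per lookup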

Hi Tobias,

Have you noticed any improvement on this front? I'm still on 1.9.4 because of these issues.

Thanks,
Igor

Hi Tobias,

Have you noticed any improvement on this front? I’m still on 1.9.4 because of these issues.

Today I ran a new, longer benchmark, which is more relevant to daily work.

An “apply common stuff” playbook with --check over 164 hosts.

Ansible 1.9.4: 9 Minutes
Ansible 2.1-stable branch: 51 Minutes !
Ansible devel: stuck somewhere in the middle, no progress two hours later, but using CPU nonetheless.

So, we are all still on 1.9.x. There’s still the aforementioned pull request open, which improves devel drastically for us. But it’s still more than 2 times slower.

Bloody RedHell … I mean RedHat :slight_smile:

Just to give closure to this chapter: after merging my PR we are back in more manageable territory, runtime-wise:

An “apply common stuff” playbook with --check over 164 hosts.
Ansible 1.9.4: 9 Minutes
Ansible 2.1-stable branch: 51 Minutes !
Ansible devel: stuck somewhere in the middle, no progress two hours later, but using CPU nonetheless.

Ansible devel as of yesterday: 42 minutes

Ansible devel as of today: 16 minutes

Tobias, thanks for all your work. I can’t believe such a serious issue has been ignored for so long. As a configuration manager, Ansible is aimed at managing hundreds of servers at the same time, and such poor performance is just unacceptable. Taking an hour to apply changes to 100 servers, what a joke.

It’s good to keep this thread on the surface though, so people are aware of the issues that were introduced after version 1.9.4 and maybe, just maybe, push for a faster resolution in the production release.

In the end, if you could leave a note here when you move to version 2.x, it would be very much appreciated.

Thanks,
Igor