Ansible scaling and optimization/best practices

Hi :slight_smile:

I am using Ansible to deploy a generic configuration on a very large network.
This network has more than 1000 hosts.

However, I am facing serious scaling issues with templates. Some just take too much time to render.

My inventory structure is very simple. In /etc/ansible/inventories I have a YAML file with all my servers and, for each server, its specific parameters (ip, mac, etc.).
Then, in group_vars/all, I have the global variables. Some servers are in one group, some are in two; it depends. This is done using the children key in the hosts file.

When rendering templates with nested Jinja for loops, Ansible uses 100% CPU on the host and just takes way too much time to render the file. From my tests, rendering time grows roughly as t = exp(nb_servers) for some of my templates…

So I have multiple questions:

  1. Is access to hostvars[myhost] much slower than direct access to a global variable? For example, if I only use direct access for all variables in group_vars/all, is it faster?

  2. When the hostvars of the same host are used multiple times in a loop, would it be a good idea to load them once into a variable and access them through that variable, then move on to the next host (host loop), load its hostvars, and so on? (I don’t know if this is well explained…)

  3. Is there a way to access group_vars variables without using the hack hostvars[groups['webservers'][0]]?

  4. Is there a general way to optimize inventories for this kind of very large network? Is it possible, for example, to use groups to reduce hostvars access time?
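
Questions 2 and 3 can be sketched as follows. This is only an illustration: `ip` and `mac` stand for the per-host parameters from the inventory file, while the `webservers` variable name `http_port` and the output layout are hypothetical.

```jinja
{# Question 2, variant A: look up hostvars[host] for every value #}
{% for host in groups['all'] %}
{{ hostvars[host]['ip'] }} {{ hostvars[host]['mac'] }} {{ host }}
{% endfor %}

{# Question 2, variant B: load the host's variables once per iteration, then reuse them #}
{% for host in groups['all'] %}
{% set hv = hostvars[host] %}
{{ hv['ip'] }} {{ hv['mac'] }} {{ host }}
{% endfor %}

{# Question 3: the hack, reading a group variable through the first member of the group #}
{{ hostvars[groups['webservers'][0]]['http_port'] }}
```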

I know I could split into multiple inventories, but some of my templates need access to all servers’ data to render…

Many questions, my apologies for such a big message.

System: CentOS/RHEL 7.4, Ansible 2.4.

With my best regards

Ox

I’ve never had to template to this many hosts, but I’m curious: what are you setting forks to?

If it’s the default, 5, then you can probably raise it a lot and instantly get much more parallel processing.

If you have it set to 1000, I’d try scaling it down in case you are asking too much of the resources on your Ansible host.

I may be out of date about this, but I think that once the group vars and host vars have been processed, each host in effect has its own set of variables that apply to it. I’d hope that would be stored in some kind of dict, which should give relatively constant-time access to the variables associated with a host. However, it’s probably more complicated than that.

I hope experimenting with forks and examining the use of system resources bears fruit. I know others have managed thousands of hosts, but I don’t know if any special steps were taken to optimise inventory or variables.

All the best,

jon

Dear Jon,

Many thanks for your tips!

I was indeed using a low forks value, and increasing it sped up the update of generic nodes.

But my real issue appears even with only one host to deploy to. For example, when deploying my DNS configuration on my DNS servers, with all my hosts to render in templates, it takes forever. Same issue if I want to generate an /etc/hosts file for a very large network without DNS (because I will have to render this file on each host). :frowning:

For example: https://github.com/oxedions/banquise/blob/banquise_1.1/salt/dns/reverse.jinja
It’s a Salt template, but very similar (think of pillars.get as hostvars) :slight_smile:

And this 1000-node network is a small one; I need to be able to deploy to much larger ones.

I did some tests:

  • When storing hostvars[myhost] once in a variable and reusing that variable instead of calling hostvars[myhost] each time, I only get a few percent of performance increase. So you were right: it seems that once rendered a first time, hostvars are kept in memory in some form.

  • With aggressive Jinja code optimizations (like when writing C code for a very old CPU), I could also gain a few percent, but no significant breakthrough.

  • Using groups to isolate some hosts, and so reduce the number of hostvars calls, significantly reduces rendering time. This could also work to distribute the load across multiple cores; I will investigate that.

  • With a hack to access the host variables that are in the hosts file itself (I made a symbolic link to group_vars/all and then accessed these values directly, without hostvars), rendering time is again very small (a few seconds compared to multiple minutes).
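
The group-based test (third point) might look like this in a template. It is a sketch rather than the exact template used in the tests: the group name `dns_zone_a` is hypothetical, and `ip` is assumed to be a per-host variable from the inventory.

```jinja
{# Iterate only over the members of one group instead of groups['all'],
   so hostvars is resolved only for the hosts this file really needs. #}
{% for host in groups['dns_zone_a'] %}
{% set hv = hostvars[host] %}
{{ host }} IN A {{ hv['ip'] }}
{% endfor %}
```

Each group could then be rendered into its own file or fragment, which is what would make it possible to spread the work across several cores.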

So the first hostvars calls are what take so much time. I can understand that, considering that Ansible has to aggregate many directories and variables to build the hostvars of each host.

Using direct access to variables is an interesting hack, but you lose something very nice: the ability to use variables to construct a path in the inventories. Indeed, using hostvars[myhost]['somethingstatic'][myvariable] I can look up a value that depends on myvariable. But using direct access, myhost.somethingstatic.(what to put here???), I failed to find a way to obtain this dynamic lookup (same issue with Salt by the way: outside of pillars.get you lose dynamic variable paths and you have to use dirty hacks).

I am pretty sure there are ways to use Ansible to render very large templates and so scale to very large infrastructures (I mean, without cheating with scripts to render very large files). Maybe my inventory is just badly designed.
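
For what it is worth, Jinja2 accepts subscript notation on any dictionary, not only on hostvars, so a dynamic key may still be usable with direct access. A minimal sketch, not verified against this inventory, assuming the directly loaded data is exposed as a hypothetical top-level dictionary called `servers`, keyed by host name:

```jinja
{# hostvars form, with a key held in a variable #}
{{ hostvars[myhost]['somethingstatic'][myvariable] }}

{# direct-access form: dots for fixed keys, brackets for keys held in variables #}
{{ servers[myhost].somethingstatic[myvariable] }}
```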

I am open to any ideas :slight_smile:

With my best regards

Ben