Speeding up very large playbooks

Hi,

We have a scenario where we pre-generate a lot of configuration on the controller host (targets are generally network devices, so not capable of running python).
The typical generation process involves pulling data from a number of different systems (over APIs) and running some local modules - mostly ‘template’. A typical run for a single device can involve 500-700 tasks.

A single device can be done in under 5 minutes in most cases (including deployment). All the generation already runs with “strategy: free”. Once we go to bigger deployments - 10-15 devices - the time needed gets significantly longer (40-50 minutes is not unusual). We tried throwing CPUs at the problem, but only one or two CPUs ever get to 100% whilst the rest sit near-idle (regardless of the “forks” value). There’s plenty of RAM too (utilisation hardly ever goes over 3-4GB).
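For reference, the relevant settings look roughly like this (the values are illustrative rather than our exact ones):

    # ansible.cfg
    [defaults]
    forks = 50            # parallelism across hosts, not across tasks

    # playbook header
    - hosts: network_devices
      connection: local   # everything runs on the controller
      strategy: free      # each host moves through its tasks independently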

Is there a way for Ansible to utilise all the CPUs? I realise this might not be a typical case, but we’re looking now at deployments that have 30-40 devices and waiting 3h for completion is not something we’d want to see.

kind regards
Pshem

Similar issue here (hopefully not hijacking the topic). I have a small-ish playbook that runs a handful of roles that distribute files, set passwords, cron jobs, etc. On a single host it runs in about 35 seconds. With Ansible 2.4 I could run it across about 1000 hosts and it would complete in just under 30 minutes. With Ansible 2.5 it takes 56 minutes, and with Ansible 2.6 about 63 minutes. A longer playbook that we run nightly across the same number of hosts took 3 and a half hours with Ansible 2.4 and now takes 8 and a half hours with 2.6.

We’ve resorted to tricks to speed things up, such as running the playbook in a wrapper that cuts the host list up into groups of 100 per playbook run (running the big playbook with the free strategy on 1000 hosts is a good way to exhaust all of the memory on our Ansible server).
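A minimal sketch of that kind of wrapper, assuming the host list lives in a plain file with one host per line (file names are illustrative):

    # split hosts.txt into chunks of 100 and run the playbook per chunk
    split -l 100 hosts.txt chunk_
    for chunk in chunk_*; do
        ansible-playbook site.yml --limit "$(paste -sd, "$chunk")"
    done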

Tips to identify the bottlenecks would be great.

Regarding utilising all the CPUs: what I do is run multiple ansible-playbook processes against different hosts at the same time, so more cores and memory get used.
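i.e. something along these lines (a sketch with error handling omitted):

    # one ansible-playbook process per host, run in parallel
    for host in host1 host2 host3; do
        ansible-playbook deploy.yml --limit "$host" &
    done
    wait   # block until every background run has finished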

You should check out Mitogen; some workloads manage to run up to 7 times faster.
https://mitogen.readthedocs.io/en/stable/ansible.html

Hi,

Just to note that Mitogen scales quite poorly with the number of hosts just now, due to a single-CPU bottleneck regardless of the forks setting. There should be a new development branch by the end of the week to resolve this; it can already fully saturate (100% utilization) 8 cores given 100 targets, with no loss of speed-up over the single-host case.

If you are struggling with a larger install and willing to guinea pig some new code (potentially crashy, but not dangerously so!), please drop me a reply off-list.

David
dw@botanicus.net

Even with Mitogen’s current limitations, it cut run times roughly in half for our “do everything to all the hosts” playbook, which hits about 200 hosts with 1600 tasks in our biggest environment.

Just in case you have not found this resource, try https://docs.ansible.com/ansible/2.5/user_guide/playbooks_async.html
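In short, async lets you kick off a long-running task without blocking and collect the result later; a minimal sketch of the pattern from that page (the command and timings are illustrative):

    - name: start a long-running job without waiting for it
      command: /usr/local/bin/long_job   # placeholder command
      async: 600      # allow it up to 10 minutes
      poll: 0         # do not wait here
      register: job

    - name: collect the result later
      async_status:
        jid: "{{ job.ansible_job_id }}"
      register: result
      until: result.finished
      retries: 30
      delay: 10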

Thank you for all the responses so far.

I had a look at Mitogen, but in our setup almost all modules run on the controller (templates, lookups etc.); in fact we hardly use SSH at all (targets are usually configured over XML-RPC or REST APIs), so I’m not sure how much it’s going to help here. I’ll run some evaluations later today.

I also looked at running multiple instances of ansible-playbook (for example one per target device). The biggest challenge is that, for us, if the deployment fails on one device we have to stop the deployment on all the others and roll them back (using a separate role). We’re deploying configuration (for example for an L3VPN) across a number of devices, and if the deployment fails on one of them we have to roll back the lot. That’s relatively easy to do with only one ansible-playbook running, but I couldn’t make it work reliably with multiple ones. That’s also the reason ‘async’ is not going to work for us.
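A rough sketch of the kind of single-run arrangement that works (the role names below are placeholders, not our real ones):

    # first play: every device attempts the deployment, failures are recorded
    - hosts: vpn_devices
      tasks:
        - block:
            - include_role:
                name: deploy_l3vpn        # placeholder role name
          rescue:
            - set_fact:
                deploy_failed: true

    # second play: if any device failed, roll back on all of them
    - hosts: vpn_devices
      tasks:
        - include_role:
            name: rollback_l3vpn          # placeholder role name
          when: groups['vpn_devices'] | map('extract', hostvars)
                | selectattr('deploy_failed', 'defined') | list | length > 0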

I think for now we’ll have to optimise the playbooks further. It looks like the following areas take the most time:

  • include_* statements
  • single ‘template’ operations (consolidate the templates - at the moment a single device config can come from many templates that are first populated, then consolidated; see the sketch below)
  • loops - reduce their number, particularly those with an include inside them
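On the template consolidation point, the idea is to let Jinja assemble the fragments in a single pass instead of running one ‘template’ task per fragment (file names are illustrative):

    {# device_config.j2 - one top-level template pulling in the fragments #}
    {% include 'interfaces.j2' %}
    {% include 'routing.j2' %}
    {% include 'l3vpn.j2' %}

and a single task then renders the whole configuration:

    - name: build the full device configuration in one pass
      template:
        src: device_config.j2
        dest: "{{ build_dir }}/{{ inventory_hostname }}.cfg"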

Any further suggestions are appreciated.

kind regards
Pshem

Also, consider pre-generating the configuration you need so there’s less need to template at runtime.
As well as (hopefully) speeding things up a bit, you can examine or even validate the configuration you are intending to apply before actually applying it to your devices.
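i.e. render everything into a build directory first and only push the finished files in a later play, along these lines (paths and names are illustrative):

    # stage 1: render every device's config locally; no device is touched yet
    - hosts: network_devices
      connection: local
      tasks:
        - template:
            src: device_config.j2
            dest: "builds/{{ inventory_hostname }}.cfg"

    # between the stages, the files under builds/ can be diffed, reviewed
    # or run through a validator, before a second play deploys them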
Hope this helps,
Jon