We’re evaluating Ansible and other config management tools. I have two issues on which I would like input from others:
if you have to change SSH keys, what’s the best way to do that across tens of thousands of machines?
if you have tens of thousands of servers under Ansible management, how do you scale this to run across them all quickly? Ideally, I want to be able to run a playbook across several thousand systems at once (assuming the playbooks will not be downloading additional packages from other hosts). It would be great if Ansible could have multiple controlling hosts, but I don’t think this is a feature.
There are various key management tools, and each has its purpose. On CoreOS, for example, you could use cloud-config/cloud-init. You can also use Ansible's raw module to install the keys as part of a bootstrap step.
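For the key-rotation question itself, a minimal sketch of how this might look with Ansible's authorized_key module (the user name, key file paths, and host group are assumptions, not part of the original thread):

```yaml
# Hypothetical key-rotation play: push the new key everywhere first,
# then remove the old one once you've verified the new key works.
- hosts: all
  become: true
  tasks:
    - name: Install the new SSH public key
      ansible.posix.authorized_key:
        user: deploy                      # assumed user name
        key: "{{ lookup('file', 'keys/id_ed25519_new.pub') }}"
        state: present

    - name: Remove the old SSH public key
      ansible.posix.authorized_key:
        user: deploy
        key: "{{ lookup('file', 'keys/id_ed25519_old.pub') }}"
        state: absent
```

The two-phase approach (add everywhere, verify, then remove) avoids locking yourself out if a batch fails partway through.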
Ansible is used at large sites to control great numbers of hosts. I recall several talks from Rackspace a few years back citing runs of a single playbook across thousands of hosts. If you are looking at that scale, batching the work will help control the impact of the playbooks.
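Batching can be expressed directly in a playbook with the `serial` keyword, combined with a higher `forks` setting in ansible.cfg to raise the number of parallel worker processes on the control node (the numbers below are illustrative, not recommendations):

```yaml
# ansible.cfg (on the control node):
#   [defaults]
#   forks = 100        # default is 5; raise to run more hosts in parallel
#
# Playbook sketch — roll through the fleet in batches:
- hosts: webservers          # assumed group name
  serial: 500                # process 500 hosts per batch
  max_fail_percentage: 10    # abort if more than 10% of a batch fails
  tasks:
    - name: Ensure ntp is installed
      ansible.builtin.package:
        name: ntp
        state: present
```

Each batch runs the play to completion before the next batch starts, which keeps the load on both the control node and any shared infrastructure bounded.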
Ansible is a great tool and I hope it fits your needs.
Thanks for the comments. I would be interested to see how others scale out the control node(s). Obviously you can run the playbooks in batches, but this could still take a very long time to execute across tens of thousands of hosts. Plus, if the batch is too large it would overwhelm the control node. Would be nice to see how others are solving this problem.
If you are actually managing tens of thousands of hosts, you're
probably dealing with other issues that would make it worth your while
to consider buying Ansible Tower.
"""Asynchronous Actions and Polling
By default tasks in playbooks block, meaning the connections stay open
until the task is done on each node. This may not always be desirable, or
you may be running operations that take longer than the SSH timeout.
The easiest way to do this is to kick them off all at once and then poll
until they are done.
You will also want to use asynchronous mode on very long running operations
that might be subject to timeout."""
I read about that briefly yesterday. Thanks. Will need to read up more about this mode to see how the coordination works. I guess you just keep polling at the end of all the batches?