ssh performance tuning

Hi all,

I love ansible – except it’s very, very, very slow for me. I’m wondering if my ssh settings are to blame. What do people advise for ansible w/r/t ControlMaster and other ssh settings?

As with all things on the internet, questions are best when qualified with real examples, numbers, and use cases. How slow?

Mine’s pretty zippy on stock CentOS 6.2 everywhere (control machine + nodes).

If you are using sudo mode, I’ll say that, by necessity of sudo+paramiko (don’t ask), it must close and reopen connections when switching between exec and file transfer mode. Part of the whole refactoring push in 0.5 will enable batching transfers together in one operation (I hope, using tar), which should cause a lot less hopping around. However, it is still much faster to log in as root.
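A rough illustration of the batching idea: instead of one transfer per file, several small files are packed into a single tar archive that could cross the connection in one operation. This is a minimal sketch using Python's standard `tarfile` module, not Ansible's actual 0.5 code; `bundle_files` and `unbundle_files` are hypothetical names.

```python
from __future__ import annotations

import io
import tarfile


def bundle_files(files: dict[str, bytes]) -> bytes:
    """Pack several small files into one gzipped tar archive held in memory.

    Hypothetical helper: one archive means one transfer instead of one
    round trip (or reconnect) per file.
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    # Closing the TarFile flushes the gzip stream but leaves buf open.
    return buf.getvalue()


def unbundle_files(archive: bytes) -> dict[str, bytes]:
    """Reverse of bundle_files, as the remote side would do after the transfer."""
    out: dict[str, bytes] = {}
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:gz") as tar:
        for member in tar.getmembers():
            if member.isfile():
                out[member.name] = tar.extractfile(member).read()
    return out
```

The round trip is lossless for regular files, so the remote side only needs one unpack step per batch.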

If you are using Ansible to push ISOs, also obviously don’t do that :)

–Michael

I’ve since killed all my test nodes, so I can’t rerun the numbers, but we’re talking about 30 seconds per action.

re: Root login – I’ll give that a shot, but it waters down the security I’d set up. I have a playbook called bootstrap.yml that appends ansible_rsa.pub to the ~/.ssh/authorized_keys of a new user named ‘ansibler’ on the target instance, in group ‘wheel’. Then I make sure that sudoers is set to allow NOPASSWD for wheel, and run everything else I do with ansible with user:ansibler and sudo:True. This way, all ansible activity is logged, and I can specifically revoke the key (and terminate ansible’s control) if either (a) security is compromised or (b) I hand over control of an instance to a client who wants to go it alone. (It’s very important to us that there be no lock-in and that the strings be both easy to cut and highly visible.)
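The key-management half of that bootstrap can be sketched in Python, assuming the control side already has the public key text and can write to the target user's home directory. `authorize_key` and `revoke_key` are hypothetical helpers, not part of Ansible; they mainly show why the setup is easy to undo, since revoking access is just deleting one line.

```python
from __future__ import annotations

import os
from pathlib import Path


def authorize_key(home: Path, pubkey: str) -> bool:
    """Append pubkey to home/.ssh/authorized_keys unless it is already there.

    Returns True if the file changed. Permissions follow the usual sshd
    expectations (0700 directory, 0600 file). Ownership is NOT handled:
    when run as root you would still need to chown to the target user.
    """
    ssh_dir = home / ".ssh"
    ssh_dir.mkdir(mode=0o700, exist_ok=True)
    os.chmod(ssh_dir, 0o700)  # mkdir's mode is subject to the umask
    auth = ssh_dir / "authorized_keys"
    key = pubkey.strip()
    text = auth.read_text() if auth.exists() else ""
    if key in (line.strip() for line in text.splitlines()):
        return False  # already authorized: running twice changes nothing
    if text and not text.endswith("\n"):
        text += "\n"
    auth.write_text(text + key + "\n")
    os.chmod(auth, 0o600)
    return True


def revoke_key(home: Path, pubkey: str) -> bool:
    """Remove every line matching pubkey; returns True if anything was removed."""
    auth = home / ".ssh" / "authorized_keys"
    if not auth.exists():
        return False
    key = pubkey.strip()
    lines = auth.read_text().splitlines()
    kept = [line for line in lines if line.strip() != key]
    if len(kept) == len(lines):
        return False
    auth.write_text("".join(line + "\n" for line in kept))
    return True
```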

Ultimately, if you’re debugging connection speed, I’d recommend adding some debug output to connection.py that prints timestamps, or something like that. That was useful when helping someone else look through similar issues in the past.
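One low-tech way to add those timestamps is a small context manager wrapped around the suspect calls. `timed` is a hypothetical helper, and since the actual method names inside connection.py aren't quoted here, the usage comment uses placeholders.

```python
from __future__ import annotations

import sys
import time
from contextlib import contextmanager


@contextmanager
def timed(label: str, log=sys.stderr):
    """Print wall-clock time and elapsed seconds for the wrapped block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f"[{time.strftime('%H:%M:%S')}] {label}: {elapsed:.3f}s", file=log)


# Placeholder usage inside a connection class (names are hypothetical):
#   with timed("connect"):
#       self._connect(host)
#   with timed("put_file"):
#       self._put(local_path, remote_path)
```

Wrapping connect, exec, and file transfer separately shows quickly whether the time goes into the ssh handshake or into the work itself.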

30 seconds per action is rather long.

I’d also make sure you are using the development branch, if you aren’t already, especially if running in sudo mode.

We’ll soon have “all activity logged” in 10.4 using syslog, with all modules logging themselves, so you may be able to drop that part of your setup. Even so, I’d expect it to be a lot faster than 30 seconds, unless we’re talking about overseas through 3 layers of VPN :)

That's a pretty neat setup, security-wise. I use sudo as well, and
get about one (small copy) task per two seconds to a single machine.
With -f32 to 32 machines, I get one task per ~5 seconds. I do not see
any real difference between sudo and non-sudo times. This is with the
development version, which should have very little overhead for sudo
use. Are you using the development version? How many machines and
how many forks are you using?
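Putting John's numbers side by side shows what forking buys. A quick check of the arithmetic, using only the figures quoted above (one task per ~2 s on one machine, one task per ~5 s across 32 machines with -f32); `host_tasks_per_second` is just an illustrative name.

```python
def host_tasks_per_second(hosts: int, seconds_per_task: float) -> float:
    """Host-tasks completed per second when one task runs on `hosts` machines at once."""
    return hosts / seconds_per_task


single = host_tasks_per_second(1, 2.0)    # one machine, ~2 s per task
forked = host_tasks_per_second(32, 5.0)   # -f32 across 32 machines, ~5 s per task
print(f"single: {single:.1f}/s, forked: {forked:.1f}/s, speedup: {forked / single:.1f}x")
# prints "single: 0.5/s, forked: 6.4/s, speedup: 12.8x"
```

So each individual task gets slower under load, but total throughput still rises by roughly an order of magnitude.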

-John

Just a friendly reminder: people should really check out and test ansible-pull.

I’d like more comments on it (probably in a new thread).

You can even modify the ansible-pull setup playbook to run ansible-pull via regular ansible at the end, versus using cron, if you like.

The only requirement is that you would have to package any modules you use beyond those in the core and make them available via yum/apt (or otherwise push them).
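Pull mode boils down to each node refreshing a checkout and running a playbook against itself. The sketch below assumes a git repository containing a playbook named local.yml, and uses flags (`-c local`, an inline `-i "localhost,"` inventory) from current Ansible releases, which may differ from the 2012 ansible-pull script; `pull_commands` and `pull_once` are hypothetical names.

```python
from __future__ import annotations

import os
import subprocess


def pull_commands(repo_url: str, checkout_dir: str,
                  playbook: str = "local.yml") -> list[list[str]]:
    """Commands for one pull-mode run: refresh a git checkout, then apply a
    playbook to this machine only, with no inbound ssh involved."""
    if os.path.isdir(os.path.join(checkout_dir, ".git")):
        # Existing checkout: fast-forward only (git -C needs git 1.8.5+).
        update = ["git", "-C", checkout_dir, "pull", "--ff-only"]
    else:
        # First run: clone the repository.
        update = ["git", "clone", repo_url, checkout_dir]
    apply = [
        "ansible-playbook",
        "-c", "local",        # local connection: no ssh at all
        "-i", "localhost,",   # inline inventory containing only this host
        os.path.join(checkout_dir, playbook),
    ]
    return [update, apply]


def pull_once(repo_url: str, checkout_dir: str, playbook: str = "local.yml") -> None:
    """Run the commands in order, stopping at the first failure."""
    for cmd in pull_commands(repo_url, checkout_dir, playbook):
        subprocess.run(cmd, check=True)
```

Keeping command construction separate from execution makes it easy to log or dry-run what a node would do before wiring `pull_once` into cron or a final ansible task.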

–Michael

Hi Elizabeth,

I had a lot of problems in this regard, and finally (in my case) nailed the problem down to my executable inventory script. If that’s not your problem, ControlMaster isn’t going to help (Paramiko doesn’t use it). Other possible culprits on the host side are faulty LDAP or reverse DNS lookups. On the Ansible side, you’ll soon see some speedups from the refactoring planned for 0.5. Those won’t help much, though, if your ssh setup is not correct.
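For anyone driving the OpenSSH client rather than Paramiko, connection reuse is configured with the ControlMaster, ControlPath, and ControlPersist options. A minimal sketch that builds those `-o` arguments, assuming OpenSSH 5.6 or newer (needed for ControlPersist) and that the control directory already exists; `ssh_multiplex_args` is a hypothetical helper and the 60-second persistence is an arbitrary example value.

```python
from __future__ import annotations

import os


def ssh_multiplex_args(control_dir: str = "~/.ssh/cm", persist: str = "60s") -> list[str]:
    """OpenSSH -o options that share one master connection between sessions.

    Only the OpenSSH client honours these; Paramiko does not use them.
    """
    # %r, %h and %p expand to remote user, host and port: one socket per target.
    path = os.path.join(os.path.expanduser(control_dir), "%r@%h:%p")
    return [
        "-o", "ControlMaster=auto",          # start a master if none exists, else reuse it
        "-o", f"ControlPersist={persist}",   # keep the master open after the last session ends
        "-o", f"ControlPath={path}",
    ]


# Example with a placeholder host:
#   ["ssh", *ssh_multiplex_args(), "you@host", "w"]
```

Only the first connection to a host pays the full handshake; later ones within the persist window reuse the open socket.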

To test your ssh connection speed, check how long it takes to run this: time ssh you@host w

If that’s under a few seconds, then I don’t think ssh is your problem.
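The same check can be scripted if you want to repeat it across several hosts. This assumes the `ssh` binary is on PATH and keys are already set up; `time_command` is a hypothetical helper that mirrors the shell's `time`.

```python
from __future__ import annotations

import subprocess
import time


def time_command(cmd: list[str]) -> float:
    """Run cmd once and return elapsed wall-clock seconds (like shell `time`).

    Raises subprocess.CalledProcessError if the command fails, so a broken
    login is not mistaken for a fast one.
    """
    start = time.perf_counter()
    subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
    return time.perf_counter() - start


# Example with a placeholder user and host:
#   elapsed = time_command(["ssh", "you@host", "w"])
#   print(f"ssh round trip: {elapsed:.2f}s")
```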

Are you using an executable inventory file?
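For context, an executable inventory is a script Ansible runs instead of reading a static hosts file: it is called with `--list` for the groups and with `--host <name>` for per-host variables, so any slow lookup inside it is paid on every call. A minimal sketch with hypothetical hostnames and groups, using the group-to-`hosts` JSON shape that current Ansible accepts:

```python
#!/usr/bin/env python3
from __future__ import annotations

import json
import sys

# Hypothetical static data; a real script might query a CMDB or cloud API,
# which is exactly where slowness tends to creep in.
GROUPS = {
    "webservers": {"hosts": ["web1.example.com", "web2.example.com"]},
    "dbservers": {"hosts": ["db1.example.com"]},
}
HOSTVARS = {"db1.example.com": {"db_port": 5432}}


def inventory(argv: list[str]) -> dict:
    """Return the JSON-able answer for the given command-line arguments."""
    if len(argv) >= 2 and argv[0] == "--host":
        return HOSTVARS.get(argv[1], {})
    # Ansible calls the script with --list; treat anything else the same way.
    return GROUPS


if __name__ == "__main__":
    json.dump(inventory(sys.argv[1:]), sys.stdout)
```

Later Ansible releases also accept a `_meta` key with `hostvars` in the `--list` output, which avoids one `--host` call per host.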

I wasn’t using an executable inventory file, but I’ve managed to fix performance by logging in as root with sudo:False. Everything is now reasonably snappy. I do intend to attempt to replicate the original problem, however, and I’ll keep you posted. I’m also now running and testing the dev branch. Fun. Keep up the awesomeness.

While I was looking for a solution, I recalled that ansible has a flexible transport, and ssh is just the default. I’ve heard (but not seen in person) that mcollective is very quick, if you’re ready to swallow the overhead of a STOMP server. mcollective might make a great transport for ansible on largish networks. Just a thought.

That’s a horrible technology choice for numerous reasons, though RabbitMQ in general may be worth a discussion.

I am not opposed to, nor have I ever been opposed to, something basic on top of RabbitMQ. This would mean an optional Ansible daemon, which is not a priority for 0.5; maybe 0.6. HOWEVER…

Matthew has done some nice work using libssh2, which is a LOT faster than paramiko, and that will be available in 0.5. To avoid diluting focus, I won’t even consider a message bus until that release goes out. With the playbook optimizations in 0.5 plus libssh2, I suspect it will be totally unnecessary.

There’s also ansible-pull, which Stephen Fromm wrote. It’s awesome and it scales like mad, with git and cron … and no ssh. For many people that will be sufficient, and it TOO will be growing in 0.5.

If that is still important, we have the nuclear option of writing our own daemon, doing crypto, and all that stuff… but no way will we ever use mco.