ssh performance - follow up

Hello,

I mentioned on “ansible project” that I had difficulties with the performance of ansible on my “network”. In fact, I try to use ansible on a set of remote machines that are not hosted by the same provider.

The ssh round trip time is not very good.

  • my playbook of ~170 tasks was taking more than 20 minutes with paramiko
  • between 6 and 8 minutes with ssh + control persist, approximately the same with accelerate.

I drilled down (-vvv) to understand why it was still taking this long. I use a lot of sudo_user != root so for every command I have a minimum of :

  • one ssh roundtrip to create tmp directories
  • one ssh roundtrip to put the command file
  • one ssh roundtrip to chmod the file (sudo_user != root)
  • one ssh roundtrip to execute the command with the sudo_user
  • one ssh roundtrip to remove the tmp directories (sudo_user != root)

I have a working prototype that fully works on my playbook where this is minimized to one roundtrip instead of five ; my playbook now takes between 2-4 minutes (I’d say 2x gain over controlpersist in my slow network use case)

Basically for the modified “connection_plugin/ssh.py” :

  • no tmp files are created in the main case
  • the module is compressed on-the-fly (zlib) instead of written on the remote via scp/sftp
  • the compressed module is piped over ssh to the remote sudo_user (directly to the sudo_user, no need to chmod)
  • the remote sudo_user decompresses the module on-the-fly and pipes it to the extracted shebang

It is still a prototype and I had to take some shortcuts to make a proof of concept (only ‘new’ modules, only tested with ssh keys and not with password, …).

Do you think the speed boost and the lowering of the # of ssh roundtrips are sufficient to find room in the already busy ansible roadmap ?

I would really like if we could work on at least making it possible to use this as an alternative connection_plugin.
This would certainly mean a few additional features in the connection_plugin API, but the difference can be very light.

Tell me what you think. I’ll clean up the code in the next few days and send a github link with the diff.

So first off, sounds brilliant, especially where you have a high-latency connection where you can’t run Ansible more locally to the environment being controlled (which we almost always recommend).

I’d be super interested in seeing the implementation.

When changing the transport, though, I know we’ve arrived at fixes for a lot of corner cases – OS support, differing levels of SSH, etc. This sounds great, but we’re going to be a little conservative in testing.

I think my proposal – and the easiest way to do this, is to name the connection plugin something like ‘ssh_experimental’ and we can evolve it and see how it plays out, assuming the pull request is solid.
Provided that it is, we can keep it in tree for a while and see how development evolves, with the understanding that we might NOT ship it (TBD).

It probably wouldn’t become the default for a while but it sounds like it could be worth exploring.

Not supporting non-Python modules could be a bit of a deal-breaker, so we’d probably still have to keep the old code around or find for a way of it to work.

Can you start with a pull request we can use to start off the conversation? Again, I’d suggest naming it something like “ssh_alt.py” or something, and then we can more easily share with people without commiting to it.

In any case, I’d like to explore things!

Thanks!

BTW, we’re also going to definitely need --ask-pass and --ask-sudo-pass support if something were to become a default.

If this connection can’t support them, then maybe it can raise errors about that, but I’d much prefer if we can get to one implementation and not have to maintain duplicate code paths for that.

If we have to add some logic to runner to use the legacy transport transparently for “old style” modules, that would be ok though.

If there’s no clean way and we have to tell runner to use the new transport only if no passwords are used, we can go there, but I’d like to avoid that if at all possible.

Again, would be helpful to see the implementation first and we’d have a better idea.

Thinking out load, it may be possible to take the module common code to encompass the sudo logic server-side without too much black magic – but I suspect it might be a bit involved, it would depend on your appetite to try :slight_smile:

ok thanks,

I’ll

  • commit a copy of ssh.py to ssh_alt.py
  • commit the patch afterwards
    this way you’ll see the diff on ssh_alt.py

I think it can support the non-python module ; Brian Coca made me notice that “zlib compression” + “ssh -C” is probably not necessary so we can probably dump the compress / uncompress part and leave it so “ssh -C”

It hope it is possible to make it work with -ask-pass or -ask-sudo-pass (I did not remove the sudo -S option for example)

it may be possible to take the module common code to encompass the sudo logic server-side without too much black magic

this I did not understand.

do you think that things like “_make_tmp_path” could be removed from runner/init.py and put into the connection_plugin ? It seems to me that the “files moving around” strategy belongs more to the connection_plugin than to the runner.

Thanks
Jerome

it may be possible to take the module common code to encompass the sudo logic server-side without too much black magic

ok in fact I thought there was inheritance at the connection_plugin level but i was wrong now that i am cleaning my code.

can you give me an advice on feature detection ?

I need runner to know whether conn has the pipelining feature or not : should I just add a self.has_exec_pipelining = True/False on all connection_plugins ? do you have a preference for this ?

Thanks

"ok in fact I thought there was inheritance at the connection_plugin level but i was wrong now that i am cleaning my code.

Yeah, not so much, however, I’m not against moving some things into something like lib/ansible/connection_tools.py if it makes sense to share things – which it probably does. Even a base class could come from there while keeping the plugin architecture.

“can you give me an advice on feature detection ?”

Not sure I completely understand this one. There’s really only one major “feature detect” right now, which is a test to see what openssh is capable of in Runner, which just runs once right now. If there are things the connection can’t do, it would be ok if runner decides “hey this connection can’t do foo, let’s fallback”, if need be. But if that can be avoided that’s great too.

“I need runner to know whether conn has the pipelining feature or not : should I just add a self.has_exec_pipelining = True/False on all connection_plugins ? do you have a preference for this ?”

That seems fine – I have it on my list to refactor runner aggressively at some point to make it easier to follow, but as a general rule:

(A) I (for good or bad) don’t trust other folks to make large scale refactors of code – better to just add the new thing, no test will ever prove something is completely safe and with Runner logic I want to be very very careful.

(B) I get nervous when major new features and major refactorings done at the same time – because you have to test all the old stuff, and all the new stuff all over again.

Thus I’d probably try to change runner as little as possible right now, so we could be most comfortable with it, and we can worry about cleaning up the code later to make it easy.

So yeah, the ‘pipeline’ as an option in Runner seems ok, and then we can see about moving that out to something like the proposed connection_tools or another utilities file and simplifying after.

(Another thing we should probably do at some point is retool the multiprocessing/fork code so the same host gets the same fork every time, which will also enable some things like not closing the paramiko connections between runs, but… as I’ve said before, the only reason paramiko is there is because RHEL doesn’t ship a new enough openssh client. This is why we default to openssh automatically on platforms that can do ControlPersist)

Hope this helps – let me know if you have any other questions or I’m missing something!

Ah, forgot to mention…

It should be possible for plugins to inherit from other plugins, so if it makes sense for ssh_alt to reuse some common functions in the ssh connection, that’s an option too.

For base class purposes, we might just need to teach the plugin loader class to ignore init.py, though it should already do that. I forget :slight_smile:

ok. I a also a firm believer of baby-steps refactoring ; especially when modifying the runner/ssh impl of ansible :wink:

do you have an advice for “ssh -C”. should I leave it to the user to compress the ssh channel or add “-C” in there ?

My gut feel is it’s fine to add it in by code rather than requiring the user add it to ANSIBLE_SSH_ARGS, etc, because most people will not remember to.

Unless it causes problems.

Hello,

you can see the first version of ssh_alt.py here :
https://github.com/jeromew/ansible/commit/0e4757a43ed1c8cdee31399b2884de0cc94d4586

It works in my tests with --ask-pass (-k) and --ask-sudo-pass (-K)

This implementation still has drawbacks,

  • did not test with async

  • took a shortcut with _make_tmp_path by not creating it in ssh_alt. it will probably break all cases calling _transfer_str

  • did not test with accelerate

  • did not test with many things I suppose :wink:

I wanted to push the code so you can test and give me feedback on the overall structure of the patch

  • have your insights about what will obviously break here so we can make a list of specific use cases / commands to watch out for.

I made it so that we can easily check that the other connection_plugins are unaffected.

ssh_alt could probably inherit from ssh since only _exec_command() is modified ; depends on whether you want to make ssh_alt the default.

note that apart from adding “-C” to the ssh args, the new _exec_command has exactly the same code as before when indata=None
(all new code is guarded by indata) so ssh_alt.py would still be pretty safe as a direct replacement of ssh.py.

The code in the runner is I think safe as long as conn.has_pipelining = False.

Tell me what you think & test with simple playbooks first :wink:

example :

  • hosts: all
    gather_facts: no
    tasks:

  • command: echo “ping”
    with_sequence: count=10

Jerome

This implementation still has drawbacks,

Don’t have time to look too closely today, but this is remarkably shorter than I thought it would be! Provided we don’t hit too much weirdness, I think it could easily merge into the main connection ssh.py during this release and understand the conditions where it needed to not optimize things.

(also agreed about Async, connection plugins don’t know anything about it)

Some minor things.

Super minor but important code conventions: indata => in_data.

Error messages should probably be tweaked a little. Rather than “this connection plugin does not support indata|exec_command” it should probably say something like “Internal Error: this module does not support optimized module pipelining” or something like that. I know that’s an internal error message that the app shouldn’t hit, so maybe add a comment about that above it.

+            raise errors.AnsibleError("this connection plugin does not support indata|exec_command")

Some minor things.

Super minor but important code conventions: indata => in_data.

done

Error messages should probably be tweaked a little. Rather than “this connection plugin does not support indata|exec_command” it should probably say something like “Internal Error: this module does not support optimized module pipelining” or something like that. I know that’s an internal error message that the app shouldn’t hit, so maybe add a comment about that above it.

+            raise errors.AnsibleError("this connection plugin does not support indata|exec_command")

done

With this one:

 if indata:
+                raise Exception("SSH process should not use a pseudo terminal"

I’m curious how the code gets into that scenario? This is another internal error we don’t expect to hit, I’m guessing?

This is not a real exception.
it is a trick to go straight to the except: part of the try without duplicating code when in_data is not None.

I modified it to

         raise Exception("no pty => force except: case of try block")

the squashed commit is here

https://github.com/jeromew/ansible/commit/12a7f3040c479228a6d658cca335f22d0ce7e11a

branch https://github.com/jeromew/ansible/tree/ansible_ssh_alt

Jerome

“This is not a real exception.
it is a trick to go straight to the except: part of the try without duplicating code when in_data is not None.”

Hmm, exception emulating goto :slight_smile:

Yeah, let’s not.

Move the code that would-be-duplicated into a function and use an if test if need be.

Ok i’ll do something about It ; maybe not a function that´s just a call to popen here.

I removed the “goto like exception”.
squashed branch: https://github.com/jeromew/ansible/commit/15ff3c7fcde0360d3e5138201df8ef2153a231dd

tell me I you prefer that I squash the branch everytime of if you prefer seeing the difference.

we will discuss that after you look at the code, but here is a first footprint of what might break with ssh_alt because of the not executing _make_tmp_path.

It would be interesting to have feedback on those to see how they could be modified. I could for example

  • add a conn.put_source function that would pipe data directly to a file at the correct destination using source | “cat > dest”
  • use the shebang extraction technique to exec a script at its correct destination (script: module)

since no tmp_path is created, calls to put_file can be suspected :

./lib/ansible/runner/action_plugins/script.py : send the script file before execution
./lib/ansible/runner/action_plugins/unarchive.py : send the file that need to be unarchived in tmp.‘source’

./lib/ansible/runner/action_plugins/copy.py : copy would need to be checked, raw mode sends the file in tmp.‘source’

./lib/ansible/runner/init.py in _transfer_str

calls to _transfer_str can be suspected:

./lib/ansible/runner/action_plugins/assemble.py: transfer the resultant in tmp.‘src’. not very clear to me as the file is transferred and then the copy module
./lib/ansible/runner/action_plugins/template.py: same as assemble.py

./lib/ansible/runner/init.py in :

OK - def execute_module : transfer the file when not in pipelining mode

- def execute_module : transfer arguments when module_type != new. what are those ?

- def _copy_module : copy a module code to remote

calls to _copy_module can be suspected:

./lib/ansible/runner/action_plugins/async.py: not checked

Jerome

Yep, it basically is going to need some logic to decide what calls it should optimize and which it should not.

Another thing which may be something to look at, if not executing the module down the standard path, the module replacer code may need to replace the shebang line if ansible_python_interpreter is set.

regarding the shebang / ansible_python_interpreter, my modifications still rely on the same code as before (calling ModuleReplacer) to grab the module content & the modified shebang.

you will see that I extracted from _copy_module a new function called _configure_module that both my code & _copy_module now use.

I don’t think there is something more to do on this side except if I misunderstand something.

Jerome