connection locking in v2

Hi.

I investigated the locking in v2, and thought I would write down my
findings here for later discussion.

First, the motivation: if you run ansible against multiple hosts, and
some of them need host key verification (i.e. ssh issues a prompt and
waits for you to enter "yes"), then:

    1. The prompts for affected hosts should be displayed one by one.
    2. Output from hosts that don't need prompting shouldn't be shown
       while any prompt is active.

#1 is a matter of inter-process locking in ssh.py, and is the subject of
this mail.

#2 is a matter of locking in display.py. @bcoca said on IRC that this is
complicated, because the locking in display doesn't interact well with
serialisation. So it's a necessary piece, but I haven't looked into it
yet. (Note that display.py currently only locks debug output.)

There's a commented-out lock_host_keys() method in ssh.py with the
following comment:

    # lock around the initial SSH connectivity so the user prompt about
    # whether to add the host to known hosts is not intermingled with
    # multiprocess output.

The way it is supposed to work is as follows (note that I'm describing
the intention here; the actual implementation has various problems[1]):

    1. We acquire the lock before exec()ing ssh, iff host key checking
       is enabled and the host key is not already known.
    2. We release the lock after the connection is done.

Here's what I propose instead, after a brief discussion on IRC with
@jimi-c:

    1. If host key checking is enabled, acquire the lock. (Without
       checking if the key is known.)
    2. Add an "echo BECOME-SUCCESS"-style command (CONNECT-SUCCESS?).
    3. As soon as we detect this string in the output, we can unlock the
       connection, because we know that we're past host key verification
       already.

This means that we would (a) hold the lock for a shorter period, not the
entire connection, (b) not have to scan/rescan known_hosts at all.
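
A minimal sketch of the early-unlock idea, assuming a hypothetical
CONNECT-SUCCESS marker and a caller-supplied unlock callback (neither
is real ssh.py code):

```python
# Sketch of the proposed early unlock: once the CONNECT-SUCCESS marker
# appears in the remote output, host key verification must already
# have happened, so the lock can be released before the command
# finishes. The marker may be split across read() chunks, so we
# accumulate output until it is found.

MARKER = b"CONNECT-SUCCESS"

def scan_and_unlock(chunks, unlock):
    """Pass output chunks through; call unlock() once MARKER is seen."""
    seen = False
    buf = b""
    for chunk in chunks:
        if not seen:
            buf += chunk
            if MARKER in buf:
                unlock()
                seen = True
        yield chunk
```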

(I would have liked to be able to detect that we've connected without
having to add another magic string, but since we don't run ssh with -v
all the time, I couldn't think of a way to do it. Suggestions welcome.)

https://github.com/amenonsen/ansible/tree/connection-locking has a
prototype patch to reintroduce the connection lock. It's a temporary
file opened pre-fork in the TaskQueueManager, whose fd is passed in to
the PlayContext, serialised, and made available to workers. The basic
bits of the patch are simple, the only real question is about where to
call self.lock_connection()/self.unlock_connection() in ssh.py[2]. I've
not implemented the magic key stuff yet.
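
The pre-fork fd idea can be sketched like this (the names are
illustrative; the real patch threads the fd through the PlayContext
rather than passing it directly):

```python
import os
import fcntl
import tempfile

# Sketch of the pre-fork lockfile idea: the parent (standing in for
# the TaskQueueManager) opens a temporary file before forking; workers
# inherit the fd and take fcntl locks on it. fcntl record locks are
# per-process, so locks taken on the inherited fd by different workers
# conflict with each other as intended.

def make_lock_fd():
    fd, path = tempfile.mkstemp()
    os.unlink(path)  # anonymous once unlinked; the fd survives fork
    return fd

def worker(lock_fd):
    fcntl.lockf(lock_fd, fcntl.LOCK_EX)
    try:
        pass  # prompt-sensitive connection setup would happen here
    finally:
        fcntl.lockf(lock_fd, fcntl.LOCK_UN)
```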

-- Abhijit

1. The biggest problem in lock_host_keys() as written is that it calls
   self.not_in_host_file() (a) twice, (b) in such a way that if ssh adds
   the host key to the known hosts file during the connection, as it's
   expected to, the lockfile won't be unlocked. But let's ignore that.

2. ssh.py also locks the creation of the ControlPath directory. This is
   disabled in devel for want of a prepare_writable_dir function. If the
   function were added, the connection-locking code would just work.

   Also, paramiko_ssh.py is simpler, because host key verification is a
   callback. The connection-locking tree already works there.

Update: @jimi-c merged two patches:

1. Make ssh.py's lock_host_keys() method a noop (as noted earlier, it
   didn't do any locking, but it still scanned known_hosts files once on
   lock and once on unlock for every connection). PR #12195

2. The connection-locking infrastructure (temporary file opened in
   TaskQueueManager, whose fd is passed in to PlayContext, plus
   lock/unlock_connection() methods in ConnectionBase). PR #12212

So the current situation is that paramiko is fine, but after discussion
we left ssh.py locking alone, so this part of my original mail is still
an open question.

-- Abhijit

Thanks for the detailed analysis, Abhijit. Much appreciated.

Another update: I've implemented "sensible" unlocking in ssh.py; see
https://github.com/amenonsen/ansible/blob/ssh-locking/lib/ansible/plugins/connections/ssh.py

I made changes to ssh.py in two phases (separate commits in my
https://github.com/amenonsen/ansible/tree/ssh-locking tree):

1. I reorganised the methods so that responsibilities are clearly
   demarcated: _build_command builds the command-line, _run does all of
   the connection handling, and exec_command/put_file/fetch_file are
   just wrappers around both of the above.

   This is mostly just moving code around; there are some functional
   changes, but they're minor and incidental (e.g. caching the existence
   of sshpass rather than testing it every connection).

2. I turned the internals of _run into a proper state machine. I found
   the earlier code hard to follow (it was split across multiple
   functions, part of it was non-blocking and part blocking, etc.), and
   didn't feel confident about implementing the unlocking just by
   sprinkling tests inside both loops.
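
To illustrate the shape of such a state machine (the states and
markers here are illustrative, not taken from the actual ssh.py code):

```python
# Minimal sketch of the state-machine idea behind the rewritten _run():
# explicit states instead of flags scattered across two loops. Output
# before privilege escalation completes (prompts, the success marker)
# is consumed by state transitions rather than returned to the caller.

AWAITING_PROMPT, AWAITING_SUCCESS, READY = range(3)

def process(lines, send_password):
    state = AWAITING_PROMPT
    output = []
    for line in lines:
        if state == AWAITING_PROMPT and b"password:" in line:
            send_password()          # answer the sudo/su prompt
            state = AWAITING_SUCCESS
        elif state in (AWAITING_PROMPT, AWAITING_SUCCESS) and b"BECOME-SUCCESS" in line:
            state = READY            # escalation done; real output follows
        elif state == READY:
            output.append(line)
    return output
```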

The current status is that the new code works with every combination of
sshpass/sudo/su/pipelining that I've tried it with. There are some small
tweaks needed (it doesn't remove the sudo prompt from the output the way
the old code did, for example), but those are easy.

Feedback welcome: https://github.com/amenonsen/ansible/tree/ssh-locking

-- Abhijit