RHEL6.6 and ControlPersist

Hi,

As some of you may know, Red Hat backported the ControlPersist functionality to the OpenSSH version that ships with RHEL6.

This is terrific since RHEL users can now use this technique to speed up Ansible.

However, in my testing the very first connection fails, i.e. the one made before the persistent connection has been set up. Any subsequent connection seems to work fine, but a failing first connection obviously breaks Ansible.

I think this is a bug. Has anyone tested this?
Or am I doing something wrong here?

Dag Wieers wrote:

However, after some testing it seems to fail for the very first connection. What happens is that the first connection, when the persistent connection has not been set up yet, fails. Any subsequent connection seems to work fine, but obviously this fails to work properly with Ansible.

I think this is a bug, has anyone tested this ? Or am I doing something wrong here ?

I noticed this after updating to one of the later test packages from the bugzilla ticket where this was requested (it did not occur initially). I also thought it might be something peculiar to my environment and didn't notice it soon enough to bring it up in the ticket, unfortunately.

I haven't installed the final update yet (on CentOS), but I was hoping that it might have corrected the problem. At least I planned to test the latest package before considering it a package bug.

Knowing it seems to affect more than me, I imagine a new bug report is in order to resolve the problem.

It is fantastic to have ControlPersist on EL6 though. :wink:

“I think this is a bug, has anyone tested this ?”

Sounds like this should be reported with RHEL, definitely.

Please do and post the bugzilla here if you can.

While we could add special code to say “don’t try CP on EL6 even if ssh says it CAN CP”, it seems that shipping the feature broken is something they would have detected.

I ran into this as well, but haven't had too much time to isolate it.

However, I did find that ControlPersist works with Ansible, but if you
enable pipelining, it hangs on the first connection.

So, perhaps there's an issue around pipelining?

kevin

Kevin Fenzi wrote:

I ran into this as well, but haven't had too much time to isolate it.

However, I did find that ControlPersist works with Ansible, but if you enable pipelining, it hangs on the first connection.

So, perhaps there's an issue around pipelining?

With pipelining = True commented out on EL6, things still fail for me on the first run. They fail with a different error, but perhaps that's just due to slightly different code paths with pipelining on and off; I haven't looked at the Ansible code at all.

In case it matters, I adjusted the control_path variable to avoid using ~/.ansible since $HOME is on NFS in my environment.

With pipelining disabled:

$ ansible web -m ping -o
...
web12 | FAILED => failed to resolve remote temporary directory from /var/tmp/ansible-tmp-1413420275.15-6744099385111: `mkdir -p /var/tmp/ansible-tmp-1413420275.15-6744099385111 && chmod a+rx /var/tmp/ansible-tmp-1413420275.15-6744099385111 && echo /var/tmp/ansible-tmp-1413420275.15-6744099385111` returned empty string

With pipelining enabled:

$ ansible web -m ping -o
...
web12 | FAILED >> {"failed": true, "msg": "", "parsed": false}

This is still using openssh-5.3p1-100.el6.x86_64 which was a scratch build that Petr made per RHBZ #953088. I have not checked whether there are any differences in the -104 packages included in the latest RHEL updates (and they haven't made it to my CentOS mirror yet).

This is also with ansible-1.6.10-1.el6 from EPEL. I have not yet updated to 1.7.x (I see 1.7.2-1 is in epel-testing now, so I'll probably wait for that to hit stable).

HTH,

ansible-1.7.2-1.el6.noarch
openssh-5.3p1-104.el6.x86_64

with pipeline disabled:

% ansible -m ping -o junk02\*
junk02.phx2.fedoraproject.org | success >> {"changed": false, "ping": "pong"}

with pipeline enabled:

% ansible -m ping -o junk02\*
junk02.phx2.fedoraproject.org | FAILED >> {"failed": true, "msg": "", "parsed": false}

kevin

Same results here. I have ansible 1.7.2-2 and openssh-5.3p1-104 (installed from the CentOS Continuous Releases Repository), on CentOS 6.5.

“with pipeline disabled:”

Hmmmmm, curious.

Worst case we could detect RHEL 6 and auto-disable pipelining on that platform, what say ye?

Just to clarify – I was seeing the same results as Kevin, not Todd. Removing “pipelining = True” from my config fixed the problem.

Since that’s the default, I don’t know if you really need to change Ansible – it never worked on RHEL6 before, so nothing’s changed in that regard. But I hope they’re able to fix this issue in OpenSSH…did someone report it?
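
For anyone who wants to check which case they are in, here is a minimal shell sketch that tests whether a given ansible.cfg has pipelining switched on. The function name `pipelining_enabled` is made up for illustration, and the check is deliberately naive: it looks for an uncommented `pipelining = True` style line anywhere in the file and does not verify which section it sits in.

```shell
#!/bin/sh
# Hypothetical helper (not part of Ansible): succeeds if the config file
# given as $1 contains an uncommented "pipelining = True/yes/1" line.
pipelining_enabled() {
    grep -Eiq '^[[:space:]]*pipelining[[:space:]]*=[[:space:]]*(true|yes|1)[[:space:]]*$' "$1"
}
```

Something like `pipelining_enabled /etc/ansible/ansible.cfg && echo "pipelining is on"` would then tell you whether the setting is active (adjust the path to wherever your config actually lives).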

Jacob Weber wrote:

Just to clarify -- I was seeing the same results as Kevin, not Todd. Removing "pipelining = True" from my config fixed the problem.

I've since updated to 1.7.2 from epel-testing and I see the same results as you and Kevin. I've disabled pipelining as well for now, but it's definitely a performance hit (better to be slower and accurate though).

But I hope they're able to fix this issue in OpenSSH....did someone report it?

I haven't seen anything myself. I don't have a RHEL contract so I was hoping someone that did might file a ticket so it's more likely to get attention.

I’ve filed a GitHub issue for now to include (in 1.8) a check to auto-disable pipelining on RHEL 6.6+ (but not EL7), which should resolve most of the confusion.

We also may make it print a warning if it was on.

But yeah, bugzilla seems appropriate.

Bugzilla from someone with a nice friendly Red Hat TAM even more so :slight_smile:
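
In the meantime, something similar can be approximated on the control host. The sketch below is not the proposed Ansible change, just a hypothetical shell helper (`is_el6_6_or_later` is an invented name) that parses a release string of the kind found in /etc/redhat-release and succeeds only for EL 6.6 and later 6.x minors, so EL7 is left alone.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the release string in $1 describes
# EL 6.6 or a later 6.x minor release; fails for 6.5 and older, and for EL7.
is_el6_6_or_later() {
    # Extract "MAJOR MINOR" from text such as "... release 6.6 (Santiago)".
    ver=$(printf '%s\n' "$1" |
        sed -n 's/.*release \([0-9][0-9]*\)\.\([0-9][0-9]*\).*/\1 \2/p')
    [ -n "$ver" ] || return 1
    major=${ver% *}
    minor=${ver#* }
    [ "$major" -eq 6 ] && [ "$minor" -ge 6 ]
}
```

For example, `is_el6_6_or_later "$(cat /etc/redhat-release)" && echo "consider turning pipelining off"` could go at the top of a wrapper script.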

I've submitted a ticket to my team's TAM through Red Hat support
channels. If a public bugzilla comes out of it as a side effect, I'll
link it here.

-AdamM

When I saw this discussion thread, I was thrilled because I am using OEL6.5 with OpenSSH-5.3. Since that is equivalent to RHEL6.5, that meant that there should be an update to OpenSSH for OEL too. Sure enough, there is (openssh-5.3p1-104.el6.x86_64). But when I installed it, I could no longer use Ansible to copy files to other servers. I did not make any other changes (ssh_args, scp_if_ssh, control_path, and pipelining are all still commented out), and my ssh_config does not include any of the “Control*” parameters. Output from my quick tests is below:

Without “scp_if_ssh”:

sinudy36-> ansible sinudm07 -m copy -a "src=testfile dest=/var/tmp"
sinudm07 | FAILED >> {
    "failed": true,
    "md5sum": "d41d8cd98f00b204e9800998ecf8427e",
    "msg": "\u001b]2;pdxmft @ :/home/pdxmft\u0007/usr/bin/python: can't open file '\u001b]2': [Errno 2] No such file or directory\r\n/bin/sh: pdxmft: command not found\r\n",
    "parsed": false
}

With “scp_if_ssh”:

sinudy36-> ansible sinudm07 -m copy -a "src=testfile dest=/var/tmp"
sinudm07 | FAILED => failed to transfer file to /home/pdxmft/.ansible/tmp/ansible-tmp-1414615053.51-185971585352740/source:

scp: /home/pdxmft/.ansible/tmp/ansible-tmp-1414615053.51-185971585352740/source: No such file or directory

This is identical to what I observed when I specified ssh instead of paramiko (-c ssh) before upgrading OpenSSH. I opened another thread about that yesterday (Cannot copy a file to a server when using ssh) before I saw this thread. At that time I had to specify “-c ssh” on the command line to trigger the failure, whereas now it happens regardless of what I do. Meanwhile, normal scp and sftp from the command line work just fine; I get the failures only when I use Ansible to copy files.
Something about the way that Ansible is calling scp or sftp appears to trigger this bug in the latest version of OpenSSH. This will be a major problem as it now means I cannot upgrade OpenSSH for any reason and still be able to use Ansible. For now I have rolled back to openssh-5.3p1-94.el6. I hope somebody finds a solution soon.
-Mark

I deployed a new OEL6.5 server, then upgraded OpenSSH to the new ControlPersist release and installed Ansible onto it. Without making any configuration changes to anything other than adding server names to the hosts file, I tried to use Ansible to copy a file to another server, and it failed as before. I downgraded OpenSSH to the previous version and tried the copy again. It worked perfectly.
So, there is definitely a mismatch of some sort between Ansible and the newer release of OpenSSH on Linux 6.
-Mark

I ran into similar issues using the new ControlPersist option as well as the ProxyCommand option. A Red Hat bugzilla was created that has the details. I think the part in comment 1 starting at “The commands and output below show that ControlPersist=yes does not work as expected.” is what you are referring to. There is a patch for openssh-5.3p1-104.el6.src.rpm attached to the bug that is from me backporting code from the RHEL 7 openssh related to the ControlPersist option. I don’t run Ansible, so I have no way of testing to see if the patch fixes your issue. However, I would be interested to know if it does.

just a heads up,

I run RH6.5 and can't upgrade to 6.6 at the moment (and it looks like that wouldn't help either). I have worked around the ControlPersist issue by installing an OpenSSH 6 client on my control host (/opt/openssh6).
I then have a wrapper script that calls ansible-playbook and sets the PATH to pick up ssh and friends from /opt/openssh6/bin before /usr/bin.

Because it only uses the OpenSSH client (no daemons running), there's no conflict with the normal Red Hat packages.
It's so much faster.
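
A wrapper along the lines described above might look roughly like this. It is only a sketch, not the poster's actual script: the function name `with_openssh6` and the `OPENSSH6_BIN` variable are invented here, and the only detail taken from the post is that /opt/openssh6/bin goes on PATH ahead of /usr/bin.

```shell
#!/bin/sh
# Assumed install location of the private OpenSSH client (from the post above);
# override by setting OPENSSH6_BIN in the environment.
OPENSSH6_BIN="${OPENSSH6_BIN:-/opt/openssh6/bin}"

# Run the given command with the private client directory first on PATH.
# The subshell keeps the PATH change from leaking into the calling shell.
with_openssh6() {
    (
        PATH="$OPENSSH6_BIN:$PATH"
        export PATH
        exec "$@"
    )
}
```

Saved as a script with a final line such as `with_openssh6 ansible-playbook "$@"`, it forwards all arguments to ansible-playbook, which then finds ssh, scp and sftp in the private directory before the system ones.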

Let me break the news: Red Hat has released an openssh update that fixes the reported issue with ControlPersist.

    * Thu Nov 06 2014 Petr Lautrbach <plautrba@redhat.com> 5.3p1-104.1
    - Fix ControlPersist option with ProxyCommand (#1160487)

And it works well. Joy !
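
If you want to confirm that a given box already has the fix, one rough approach is to look for the bug number from the changelog entry above in the installed package's changelog. This is a sketch under assumptions: `has_controlpersist_fix` is an invented helper name, and it only checks that bug 1160487 is mentioned, not that the package is otherwise current.

```shell
#!/bin/sh
# Hypothetical helper: reads an RPM changelog on stdin and succeeds if it
# mentions the ControlPersist fix tracked as Red Hat bug #1160487.
has_controlpersist_fix() {
    grep -qF '#1160487'
}
```

On an EL6 host it could be used as `rpm -q --changelog openssh | has_controlpersist_fix && echo "fix present"`.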

I checked the Oracle repository and found openssh-5.3p1-104.el6_6.1. I installed that and tested. Nice!!! It looks like that patch fixed it.
-Mark

Got it on CentOS too, and turned pipelining on. Can’t say that I’m seeing much of a performance difference, but I’m not getting the errors either. I’ll do some testing on a longer playbook later.

Yeah, my ~50 minute playbook is down to about 46 minutes now. Not sure why I’m not seeing the difference that others are. I do see the ControlPersist files being created in ~ansible/cp. It’s running about 80 plays on each of about 20 hosts. I guess the SSH part of Ansible wasn’t adding that much overhead to begin with.