mysterious failures with truncated output, after upgrading managed machine to 4.2.0 kernel

I am getting some really hard to understand failures.

I have three managed machines, running Ubuntu 14.04.03 — which normally has Linux kernel 3.13. I run Ansible 1.9.4 on my Mac (OS 10.10.5) to manage them. The managed machines are VMs in a cloud service, on a datacenter subnet, and I use an OpenVPN tunnel between them and my Mac. I have a playbook, and it runs. All is well. BTW: I have no ansible.cfg anywhere (I set the SL_INVENTORY envar to point at the appropriate dynamic inventory program), and none of my SSH config files say anything about TCPKeepAlive or ServerAlive.

Then on the managed machines I run (as root) apt-get install linux-generic-lts-wily, which upgrades the kernel to 4.2. After that my Ansible is flaky. When I run my playbook, usually one or two of the three managed machines fails the gather-facts; the remainder fail in the first few tasks. Which fails where is variable, so I think this is due to some intermittent problem.

Thinking it might be a connectivity flakiness, I created ansible.cfg that specifies a timeout of 100, and set TCPKeepAlive no and ServerAliveCountMax 100 in my SSH client config (leaving ServerAliveInterval unspecified everywhere, which I think disables that function). These did not solve the problem. The playbook runs to full failure in under 30 seconds.

The failure message shows a truncated reply from the remote machine and then some SSH debugging that ends with a complaint about a broken pipe. Here is an example from the gather-facts step (I passed -vvvv to ansible-playbook):

GATHERING FACTS ***************************************************************
<10.140.155.166> ESTABLISH CONNECTION FOR USER: mspreitzer
<10.140.155.162> ESTABLISH CONNECTION FOR USER: mspreitzer
<10.140.155.166> REMOTE_MODULE setup
<10.140.155.161> ESTABLISH CONNECTION FOR USER: mspreitzer
<10.140.155.162> REMOTE_MODULE setup
<10.140.155.161> REMOTE_MODULE setup
<10.140.155.166> EXEC ssh -C -tt -vvv -o ControlMaster=auto -o ControlPersist=60s -o ControlPath=“/Users/mspreitz/.ansible/cp/ansible-ssh-%h-%p-%r” -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=mspreitzer -o ConnectTimeout=100 10.140.155.166 /bin/sh -c ‘mkdir -p $HOME/.ansible/tmp/ansible-tmp-1454784058.99-138766034918573 && chmod a+rx $HOME/.ansible/tmp/ansible-tmp-1454784058.99-138766034918573 && echo $HOME/.ansible/tmp/ansible-tmp-1454784058.99-138766034918573’
<10.140.155.162> EXEC ssh -C -tt -vvv -o ControlMaster=auto -o ControlPersist=60s -o ControlPath=“/Users/mspreitz/.ansible/cp/ansible-ssh-%h-%p-%r” -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=mspreitzer -o ConnectTimeout=100 10.140.155.162 /bin/sh -c ‘mkdir -p $HOME/.ansible/tmp/ansible-tmp-1454784058.99-66302778668274 && chmod a+rx $HOME/.ansible/tmp/ansible-tmp-1454784058.99-66302778668274 && echo $HOME/.ansible/tmp/ansible-tmp-1454784058.99-66302778668274’
<10.140.155.161> EXEC ssh -C -tt -vvv -o ControlMaster=auto -o ControlPersist=60s -o ControlPath=“/Users/mspreitz/.ansible/cp/ansible-ssh-%h-%p-%r” -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=mspreitzer -o ConnectTimeout=100 10.140.155.161 /bin/sh -c ‘mkdir -p $HOME/.ansible/tmp/ansible-tmp-1454784058.99-221770072514077 && chmod a+rx $HOME/.ansible/tmp/ansible-tmp-1454784058.99-221770072514077 && echo $HOME/.ansible/tmp/ansible-tmp-1454784058.99-221770072514077’
<10.140.155.166> PUT /var/folders/cd/d050bcgj7xd6lkbfhnb9bns80000gp/T/tmpFwQwuI TO setup
<10.140.155.161> PUT /var/folders/cd/d050bcgj7xd6lkbfhnb9bns80000gp/T/tmpcY2it_ TO setup
<10.140.155.162> PUT /var/folders/cd/d050bcgj7xd6lkbfhnb9bns80000gp/T/tmpuXklyy TO setup
<10.140.155.166> EXEC ssh -C -tt -vvv -o ControlMaster=auto -o ControlPersist=60s -o ControlPath=“/Users/mspreitz/.ansible/cp/ansible-ssh-%h-%p-%r” -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=mspreitzer -o ConnectTimeout=100 10.140.155.166 /bin/sh -c ‘sudo -k && sudo -H -S -p “[sudo via ansible, key=spxlnvdjayufeyagwywqhtdtptisrnej] password: " -u root /bin/sh -c '”’“‘echo BECOME-SUCCESS-spxlnvdjayufeyagwywqhtdtptisrnej; LANG=en_US.UTF-8 LC_CTYPE=en_US.UTF-8 /usr/bin/python setup’”‘"’’
<10.140.155.161> EXEC ssh -C -tt -vvv -o ControlMaster=auto -o ControlPersist=60s -o ControlPath=“/Users/mspreitz/.ansible/cp/ansible-ssh-%h-%p-%r” -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=mspreitzer -o ConnectTimeout=100 10.140.155.161 /bin/sh -c ‘sudo -k && sudo -H -S -p “[sudo via ansible, key=dprrxybimhltrqziynaikzahzsgdjmlc] password: " -u root /bin/sh -c '”’“‘echo BECOME-SUCCESS-dprrxybimhltrqziynaikzahzsgdjmlc; LANG=en_US.UTF-8 LC_CTYPE=en_US.UTF-8 /usr/bin/python setup’”‘"’’
<10.140.155.162> EXEC ssh -C -tt -vvv -o ControlMaster=auto -o ControlPersist=60s -o ControlPath=“/Users/mspreitz/.ansible/cp/ansible-ssh-%h-%p-%r” -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=mspreitzer -o ConnectTimeout=100 10.140.155.162 /bin/sh -c ‘sudo -k && sudo -H -S -p “[sudo via ansible, key=bnojgeocfbeoxbgakzuluppvtxaviuyr] password: " -u root /bin/sh -c '”’“‘echo BECOME-SUCCESS-bnojgeocfbeoxbgakzuluppvtxaviuyr; LANG=en_US.UTF-8 LC_CTYPE=en_US.UTF-8 /usr/bin/python setup’”‘"’’
failed: [10.140.155.166] => {“failed”: true, “parsed”: false}
BECOME-SUCCESS-spxlnvdjayufeyagwywqhtdtptisrnej
{“verbose_override”: true, “changed”: false, “ansible_facts”: {**<lots of gather-facts output here, but it is not complete/>*OpenSSH_6.2p2, OSSLShim 0.9.8r 8 Dec 2011
debug1: Reading configuration data /Users/mspreitz/.ssh/config
debug1: /Users/mspreitz/.ssh/config line 8: Applying options for 10.

debug1: Reading configuration data /etc/ssh_config
debug1: /etc/ssh_config line 20: Applying options for *
debug1: /etc/ssh_config line 102: Applying options for *
debug1: auto-mux: Trying existing master
debug2: fd 3 setting O_NONBLOCK
debug2: mux_client_hello_exchange: master version 4
debug3: mux_client_forwards: request forwardings: 0 local, 0 remote
debug3: mux_client_request_session: entering
debug3: mux_client_request_alive: entering
debug3: mux_client_request_alive: done pid = 9654
debug3: mux_client_request_session: session request sent
debug1: mux_client_request_session: master session id: 2
debug3: mux_client_read_packet: read header failed: Broken pipe
debug2: Received exit status from master 0
Shared connection to 10.140.155.166 closed.

From the same run, here is a failure at a task for the one machine that succeeded to gather-facts in this case:

TASK: [bootstrap | add any dependent repository keys from url] ****************
<10.140.155.162> ESTABLISH CONNECTION FOR USER: mspreitzer
<10.140.155.162> REMOTE_MODULE apt_key url=http://10.140.132.215:8082/Release.pgp
<10.140.155.162> EXEC ssh -C -tt -vvv -o ControlMaster=auto -o ControlPersist=60s -o ControlPath=“/Users/mspreitz/.ansible/cp/ansible-ssh-%h-%p-%r” -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=mspreitzer -o ConnectTimeout=100 10.140.155.162 /bin/sh -c ‘mkdir -p $HOME/.ansible/tmp/ansible-tmp-1454784077.96-58394575390812 && chmod a+rx $HOME/.ansible/tmp/ansible-tmp-1454784077.96-58394575390812 && echo $HOME/.ansible/tmp/ansible-tmp-1454784077.96-58394575390812’
<10.140.155.162> PUT /var/folders/cd/d050bcgj7xd6lkbfhnb9bns80000gp/T/tmpxHKer5 TO apt_key
<10.140.155.162> EXEC ssh -C -tt -vvv -o ControlMaster=auto -o ControlPersist=60s -o ControlPath=“/Users/mspreitz/.ansible/cp/ansible-ssh-%h-%p-%r” -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=mspreitzer -o ConnectTimeout=100 10.140.155.162 /bin/sh -c ‘sudo -k && sudo -H -S -p “[sudo via ansible, key=jsmmpfgpouhdlwmcvgmwvbhidafftccn] password: " -u root /bin/sh -c '”’“‘echo BECOME-SUCCESS-jsmmpfgpouhdlwmcvgmwvbhidafftccn; LANG=en_US.UTF-8 LC_CTYPE=en_US.UTF-8 /usr/bin/python apt_key’”‘"’’
failed: [10.140.155.162] => (item={‘repo’: u’deb http://10.140.132.215:8082/ docker main’, ‘key_url’: u’http://10.140.132.215:8082/Release.pgp’}) => {“failed”: true, “item”: {“key_url”: “http://10.140.132.215:8082/Release.pgp”, “repo”: “deb http://10.140.132.215:8082/ docker main”}, “parsed”: false}
BECOME-SUCCESS-jsmmpfgpouhdlwmcvgmwvbhidafftccn
OpenSSH_6.2p2, OSSLShim 0.9.8r 8 Dec 2011
debug1: Reading configuration data /Users/mspreitz/.ssh/config
debug1: /Users/mspreitz/.ssh/config line 8: Applying options for 10.*
debug1: Reading configuration data /etc/ssh_config
debug1: /etc/ssh_config line 20: Applying options for *
debug1: /etc/ssh_config line 102: Applying options for *
debug1: auto-mux: Trying existing master
debug2: fd 3 setting O_NONBLOCK
debug2: mux_client_hello_exchange: master version 4
debug3: mux_client_forwards: request forwardings: 0 local, 0 remote
debug3: mux_client_request_session: entering
debug3: mux_client_request_alive: entering
debug3: mux_client_request_alive: done pid = 9658
debug3: mux_client_request_session: session request sent
debug1: mux_client_request_session: master session id: 2
debug3: mux_client_read_packet: read header failed: Broken pipe
debug2: Received exit status from master 0
Shared connection to 10.140.155.162 closed.

In desperation I captured all packets on the VPN network interface on my Mac during that run, then looked at it with Wireshark. I see one long-lived connection for each of the three managed machines, plus one short-lived connection for 10.140.155.162.

I see no end to the long-lived connections. Just to make sure it was not due to a missing buffer flush in the packet capture, I did a ping of one managed machine after ansible-playbook finished; I see the ping in the captured packets. So how can Ansible suffer from truncated replies when the connections stay open? How can the connections stay open after ansible-playbook exits?

The short-lived connection is even stranger. It comes late in the trace. The client closes it soon after the completion of the three-packet handshake that opens the connection. The client sends no TCP payload at all, not even the “client hello” of TLS. The server sends a “server hello”, which prompts the client to respond with a RST. What the heck is going on with that?

Thanks for any clues,
Mike

I'm not sure but it seems it might be related to this issue,
https://github.com/ansible/ansible/issues/13876

it seems that an interaction between openssh and the newer kernels is
closing the ssh connection sooner than it should.