Ansible fact gathering is slow with host_key_checking = True

Spawned from Conversation with Michael on Twitter https://twitter.com/esacteksab/status/514558427217936384

Uncommenting host_key_checking = False, a playbook runs in 35s
Commenting host_key_checking = False, the playbook runs in 9m25s

But with it uncommented, ~10% of the servers return: “SSH Error: data could not be sent to the remote host. Make sure this host can be reached over ssh”

With it commented, no failures, I’m able to communicate with all servers.

This is a topic for to troubleshoot further, because Twitter and 140 chars isn’t all that great.

Ansible is 1.7.1 on Ubuntu 14.04
Servers are a combination of Ubuntu 12.04 and 14.04

Oh, FWIW, I’m touching over 350 servers with this playbook and gathering facts from all of them.

Ok Barry,

We’ll get you sorted before you wander off and lose a limb :slight_smile:

These things seem to be unrelated.

(A)

This has happened in the past when the host key of a host doesn’t appear to Ansible’s ssh.py connection type to be in the known hosts file, and it creates a serial lock to ask you the question about whether it should be added - but for whatever reason, knew it was actually there. The result of this is that --forks is not used on the first task per host, which makes things not be parallel. It’s frustrating.

This was fixed long ago, when we added knowledge about hashed known_hosts entries, and should be quite good today, especially on a well tested OS like 14.04, basically at the top of our test matrix. Finding it again now is curious.

I’d worry if something else might be interferring with the lock. My first question is if (maybe privately), we could see your known_hosts file?

So we’re not quite out of that territory yet with host key checking on, but I’m still curious about why it may still be doing that.

There may be a slim chance you’re actually using an older ansible version, or they are hashed weirdly for some reason.

I’ll assume this is happening with “-c ssh”.

(I’d also be curious if this happens on the development branch, but I don’t anticipate any changes there)

(B)

On the second question, I’m expecting these 10 hosts are consistently doing that between runs, as in the same hosts?

Can I get the result of an /usr/bin/ansible hostname -m ping -vvvv -c ssh against one of them?

That will engage SSH debug mode and tell us a little more about what may be up.

They could actually be down, but I’m guessing you checked that. That being returned extraneously is not expected.

It could also be that ansible_ssh_port or something needs to be set in inventory or whatever, and it’s not normally set, firewall issues, or things like that?

Let’s start with the “-vvvv” part.

Thanks!

"With it commented, no failures, I’m able to communicate with all servers. "

This part is a little interesting.

Turning off host checking and going slow you can talk to all your hosts. Going fast you cannot?

(If this is repeatable, I wonder if maybe you have an SSH jumphost configured that might be getting overwhelmed? Or perhaps something similar on the network?)

Can I ask what --forks is set to?

It is not consistent each attempt as far as which hosts fail. On one attempt a server will fail, on the next attempt the same server will not fail and if I attempt to gather facts manually after it failed, it is able to gather facts successfully each time.

But here is a host that failed on the last attempt:

/usr/bin/ansible server74.prod.domain -m ping -vvvv -c ssh
<server74.prod.domain> ESTABLISH CONNECTION FOR USER: bmorriso
<server74.prod.domain> REMOTE_MODULE ping
<server74.prod.domain> EXEC [‘ssh’, ‘-C’, ‘-vvv’, ‘-o’, ‘ControlMaster=auto’, ‘-o’, ‘ControlPersist=5m’, ‘-o’, ‘ControlPath=/home/bmorriso/.ansible/cp/ansible-ssh-%h-%p-%r’, ‘-o’, ‘StrictHostKeyChecking=no’, ‘-o’, ‘Port=3422’, ‘-o’, ‘KbdInteractiveAuthentication=no’, ‘-o’, ‘PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey’, ‘-o’, ‘PasswordAuthentication=no’, ‘-o’, ‘ConnectTimeout=10’, ‘server74.prod.domain’, u"/bin/sh -c ‘LANG=C LC_CTYPE=C /usr/bin/python’"]
server74.prod.domain | success >> {
“changed”: false,
“ping”: “pong”
}

–forks was set to 50 and I saw ~35 hosts fail
–forks set to 25, only 5 failed and it ran in 47s
–forks set to 15, none failed and it ran in 53s
–forks set to 20, none failed and it ran in 45s

The above are all with host checking off.

Here is another “twist”. With --forks passed, if the fact gathering doesn’t fail, a task will, and like fact gathering before, it’s never the same task that fails.

Task fails with: “ssh connection closed waiting for sudo or su password prompt”

With host checking on

–forks 25 = 10m
–forks 50 = 10m

FWIW, If I set forks: 50 in /etc/ansible/ansible.cfg – it still acts as if it is set to 5, only when I pass --forks 50 in the command does it actually seem to run at 50.

Also, no “old” version of Ansible

which ansible
/usr/bin/ansible

/usr/bin/ansible --version
ansible 1.7.1

Hope this helps, but fear it may add to the confusion.

It is not consistent each attempt as far as which hosts fail. On one attempt a server will fail, on the next attempt the same server will not fail and if I attempt to gather facts manually after it failed, it is able to gather facts successfully each time.

But here is a host that failed on the last attempt:

/usr/bin/ansible server74.prod.domain -m ping -vvvv -c ssh
<server74.prod.domain> ESTABLISH CONNECTION FOR USER: bmorriso
<server74.prod.domain> REMOTE_MODULE ping
<server74.prod.domain> EXEC [‘ssh’, ‘-C’, ‘-vvv’, ‘-o’, ‘ControlMaster=auto’, ‘-o’, ‘ControlPersist=5m’, ‘-o’, ‘ControlPath=/home/bmorriso/.ansible/cp/ansible-ssh-%h-%p-%r’, ‘-o’, ‘StrictHostKeyChecking=no’, ‘-o’, ‘Port=3422’, ‘-o’, ‘KbdInteractiveAuthentication=no’, ‘-o’, ‘PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey’, ‘-o’, ‘PasswordAuthentication=no’, ‘-o’, ‘ConnectTimeout=10’, ‘server74.prod.domain’, u"/bin/sh -c ‘LANG=C LC_CTYPE=C /usr/bin/python’"]
server74.prod.domain | success >> {
“changed”: false,
“ping”: “pong”
}

–forks was set to 50 and I saw ~35 hosts fail
–forks set to 25, only 5 failed and it ran in 47s
–forks set to 15, none failed and it ran in 53s
–forks set to 20, none failed and it ran in 45s

The above are all with host checking off.

Here is another “twist”. With --forks passed, if the fact gathering doesn’t fail, a task will, and like fact gathering before, it’s never the same task that fails.

Task fails with: “ssh connection closed waiting for sudo or su password prompt”

With host checking on

–forks 25 = 10m
–forks 50 = 10m

FWIW, If I set forks: 50 in /etc/ansible/ansible.cfg – it still acts as if it is set to 5, only when I pass --forks 50 in the command does it actually seem to run at 50.

Also, no “old” version of Ansible

which ansible
/usr/bin/ansible

/usr/bin/ansible --version
ansible 1.7.1

Hope this helps, but fear it may add to the confusion.

I should add, I am able to SSH into these hosts w/o being prompted to accept a key. I am able to hit these servers with Fabric, again w/o any prompting for a key.

It is not consistent each attempt as far as which hosts fail. On one
attempt a server will fail, on the next attempt the same server will not
fail and if I attempt to gather facts manually after it failed, it is able
to gather facts successfully each time.

But here is a host that failed on the last attempt:

/usr/bin/ansible server74.prod.domain -m ping -vvvv -c ssh
<server74.prod.domain> ESTABLISH CONNECTION FOR USER: bmorriso
<server74.prod.domain> REMOTE_MODULE ping
<server74.prod.domain> EXEC ['ssh', '-C', '-vvv', '-o',
'ControlMaster=auto', '-o', 'ControlPersist=5m', '-o',
'ControlPath=/home/bmorriso/.ansible/cp/ansible-ssh-%h-%p-%r', '-o',
'StrictHostKeyChecking=no', '-o', 'Port=3422', '-o',
'KbdInteractiveAuthentication=no', '-o',
'PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey',
'-o', 'PasswordAuthentication=no', '-o', 'ConnectTimeout=10',
'server74.prod.domain', u"/bin/sh -c 'LANG=C LC_CTYPE=C /usr/bin/python'"]
server74.prod.domain | success >> {
    "changed": false,
    "ping": "pong"
}

Ok so this one is successful and SSH debug levels are not helping. I'm
going to need to see one that fails, unfortunately. That may need a
capture from the long run....

--forks was set to 50 and I saw ~35 hosts fail
--forks set to 25, only 5 failed and it ran in 47s
--forks set to 15, none failed and it ran in 53s
--forks set to 20, none failed and it ran in 45s

The above are all with host checking off.

Here is another "twist". With --forks passed, if the fact gathering
doesn't fail, a task will, and like fact gathering before, it's never the
same task that fails.

This feels to me like there may be some problem keeping ControlPersist
sockets open.

One thing to note is they do typically consume about ~1MB per host, though
at -f 50 this shouldn't be a problem.

Also that version of Ubuntu should be perfectly fine.

I've occasionally heard of issues with network hardware in the way - a
particularly badly misconfigured switch clamping things down.

In this particular case, once discovered, the user was soon managing
thousands of nodes at a time.

Though it's hard to say. More digging is definitely required.

Task fails with: "ssh connection closed waiting for sudo or su password
prompt"

With host checking on

--forks 25 = 10m
--forks 50 = 10m

FWIW, If I set forks: 50 in /etc/ansible/ansible.cfg -- it still acts as
if it is set to 5, only when I pass --forks 50 in the command does it
actually seem to run at 50.

This is curious. Possibly a permissions issue on ansible.cfg keeping it
from being read, or the value out of the right section.
[defaults] vs [default] or something is possible.

If you can email us the file, I'd be interested in seeing it.

Again, also interested in your known_hosts to try to see if we can tell why
it might not be detecting that your host is in the file.

That SSH is asking shows it's there, but for some reason Ansible is
thinking it may need to ask you.

Again, about 65-75% of our users are using these default options vs
paramiko - and haven't heard this reported recently - so hope to get to
the bottom of this.

Help with the above questions and info would be greatly appreciated!

Support Request #2904 has known_hosts file attached to it

Hopefully this is the pertinent part from failed facts gathering:

<server1020.prod.domain> ESTABLISH CONNECTION FOR USER: bmorriso
<server1020.prod.domain> REMOTE_MODULE setup CHECKMODE=True
<server1020.prod.domain> EXEC [‘ssh’, ‘-C’, ‘-vvv’, ‘-o’, ‘ControlMaster=auto’, ‘-o’, ‘ControlPersist=5m’, ‘-o’, ‘ControlPath=/home/bmorriso/.ansible/cp/ansible-ssh-%h-%p-%r’, ‘-o’, ‘StrictHostKeyChecking=no’, ‘-o’, ‘Port=3422’, ‘-o’, ‘KbdInteractiveAuthentication=no’, ‘-o’, ‘PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey’, ‘-o’, ‘PasswordAuthentication=no’, ‘-o’, ‘ConnectTimeout=10’, ‘server1020.prod.domain’, u"/bin/sh -c ‘LANG=C LC_CTYPE=C /usr/bin/python’“]
ok: [server88.prod.domain]
<server3033.prod.domain> ESTABLISH CONNECTION FOR USER: bmorriso
<server3033.prod.domain> REMOTE_MODULE setup CHECKMODE=True
<server3033.prod.domain> EXEC [‘ssh’, ‘-C’, ‘-vvv’, ‘-o’, ‘ControlMaster=auto’, ‘-o’, ‘ControlPersist=5m’, ‘-o’, ‘ControlPath=/home/bmorriso/.ansible/cp/ansible-ssh-%h-%p-%r’, ‘-o’, ‘StrictHostKeyChecking=no’, ‘-o’, ‘Port=3422’, ‘-o’, ‘KbdInteractiveAuthentication=no’, ‘-o’, ‘PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey’, ‘-o’, ‘PasswordAuthentication=no’, ‘-o’, ‘ConnectTimeout=10’, ‘server3033.prod.domain’, u”/bin/sh -c ‘LANG=C LC_CTYPE=C /usr/bin/python’“]
fatal: [server4214.prod.domain] => SSH Error: data could not be sent to the remote host. Make sure this host can be reached over ssh
<server1028.prod.domain> ESTABLISH CONNECTION FOR USER: bmorriso
<server1028.prod.domain> REMOTE_MODULE setup CHECKMODE=True
<server1028.prod.domain> EXEC [‘ssh’, ‘-C’, ‘-vvv’, ‘-o’, ‘ControlMaster=auto’, ‘-o’, ‘ControlPersist=5m’, ‘-o’, ‘ControlPath=/home/bmorriso/.ansible/cp/ansible-ssh-%h-%p-%r’, ‘-o’, ‘StrictHostKeyChecking=no’, ‘-o’, ‘Port=3422’, ‘-o’, ‘KbdInteractiveAuthentication=no’, ‘-o’, ‘PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey’, ‘-o’, ‘PasswordAuthentication=no’, ‘-o’, ‘ConnectTimeout=10’, ‘server1028.prod.domain’, u”/bin/sh -c ‘LANG=C LC_CTYPE=C /usr/bin/python’"]
fatal: [server1020.prod.domain] => SSH Error: data could not be sent to the remote host. Make sure this host can be reached over ssh

specifically server1020.prod.domain

And I was trying to get the tasks to fail as they had many times earlier – nothing failed. Everything worked as expected and completed in 59s. I’m not convinced its fixed, but it’s behaving. I’ll poke at it in the AM. It was too easy to reproduce earlier.

Hmm, curious.

Yeah there’s not really any extra SSH debug detail in the above.

The error in question occurs in two places - one when the pipe slams shut for no good reason, and another when ssh exists with error 255 (aka unknown error).

We’re still looking into the known hosts awareness question.

Hi Barry,

One thing I did notice when testing your configuration was that, with my default ulimit settings, large -f settings were causing similar tracebacks and failures. In my case setting ulimit -u 4096 (may also have to do ulimit -f 4096) resolved that issue. I noticed this when using the “ansible” command vs. “ansible-playbook”, the later of which may have been hidding the underlying issue.

I also meant to ask if you could do a simple ansible -m ping all test (before and after changing the ulimit settings), to see if you still see slowness with that simple test or if it is directly related to fact-gathering.

Thanks!