Issues with playbook run getting stuck on gathering facts

I've been using ansible on my current project for a few months now. I have a playbook that pushes out configs to a cluster of about 200 servers every 30 minutes. It had been going well until recently, when it started hanging indefinitely while gathering facts.
Additional -vvv arguments have not provided any extra insight. The issue is reproducible every single time.

I tried a different playbook that I run less frequently and got the following error on multiple hosts.

SSH Error: data could not be sent to the remote host. Make sure this host can be reached over ssh.

Of course, I can ssh into these hosts from my control machine with no issues. Any help is appreciated.

Thanks!

When I Ctrl-C out of the playbook while it's hanging on gathering facts, I get the following traceback.

Traceback (most recent call last):
  File "/usr/bin/ansible-playbook", line 324, in <module>
    sys.exit(main(sys.argv[1:]))
  File "/usr/bin/ansible-playbook", line 264, in main
    pb.run()
  File "/usr/lib/python2.6/site-packages/ansible/playbook/__init__.py", line 348, in run
    if not self._run_play(play):
  File "/usr/lib/python2.6/site-packages/ansible/playbook/__init__.py", line 739, in _run_play
    self._do_setup_step(play)
  File "/usr/lib/python2.6/site-packages/ansible/playbook/__init__.py", line 629, in _do_setup_step
    accelerate_port=play.accelerate_port,
  File "/usr/lib/python2.6/site-packages/ansible/runner/__init__.py", line 1485, in run
    results = self._parallel_exec(hosts)
  File "/usr/lib/python2.6/site-packages/ansible/runner/__init__.py", line 1406, in _parallel_exec
    while not result_queue.empty():
  File "<string>", line 2, in empty
  File "/usr/lib64/python2.6/multiprocessing/managers.py", line 726, in _callmethod
    kind, result = conn.recv()
IOError: [Errno 104] Connection reset by peer

Can you check on the hosts that it gets 'stuck on'? See what the ansible process is doing. Normally this only happens when the box has resource issues or some device is causing the hang (fact gathering checks devices and /proc), especially network file systems.
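A minimal sketch of that check, assuming Linux targets (the process pattern and mount types are illustrative, and the strace step needs a real PID):

```shell
# on a host the play gets 'stuck on', find the leftover fact-gathering process
ps aux | grep '[a]nsible' || echo "no leftover ansible processes"
# attach to a suspect PID to see which syscall it is blocked in:
# strace -p <PID>
# fact gathering walks devices and mounts, so list any network filesystems;
# a dead NFS mount is a classic cause of this hang (df hangs on them too)
mount -t nfs,nfs4
df -h
```

A setup process stuck in an uninterruptible read on a dead mount shows up in `ps` with state `D` and in strace as a `read()` or `stat()` that never returns.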

It was getting stuck on nodes pretty much at random, in no particular order, and nothing interesting seemed to be going on on the nodes in question.
At any rate, I decided to change my inventory files from using host names to fully qualified domain names, and everything worked fine afterwards, so maybe there is some funny DNS business going on within our internal network. Thanks for the feedback!

Sadly, not out of the woods yet. I'm now seeing a "Resource temporarily unavailable" error on subsequent ansible runs. As mentioned, I have the playbook scheduled to run every 30 minutes against about 200 servers, and I have the forks value set to about 200 as well; I'm wondering if that is high for ansible? I added about 16 servers a week ago, and that was the only real change.

  File "/usr/lib/python2.6/site-packages/ansible/runner/__init__.py", line 582, in _executor
    exec_rc = self._executor_internal(host, new_stdin)
  File "/usr/lib/python2.6/site-packages/ansible/runner/__init__.py", line 785, in _executor_internal
    return self._executor_internal_inner(host, self.module_name, self.module_args, inject, port, complex_args=complex_args)
  File "/usr/lib/python2.6/site-packages/ansible/runner/__init__.py", line 1032, in _executor_internal_inner
    result = handler.run(conn, tmp, module_name, module_args, inject, complex_args)
  File "/usr/lib/python2.6/site-packages/ansible/runner/action_plugins/normal.py", line 57, in run
    return self.runner._execute_module(conn, tmp, module_name, module_args, inject=inject, complex_args=complex_args)
  File "/usr/lib/python2.6/site-packages/ansible/runner/__init__.py", line 547, in _execute_module
    res = self._low_level_exec_command(conn, cmd, tmp, become=self.become, sudoable=sudoable, in_data=in_data)
  File "/usr/lib/python2.6/site-packages/ansible/runner/__init__.py", line 1169, in _low_level_exec_command
    in_data=in_data)
  File "/usr/lib/python2.6/site-packages/ansible/runner/connection_plugins/ssh.py", line 306, in exec_command
    (p, stdin) = self._run(ssh_cmd, in_data)
  File "/usr/lib/python2.6/site-packages/ansible/runner/connection_plugins/ssh.py", line 111, in _run
    stdout=subprocess.PIPE, stderr=subprocess.PIPE)
  File "/usr/lib64/python2.6/subprocess.py", line 642, in __init__
    errread, errwrite)
  File "/usr/lib64/python2.6/subprocess.py", line 1144, in _execute_child
    self.pid = os.fork()
OSError: [Errno 11] Resource temporarily unavailable

Whether it is 'high' or not depends on the resources available on the box you run ansible from. The good news is that it won't run 200 forks unless it needs to: it will use that number or the total number of hosts, whichever is lower. The bad news is that the extra 16 forks seem to consume more resources than you have available; I would lower the number of forks to adjust for that.
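For reference, that ceiling lives in ansible.cfg on the control machine; a sketch of lowering it (the value 100 is just an example to size against the box's memory and process limits):

```ini
# ansible.cfg (sketch)
[defaults]
# upper bound on parallel worker processes; ansible effectively uses
# min(forks, number of targeted hosts), so the error above (os.fork
# failing with EAGAIN) means this bound exceeded what the box could fork
forks = 100
```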

So I lowered the number of forks from 200 to 150, and I'm back to hanging on gathering facts. I was able to isolate the issue to one server. I tried to ssh to that server and got the following error: Connection to 10.20.41.155 timed out while waiting to read.

I thought ansible would just skip this server and time out if it's having issues connecting. I checked the ansible.cfg file and noticed the ssh timeout is:

# SSH timeout
timeout = 10

However, the additional arguments for ssh are:

# Leaving off ControlPersist will result in poor performance, so use
# paramiko on older platforms rather than removing it
ssh_args = -o ControlMaster=auto -o ControlPersist=30m

Thanks.

I don't believe the ssh_args above make much difference, as I commented them out and saw the same issue even with the default of 60s.

So I rebooted the misbehaving server, and the playbook now runs fine against everything. It doesn't seem like the FQDN change I mentioned in my original "resolution" made any real difference, as I reverted back to plain host names with no issues. I still believe ansible should have skipped the misbehaving server if it couldn't connect to it properly.

If ansible could not connect, it would not get stuck; getting stuck means it connected and was partially working.

Let's assume that was the case; isn't there a timeout period after which ansible or ssh should give up, via some setting in the config?

I'm hoping the new "strategies" option in 2.0 at least helps the other servers continue running tasks, though the playbook would still be hung on this one server if there's no timeout to kill it off.

http://docs.ansible.com/ansible/playbooks_strategies.html
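For anyone following along, the 2.0 behaviour I'm hoping for can be sketched like this (the play and task names are made up; only the strategy line is the point):

```yaml
# site.yml (sketch): with the 'free' strategy each host runs through the
# task list at its own pace, so one wedged host no longer gates the batch
- hosts: cluster
  strategy: free     # default is 'linear', which waits on every host per task
  tasks:
    - name: push config
      template:
        src: app.conf.j2
        dest: /etc/app/app.conf
```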

Thanks.

Also, as I mentioned previously, I couldn't even ssh into the server; I got a timeout error. So I'm not sure how ansible could connect when I couldn't get in manually.

ssh and tcp will give up if the keepalive stops working within that timeout; if the keepalive still works (network not an issue) but the process is stuck on the other side, it will continue.

There is no way for ssh to know whether the process is supposed to take a long time or is stuck; it will only time out if the network is an issue. Most probably the machine was running out of resources or stuck in a blocking request; the keepalive would still work, as it is minimal effort on the network stack side, even if the machine became unresponsive to other network requests.
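One hedged workaround along these lines: OpenSSH's ServerAlive probes run inside the encrypted channel, so unlike kernel-level TCP keepalives they go unanswered when the remote sshd itself is wedged, and the client then drops the connection after roughly interval × count seconds. A sketch for ansible.cfg (the values are illustrative):

```ini
# ansible.cfg (sketch)
[ssh_connection]
# keep ControlPersist for performance; add application-level keepalives so a
# wedged remote end is abandoned after ~60s (15s x 4) instead of hanging forever
ssh_args = -o ControlMaster=auto -o ControlPersist=30m -o ServerAliveInterval=15 -o ServerAliveCountMax=4
```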

Gotcha. I suppose this explains why ansible ping worked. I'm curious how ansible ping works under the hood. I read the python code, but it's not super clear how it makes the connection and what mechanism it uses, so that one could try to replicate it manually; it doesn't appear to be plain ssh either, since that timed out for the server in question. Not a huge deal for now. Based on what you said, I'm guessing ping just does a basic TCPKeepAlive check. The documentation for ping says it also verifies a usable python version.
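For what it's worth, ping is not ICMP or a raw keepalive: it is a regular module, pushed over the configured connection plugin (ssh by default) and executed by the remote python, which simply answers "pong"; that is also how it verifies a usable interpreter. Stripped to its essence, the remote side does roughly this (a local sketch, not the actual module source; python3 is used here just for the demo):

```shell
# what lands on the target is a small python script run by the remote
# interpreter; executing the equivalent locally:
python3 - <<'EOF'
import json
# the real module also echoes back an optional 'data' parameter
print(json.dumps({"ping": "pong"}))
EOF
```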

Everything seems OK so far. The only other issue I've noticed, which has been around for a few weeks, is the error below. Interestingly, to fix it all I have to do is ssh into the server in question manually from the ansible control machine, but the issue pops up again a few days later. This happens on a handful of random machines. Any ideas?

2015-10-29 09:30:07,404 p=10292 u=ansible | failed: [hostname.com] => {"failed": true, "parsed": false}
2015-10-29 09:30:07,404 p=10292 u=ansible | OpenSSH_5.3p1, OpenSSL 1.0.1e-fips 11 Feb 2013
debug1: Reading configuration data /home/ansible/.ssh/config
debug1: Applying options for *
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Applying options for *
debug1: auto-mux: Trying existing master
debug1: mux_client_request_session: master session id: 2
[sudo via ansible, key=blah] password: Sorry, try again.
[sudo via ansible, key=blah] password: Sorry, try again.
[sudo via ansible, key=blah] password: Sorry, try again.
sudo: 3 incorrect password attempts
debug1: mux_client_request_session: master session id: 2
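The auto-mux/mux_client lines show the session is being reused from a cached ControlMaster socket, which would fit "ssh-ing in manually fixes it": a fresh login replaces the stale master. A hedged way to clear such sockets by hand, assuming the default control_path under ~/.ansible/cp (the host name is illustrative):

```shell
# close a specific cached master cleanly:
# ssh -O exit -o ControlPath="$HOME/.ansible/cp/ansible-ssh-%h-%p-%r" hostname.com
# or simply remove the cached sockets; the next run reconnects from scratch
rm -f "$HOME/.ansible/cp"/ansible-ssh-*
```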

Thanks for your feedback as always!

I get the following with the -vvv option:

ESTABLISH CONNECTION FOR USER: ansible
REMOTE_MODULE setup
EXEC ssh -C -v -o ControlMaster=auto -o ControlPersist=30m -o ControlPath="/home/ansible/.ansible/cp/ansible-ssh-%h-%p-%r" -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 /bin/sh -c 'sudo -k && sudo -H -S -p "[sudo via ansible, key=blah] password: " -u root /bin/sh -c '"'"'echo BECOME-SUCCESS-koezcltlilaszzugkwutdiyfatzvtunk; LANG=C LC_CTYPE=C /usr/bin/python'"'"''

So I disabled OpenSSH and went with paramiko for the ssh connection, and that seems to take care of this particular situation for now. I'll continue to observe.
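For anyone landing here later, the transport can be switched per run with `ansible-playbook site.yml -c paramiko` (playbook name illustrative) or globally; a sketch of the latter:

```ini
# ansible.cfg (sketch)
[defaults]
# paramiko is a pure-python ssh client; it skips OpenSSH's ControlMaster
# multiplexing entirely, which is why it sidesteps the stale-socket issue,
# at some cost in connection speed on large runs
transport = paramiko
```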