Gathering Facts "hanging" for a single host

I have a host that hangs at fact gathering whenever I run a playbook against it:

ansible-playbook -i test playbooks/get_release_versions/site.yml -f 10 -vvvv

PLAY [Grabs all release versions for auditing purposes] ***********************

GATHERING FACTS ***************************************************************

ESTABLISH CONNECTION FOR USER: ddecker

EXEC ['ssh', '-tt', '-vvv', '-o', 'ControlMaster=auto', '-o', 'ControlPersist=60s', '-o', 'ControlPath=/root/.ansible/cp/ansible-ssh-%h-%p-%r', '-o', 'Port=22', '-o', 'KbdInteractiveAuthentication=no', '-o', 'PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey', '-o', 'PasswordAuthentication=no', '-o', 'User=ddecker', '-o', 'ConnectTimeout=10', 'testhost01', "/bin/sh -c 'mkdir -p $HOME/.ansible/tmp/ansible-1386355046.1-188203869680440 && chmod a+rx $HOME/.ansible/tmp/ansible-1386355046.1-188203869680440 && echo $HOME/.ansible/tmp/ansible-1386355046.1-188203869680440'"]

REMOTE_MODULE setup

PUT /tmp/tmpEvKhjU TO /home/ddecker/.ansible/tmp/ansible-1386355046.1-188203869680440/setup

EXEC ['ssh', '-tt', '-vvv', '-o', 'ControlMaster=auto', '-o', 'ControlPersist=60s', '-o', 'ControlPath=/root/.ansible/cp/ansible-ssh-%h-%p-%r', '-o', 'Port=22', '-o', 'KbdInteractiveAuthentication=no', '-o', 'PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey', '-o', 'PasswordAuthentication=no', '-o', 'User=ddecker', '-o', 'ConnectTimeout=10', 'testhost01', '/bin/sh -c 'dzdo -k && dzdo -H -S -p "[sudo via ansible, key=eeszsiqmrnksgtzkolmtvmhddnepwrbh] password: " -u root /bin/sh -c '"'"'/usr/bin/python /home/ddecker/.ansible/tmp/ansible-1386355046.1-188203869680440/setup; rm -rf /home/ddecker/.ansible/tmp/ansible-1386355046.1-188203869680440/ >/dev/null 2>&1'"'"''']

^CTraceback (most recent call last):
  File "/usr/bin/ansible-playbook", line 268, in <module>
    sys.exit(main(sys.argv[1:]))
  File "/usr/bin/ansible-playbook", line 208, in main
    pb.run()
  File "/usr/lib/python2.6/site-packages/ansible/playbook/__init__.py", line 262, in run
    if not self._run_play(play):
  File "/usr/lib/python2.6/site-packages/ansible/playbook/__init__.py", line 505, in _run_play
    self._do_setup_step(play)
  File "/usr/lib/python2.6/site-packages/ansible/playbook/__init__.py", line 452, in _do_setup_step
    accelerate=play.accelerate, accelerate_port=play.accelerate_port,
  File "/usr/lib/python2.6/site-packages/ansible/runner/__init__.py", line 968, in run
    results = [ self._executor(h, None) for h in hosts ]
  File "/usr/lib/python2.6/site-packages/ansible/runner/__init__.py", line 382, in _executor
    exec_rc = self._executor_internal(host, new_stdin)
  File "/usr/lib/python2.6/site-packages/ansible/runner/__init__.py", line 471, in _executor_internal
    return self._executor_internal_inner(host, self.module_name, self.module_args, inject, port, complex_args=complex_args)
  File "/usr/lib/python2.6/site-packages/ansible/runner/__init__.py", line 659, in _executor_internal_inner
    result = handler.run(conn, tmp, module_name, module_args, inject, complex_args)
  File "/usr/lib/python2.6/site-packages/ansible/runner/action_plugins/normal.py", line 54, in run
    return self.runner._execute_module(conn, tmp, module_name, module_args, inject=inject, complex_args=complex_args)
  File "/usr/lib/python2.6/site-packages/ansible/runner/__init__.py", line 348, in _execute_module
    res = self._low_level_exec_command(conn, cmd, tmp, sudoable=sudoable)
  File "/usr/lib/python2.6/site-packages/ansible/runner/__init__.py", line 708, in _low_level_exec_command
    rc, stdin, stdout, stderr = conn.exec_command(cmd, tmp, sudo_user, sudoable=sudoable, executable=executable)
  File "/usr/lib/python2.6/site-packages/ansible/runner/connection_plugins/ssh.py", line 219, in exec_command
    rfd, wfd, efd = select.select([p.stdout, p.stderr], [], [p.stdout, p.stderr], 1)
KeyboardInterrupt

For the above, I waited more than 10 minutes and it was still hanging. If I remove the host from the inventory, the play proceeds and completes on all other hosts. Additionally, if I add gather_facts: False to my site.yml (for the playbook), it completes with no issue:

ansible-playbook -i test playbooks/get_release_versions/site.yml -f 10

PLAY [Grabs all release versions for auditing purposes] ***********************

TASK: [Get release version] ***************************************************

changed: [testhost01]

PLAY RECAP ********************************************************************

testhost01 : ok=1 changed=1 unreachable=0 failed=0

Does anyone know why this happens, or what I can do to debug it further and find the cause? When I run ansible all -u ddecker -m setup, it processes all hosts except this one, so the problem is specific to this host and to something that fact gathering requests from it.

Thanks,
Drew

Let’s narrow this down further.

1) Run ansible-playbook with ANSIBLE_KEEP_REMOTE_FILES=1 and -vvvv.
2) Check the debug output and find the file sent to the remote host whose filename is “setup”.
3) Go to the remote machine and run the script by hand with python.

If the script hangs there, then we know that the setup script is having an issue; if not, there may be a connection issue.

Yup - it hangs on the client:

OS and ansible version?

If you press Ctrl-C there, what do you get in the traceback when running locally?

Ctrl-C also just hangs there (as you can see from the output in my previous comment), so there is no traceback at all. If you mean the traceback on the machine where Ansible runs (where there is no output or failure until I interrupt it), it shows:

Traceback (most recent call last):
  File "/usr/bin/ansible-playbook", line 268, in <module>
    sys.exit(main(sys.argv[1:]))
  File "/usr/bin/ansible-playbook", line 208, in main
    pb.run()
  File "/usr/lib/python2.6/site-packages/ansible/playbook/__init__.py", line 262, in run
    if not self._run_play(play):
  File "/usr/lib/python2.6/site-packages/ansible/playbook/__init__.py", line 505, in _run_play
    self._do_setup_step(play)
  File "/usr/lib/python2.6/site-packages/ansible/playbook/__init__.py", line 452, in _do_setup_step
    accelerate=play.accelerate, accelerate_port=play.accelerate_port,
  File "/usr/lib/python2.6/site-packages/ansible/runner/__init__.py", line 968, in run
    results = [ self._executor(h, None) for h in hosts ]
  File "/usr/lib/python2.6/site-packages/ansible/runner/__init__.py", line 382, in _executor
    exec_rc = self._executor_internal(host, new_stdin)
  File "/usr/lib/python2.6/site-packages/ansible/runner/__init__.py", line 471, in _executor_internal
    return self._executor_internal_inner(host, self.module_name, self.module_args, inject, port, complex_args=complex_args)
  File "/usr/lib/python2.6/site-packages/ansible/runner/__init__.py", line 659, in _executor_internal_inner
    result = handler.run(conn, tmp, module_name, module_args, inject, complex_args)
  File "/usr/lib/python2.6/site-packages/ansible/runner/action_plugins/normal.py", line 54, in run
    return self.runner._execute_module(conn, tmp, module_name, module_args, inject=inject, complex_args=complex_args)
  File "/usr/lib/python2.6/site-packages/ansible/runner/__init__.py", line 348, in _execute_module
    res = self._low_level_exec_command(conn, cmd, tmp, sudoable=sudoable)
  File "/usr/lib/python2.6/site-packages/ansible/runner/__init__.py", line 708, in _low_level_exec_command
    rc, stdin, stdout, stderr = conn.exec_command(cmd, tmp, sudo_user, sudoable=sudoable, executable=executable)
  File "/usr/lib/python2.6/site-packages/ansible/runner/connection_plugins/ssh.py", line 219, in exec_command
    rfd, wfd, efd = select.select([p.stdout, p.stderr], [], [p.stdout, p.stderr], 1)
KeyboardInterrupt

OS (of Client): RHEL 6.4 x86_64
Ansible Version: 1.3.4

Thanks!

This looks like accelerate not supporting sudo with password just yet?

I agree it should error at least…

– Michael

No, I think one of the fact functions is stuck waiting for input. We just need to narrow it down.

Drew, in the setup script on the remote system there is an __init__ function that calls all the major groups of facts ...

     def __init__(self):
         self.facts = {}
         self.get_platform_facts()
         self.get_distribution_facts()
         self.get_cmdline()
         self.get_public_ssh_host_keys()
         self.get_selinux_facts()
         self.get_pkg_mgr_facts()
         self.get_lsb_facts()
         self.get_date_time_facts()
         self.get_user_facts()
         self.get_local_facts()
         self.get_env_facts()

Comment out all of those self.get_* calls and see if the script completes. If it does, then uncomment the calls one by one until the script hangs again.
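
The same bisection can be done without editing the calls in and out by hand. The following is only a rough sketch of the idea, not code from the Ansible setup module: run_steps and the stand-in steps are hypothetical names. It writes and flushes a START line before each call, so if a call never returns, the last START line in the log names it.

```python
import sys
import time


def run_steps(steps, log=sys.stderr):
    """Call each (name, func) pair in order, logging before and after."""
    for name, func in steps:
        log.write("START %s\n" % name)
        log.flush()  # flush first so the line is visible even if func hangs
        started = time.monotonic()
        func()
        log.write("DONE %s in %.3fs\n" % (name, time.monotonic() - started))
        log.flush()


if __name__ == "__main__":
    # Hypothetical stand-ins for the real self.get_* methods.
    run_steps([
        ("get_platform_facts", lambda: None),
        ("get_date_time_facts", lambda: time.sleep(0.1)),
    ])
```

In the real script the list would hold the bound self.get_* methods in their original order.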

Guys,

It still hangs, however, I tried running it with Python’s trace options:

$ python -m trace --trace ./setup

re.py(142): return _compile(pattern, flags).search(string)
 --- modulename: re, funcname: _compile
re.py(231): cachekey = (type(key[0]),) + key
re.py(232): p = _cache.get(cachekey)
re.py(233): if p is not None:
re.py(234): return p
setup(696): if m:
setup(694): for folder in os.listdir(sysdir):
setup(695): m = re.search("(" + diskname + "\d+)", folder)
 --- modulename: re, funcname: search
re.py(142): return _compile(pattern, flags).search(string)
 --- modulename: re, funcname: _compile
re.py(231): cachekey = (type(key[0]),) + key
re.py(232): p = _cache.get(cachekey)
re.py(233): if p is not None:
re.py(234): return p
setup(696): if m:
setup(694): for folder in os.listdir(sysdir):
setup(707): d['rotational'] = get_file_content(sysdir + "/queue/rotational")
 --- modulename: setup, funcname: get_file_content
setup(2064): data = default
setup(2065): if os.path.exists(path) and os.access(path, os.R_OK):
 --- modulename: genericpath, funcname: exists
genericpath.py(17): try:
genericpath.py(18): st = os.stat(path)
genericpath.py(19): except os.error:
genericpath.py(20): return False
setup(2069): return data
setup(708): d['scheduler_mode'] = ""
setup(709): scheduler = get_file_content(sysdir + "/queue/scheduler")
 --- modulename: setup, funcname: get_file_content
setup(2064): data = default
setup(2065): if os.path.exists(path) and os.access(path, os.R_OK):
 --- modulename: genericpath, funcname: exists
genericpath.py(17): try:
genericpath.py(18): st = os.stat(path)
genericpath.py(21): return True
setup(2066): data = open(path).read().strip()

Let me know if I can offer any more tests that may help with debugging this issue.

Thanks
Drew

Is anyone able to give me some pointers that would let me debug this issue further?

It seems that python is hanging when trying to read the scheduler file for one of your disks. Add a print statement before line 2066 …

print "PATH:", path
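
An alternative to patching the setup script is to probe each scheduler file from a separate process, so that a blocked read does not take the probing script down with it. This is a minimal sketch under some assumptions: probe_read is a hypothetical helper (not part of Ansible), it targets Python 3, and it assumes the files are small enough to fit in a pipe buffer, which holds for sysfs attributes. A child stuck in an uninterruptible read may not actually die when killed, so the parent deliberately avoids waiting on it.

```python
import glob
import subprocess
import sys
import time

# Child program: read the file named in argv[1] and echo it to stdout.
_READER = "import sys; sys.stdout.write(open(sys.argv[1]).read())"


def probe_read(path, timeout=5.0):
    """Read `path` in a child process; return (status, text).

    status is "ok", "error" or "timeout". Assumes small files.
    """
    child = subprocess.Popen(
        [sys.executable, "-c", _READER, path],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        universal_newlines=True,
    )
    deadline = time.monotonic() + timeout
    # Poll instead of wait(): a child stuck in uninterruptible sleep
    # may ignore kill(), and waiting on it would hang the parent too.
    while child.poll() is None:
        if time.monotonic() >= deadline:
            child.kill()
            child.stdout.close()
            return "timeout", None
        time.sleep(0.05)
    text = child.stdout.read()
    child.stdout.close()
    if child.returncode != 0:
        return "error", None
    return "ok", text.strip()


if __name__ == "__main__":
    for path in sorted(glob.glob("/sys/block/*/queue/scheduler")):
        status, text = probe_read(path)
        print("%-8s %s %s" % (status, path, text or ""))
```

Any path reported as "timeout" is a candidate for the read that the setup script is blocking on.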

Here’s where it dies:

PATH: /sys/block/…/devices/pci0000:00/0000:00:1d.7/usb2/2-2/2-2:1.1/host14/target14:0:0/14:0:0:0/block/sr0/queue/rotational
PATH: /sys/block/…/devices/pci0000:00/0000:00:1d.7/usb2/2-2/2-2:1.1/host14/target14:0:0/14:0:0:0/block/sr0/queue/scheduler
PATH: /sys/block/…/devices/pci0000:00/0000:00:1d.7/usb2/2-2/2-2:1.1/host14/target14:0:0/14:0:0:0/block/sr0/size
PATH: /sys/block/…/devices/pci0000:00/0000:00:1d.7/usb2/2-2/2-2:1.1/host14/target14:0:0/14:0:0:0/block/sr0/queue/hw_sector_size
PATH: /sys/block/…/devices/pci0000:00/0000:00:07.0/0000:0e:00.1/host5/rport-5:0-0/target5:0:0/5:0:0:2/block/sdl/removable
PATH: /sys/block/…/devices/pci0000:00/0000:00:07.0/0000:0e:00.1/host5/rport-5:0-0/target5:0:0/5:0:0:2/block/sdl/queue/scheduler
^C^C^C^C^C^C^C^C^C^C^C^C <------- Dies right after the previous line ——

Are you able to cat that file?

jtanner@u1304:~$ cat /sys/block/…/devices/pci0000:00/0000:00:06.0/virtio2/block/vda/queue/scheduler
noop deadline [cfq]

Nope - can’t cat it at all. I can ls it, and see the following:

ls -l /sys/block/…/devices/pci0000:00/0000:00:07.0/0000:0e:00.1/host5/rport-5:0-0/target5:0:0/5:0:0:2/block/sdl/queue/scheduler

-rw-r--r-- 1 root root 4096 Sep 11 02:26 /sys/block/…/devices/pci0000:00/0000:00:07.0/0000:0e:00.1/host5/rport-5:0-0/target5:0:0/5:0:0:2/block/sdl/queue/scheduler

Is there something I can do to map this device to a physical device to know what the cause could be?

“sdl” is the device name per the path. /dev/sdl
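
To go from the device name to something more physical, sysfs itself can help. The sketch below is illustrative only: describe_block_device is a hypothetical helper, and it assumes a SCSI-type device that exposes vendor and model attributes under /sys/block/<name>/device. It resolves the /sys/block/<name> symlink to show where the device sits on the bus and reads those two attributes. On a misbehaving device these reads could block as well, so treat it as a starting point.

```python
import os
import sys


def _read_attr(path):
    """Return the stripped contents of a small sysfs file, or None."""
    try:
        with open(path) as handle:
            return handle.read().strip()
    except (IOError, OSError):
        return None


def describe_block_device(name, sysfs_root="/sys"):
    """Collect identifying details for a block device such as 'sdl'."""
    base = os.path.join(sysfs_root, "block", name)
    return {
        "name": name,
        # /sys/block/<name> is normally a symlink into the bus topology.
        "sysfs_path": os.path.realpath(base),
        "vendor": _read_attr(os.path.join(base, "device", "vendor")),
        "model": _read_attr(os.path.join(base, "device", "model")),
    }


if __name__ == "__main__":
    # Usage: python describe.py sdl [more device names ...]
    for device in sys.argv[1:]:
        for key, value in sorted(describe_block_device(device).items()):
            print("%s: %s" % (key, value))
```

The sysfs_root parameter exists only so the function can be pointed at a test directory; on a real host the default is what you want.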

If you can’t read the scheduler file, you may have a broken device or a buggy driver. You should check your dmesg output and the syslog for any obvious errors. Beyond that, you need to work with the relevant OS and hardware vendors to sort out why things are hanging.
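
As a small aid for the dmesg check, the kernel log can be filtered down to the lines that mention the suspect device. This is just a sketch: lines_mentioning is a hypothetical helper, the device name "sdl" is taken from the path above, and reading the kernel log with dmesg may require root on some systems.

```python
import re
import subprocess


def lines_mentioning(text, device):
    """Return lines of `text` that mention `device` or one of its partitions."""
    # \d* also matches partition names such as sdl1.
    pattern = re.compile(r"\b%s\d*\b" % re.escape(device))
    return [line for line in text.splitlines() if pattern.search(line)]


if __name__ == "__main__":
    output = subprocess.check_output(["dmesg"], universal_newlines=True)
    for line in lines_mentioning(output, "sdl"):
        print(line)
```

The same function works on the contents of a syslog file read into a string.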

James,

Thanks for the input. I tested this on another system of the same product type (Dell R810) that runs the same apps, etc. (possibly with BIOS and firmware updates already installed), and the setup script ran just fine. The failing system might just need some firmware updates installed, so I’ll get on that and post results later. Thanks for helping me sort out whether this was an Ansible issue or a server issue; it clearly appears to be a server issue.

Thanks again!

This morning I applied some firmware updates and rebooted the problem server. Once the server came up, I tested the “setup” script again, and this time it ran to completion.

Thanks for the help in troubleshooting!

No problem. I learned something too …