Gathering Facts hangs when using become

Out of hundreds of hosts, we have one host that always hangs while gathering facts when using become.

Create a playbook with the following:

- hosts: all

Call the playbook with the -b flag.

The last thing reported in the log is “Escalation succeeded”.

No matter how long I wait, it never returns to the prompt, and there are two processes running on the remote host: one as my username and one as root.

There is no NFS involved (I have seen other issues where facts hang on NFS).

This is a RHEL5 host with Python 2.7 installed. About a quarter of our hosts are RHEL5 with Python 2.7 installed, and they don’t have this issue.

Has anyone seen this before? Any ideas on how to troubleshoot it further?
I have tried the highest verbosity, but there doesn’t appear to be any helpful information; I compared the output to successful runs and nothing is different.

Hi,

Problems I have already encountered:

  • stale NFS mounts
  • RPM database corruption
  • a lock on LVM

Regards,

Jy


It is almost always I/O related: mounts, LVM, filesystem...


It happens sometimes, and the easiest way (IMHO) to find where it hangs is to debug the setup module:
https://docs.ansible.com/ansible/latest/dev_guide/debugging.html#debugging-remote

Then I run the module manually as described in the link with strace to identify where it hangs.

Thank you. I ran strace, and it shows the same two lines repeating over and over:
select(7, [4 6], [], [4 6], {1, 0}) = 0 (Timeout)
wait4(29548, 0x7fff6a145c84, WNOHANG, NULL) = 0

When I checked the file descriptors passed to select, I got the following:
lsof -p 10984 -ad 4,6
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
python2.7 10984 root 4r FIFO 0,7 0t0 819132068 pipe
python2.7 10984 root 6r FIFO 0,7 0t0 819132069 pipe
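
For anyone reading along: a select with a one-second timeout on two read pipes, followed by a non-blocking wait4 on a child PID, is what a parent process typically does while it waits for a child it spawned to produce output or exit. Below is a minimal Python sketch of that pattern. It assumes this is roughly what the module is doing here (I have not confirmed that against the module source), and run_and_poll and give_up_after are names made up for illustration.

```python
import os
import select
import subprocess
import time


def run_and_poll(cmd, give_up_after=10.0):
    """Run cmd and poll its stdout/stderr pipes once per second.

    Hypothetical helper for illustration only. Returns a tuple
    (returncode, stdout, stderr); returncode is None when the child was
    still running after give_up_after seconds and had to be killed.
    """
    child = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    pipes = [child.stdout, child.stderr]
    output = {child.stdout: [], child.stderr: []}
    deadline = time.time() + give_up_after
    timed_out = False

    while pipes:
        # Shows up in strace as select(..., {1, 0}); "= 0 (Timeout)" means
        # neither pipe had anything to read during that second.
        readable, _, _ = select.select(pipes, [], pipes, 1)
        for pipe in readable:
            data = os.read(pipe.fileno(), 4096)
            if data:
                output[pipe].append(data)
            else:
                pipes.remove(pipe)  # EOF: the child closed this pipe
        # poll() is the non-blocking wait (wait4 with WNOHANG in strace).
        if not readable and child.poll() is not None:
            break
        if time.time() > deadline:
            timed_out = True
            child.kill()
            break

    returncode = child.wait()
    stdout = b"".join(output[child.stdout])
    stderr = b"".join(output[child.stderr])
    child.stdout.close()
    child.stderr.close()
    return (None if timed_out else returncode), stdout, stderr
```

If the child never writes anything and never exits, this loop spins forever without the deadline, which would match the repeating pair of lines in the trace. So the thing to look at is the child PID named in wait4, not the parent.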

What happens right before it goes into this loop is probably the interesting part and can identify what it is trying to access.
If not, you probably need to add print statements to the Python code to identify where it hangs.
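
One way to do the print-statement approach is a small begin/end marker that appends to a log file, so the last BEGIN without a matching END points at the step that hangs. This is only a sketch: traced and TRACE_LOG are hypothetical names, the log path is an assumption, and writing to a file rather than stdout is my own choice so the module's normal output is left alone.

```python
import contextlib
import time

TRACE_LOG = "/tmp/setup_trace.log"  # hypothetical path, pick any writable file


@contextlib.contextmanager
def traced(label, log_path=TRACE_LOG):
    """Append BEGIN/END markers for a block of code to log_path.

    The file is opened and closed for every marker, so each line is on
    disk before the wrapped code runs, even if that code never returns.
    """
    def note(event):
        with open(log_path, "a") as log:
            log.write("%.3f %s %s\n" % (time.time(), event, label))

    note("BEGIN")
    try:
        yield
    finally:
        note("END")
```

Wrap each suspect call in the unpacked module with something like `with traced("collect facter facts"):` (the label is free text), run the module by hand, and then look at the tail of the log file.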

Thank you for the print idea. I was able to trace it to the following command:

/usr/bin/facter --puppet --json

Looks like this version of facter doesn’t like the --puppet option. Will probably have to look into upgrading it.
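
To confirm that a suspect command hangs on its own, outside Ansible, it can be run with a time limit. A rough sketch follows. finishes_within is a made-up helper name, the 30-second limit in the usage comment is an arbitrary assumption, and the code sticks to calls that exist in both Python 2.7 and Python 3.

```python
import os
import subprocess
import time


def finishes_within(cmd, seconds):
    """Return True if cmd exits within the time limit, False if it was killed.

    Hypothetical helper: output is discarded, since the only question
    here is whether the command returns at all.
    """
    devnull = open(os.devnull, "w")
    try:
        child = subprocess.Popen(cmd, stdout=devnull, stderr=devnull)
        deadline = time.time() + seconds
        while time.time() < deadline:
            if child.poll() is not None:
                return True
            time.sleep(0.1)
        child.kill()
        child.wait()  # reap the killed child so no zombie is left behind
        return False
    finally:
        devnull.close()


# Usage on the affected host (command taken from the post above):
# finishes_within(["/usr/bin/facter", "--puppet", "--json"], 30)
```

Running the same check with and without --puppet should show whether that option is what makes it hang.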

Thanks again.