Copy/Template modules hang on AIX - sh.py

I’ve been having issues running Ansible against AIX, specifically with the copy/template modules.

Periodically, copy/template plays will hang, either for a very long time (hours; leave it overnight and it might complete by the next day) or indefinitely. After reviewing debug output for a number of these instances, the issue appears to occur in the sh.py code under runner, specifically in the 'checksum' function. Below is example debug output from a point where the copy/template module hangs:

<aix14.mgmt.loc> EXEC ssh -C -tt -vvv -o ControlMaster=auto -o ControlPersist=60s -o ControlPath="/home/ansible/.ansible/cp/ansible-ssh-%h-%p-%r" -o Port=22 -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 aix14.mgmt.loc /bin/sh -c 'sudo -k && sudo -H -S -p "[sudo via ansible, key=sfknwylttinwgjiawaunhugtrjbqdymg] password: " -u root /bin/sh -c '"'"'echo SUDO-SUCCESS-sfknwylttinwgjiawaunhugtrjbqdymg; rc=flag; [ -r "/etc/ntp.conf" ] || rc=2; [ -f "/etc/ntp.conf" ] || rc=1; [ -d "/etc/ntp.conf" ] && rc=3; python -V 2>/dev/null || rc=4; [ x"$rc" != "xflag" ] && echo "${rc} /etc/ntp.conf" && exit 0; (python -c '"'"'"'"'"'"'"'"'import hashlib; print(hashlib.sha1(open("/etc/ntp.conf", "rb").read()).hexdigest())'"'"'"'"'"'"'"'"' 2>/dev/null) || (python -c '"'"'"'"'"'"'"'"'import sha; print(sha.sha(open("/etc/ntp.conf", "rb").read()).hexdigest())'"'"'"'"'"'"'"'"' 2>/dev/null) || (echo "0 /etc/ntp.conf")'"'"''
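
Unwrapping the nested quoting, the remote shell (under sudo) is essentially being asked to run something like the following; this is reconstructed purely for readability from the EXEC line above, with explanatory comments added:

rc=flag
[ -r "/etc/ntp.conf" ] || rc=2          # readable?
[ -f "/etc/ntp.conf" ] || rc=1          # regular file?
[ -d "/etc/ntp.conf" ] && rc=3          # directory?
python -V 2>/dev/null || rc=4           # is python available?
[ x"$rc" != "xflag" ] && echo "${rc} /etc/ntp.conf" && exit 0
(python -c 'import hashlib; print(hashlib.sha1(open("/etc/ntp.conf", "rb").read()).hexdigest())' 2>/dev/null) \
  || (python -c 'import sha; print(sha.sha(open("/etc/ntp.conf", "rb").read()).hexdigest())' 2>/dev/null) \
  || (echo "0 /etc/ntp.conf")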

This happens during random copy/template plays, not necessarily for the same file as in the example above. The issue is reproducible, but not consistently; roughly 1 in 5 runs or more will hit it. The file actually appears to copy over successfully, and then the session hangs. If I run "who -u" on the AIX host and kill the PID of the hung SSH session, the playbook continues on. I can confirm this happens with SFTP as well as with "scp_if_ssh = True", and it also happens with "pipelining = True" configured.
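
For anyone needing to unstick a hung play in the meantime, the manual workaround is just the following on the AIX host (the PID placeholder is whatever who -u reports for the stuck session):

who -u            # note the PID column for the hung ansible SSH session
kill <pid>        # once that process dies, the playbook carries on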

After digging about on the interwebs, I found a handful of references to issues with the version of Python that IBM ships as part of the AIX Toolbox for Linux Applications. The version we’re using is from http://www.perzl.org/aix/, which doesn’t suffer from the same issues (see https://github.com/ansible/ansible-modules-core/issues/80). I tried substituting 'hashlib.sha1' with 'hashlib.md5' and was able to reproduce the same hanging issue. Following some references online from other folks using Ansible to manage AIX, I also symlinked /bin/md5sum to /bin/csum; this did not fix our issues either. I can also periodically reproduce the issue when running a single ad-hoc ansible command using the copy module.

Below is truss output from an AIX box where this issue occurs; the truss is run against the SSH session process of the user connected in from Ansible. I’m by no means an expert at reading truss output, but it appears that /bin/sh is called, forks off a subprocess which right away sends a SIGCHLD, and then the process hangs on "close(8) (sleeping...)". This is where it sits for a very long time. The PID that gets forked off (24379542 in the example below) ends up in a <defunct> (zombie) state.

kwrite(4, "\0\00304 / b i n / s h "..., 776) = 776
kfcntl(7, F_DUPFD, 0x00000000) = 9
kfcntl(7, F_DUPFD, 0x00000000) = 10
sigprocmask(0, 0xF02B4970, 0xF02B4978) = 0
kfork() = 24379542
thread_setmymask_fast(0x00000000, 0x00000000, 0x00000000, 0xD052A400, 0x00000000, 0x11960029, 0x00000000) = 0x00000000
Received signal #20, SIGCHLD [caught]
sigprocmask(2, 0xF02B4970, 0x2FF21E80) = 0
_sigaction(20, 0x00000000, 0x2FF21F30) = 0
thread_setmymask_fast(0x00080000, 0x00000000, 0x00000000, 0x11960029, 0x00000003, 0x00000000, 0x00000000) = 0x00000000
kwrite(6, "\0", 1) = 1
ksetcontext_sigreturn(0x2FF21FE0, 0x2FF22FF8, 0x2002D0D0, 0x0000D032, 0x00000003, 0x00000000, 0x00000000)
close(8) (sleeping...)

In the interest of disclosing all information, I also noticed odd behavior with the 'w' command when trying to determine whether Ansible has an SSH session open on a host where a playbook is hanging. The 'w' command hangs for a few seconds when it reaches the user logged in and running the Ansible playbook. When I run truss against the 'w' command, I get the output below. The command gets the status of the user's pts, then receives a SIGALRM, which apparently means the system call is taking too long to respond:

kopen("/dev/pts/4", O_RDONLY|O_NONBLOCK) (sleeping...)
kopen("/dev/pts/4", O_RDONLY|O_NONBLOCK) Err#4 EINTR
Received signal #14, SIGALRM [caught]
_sigaction(14, 0x0FFFFFFFFFFFEEB0, 0x0FFFFFFFFFFFEEE0) = 0
ksetcontext_sigreturn(0x0FFFFFFFFFFFF000, 0x0000000000000000, 0x0FFFFFFFFFFFFFE8, 0x800000000000D032, 0x3FFC000000000003, 0x00000000000000E8, 0x0000000000000000, 0x0000000000000000)
statx("/dev/pts/4", 0x0FFFFFFFFFFFF618, 176, 0) = 0
incinterval(0, 0x0FFFFFFFFFFFF4F8, 0x0FFFFFFFFFFFF518) = 0
statx("/dev/pts", 0x0FFFFFFFFFFFF618, 176, 0) = 0
statx("/dev/pts/4", 0x0FFFFFFFFFFFF638, 176, 0) = 0
ansible pts/4 03:15PM 36 0 0 -
kwrite(1, " a n s i b l e   p t s"..., 62) = 62
kread(3, "\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 1136
_sigaction(14, 0x0FFFFFFFFFFFF4F0, 0x0FFFFFFFFFFFF520) = 0
incinterval(0, 0x0FFFFFFFFFFFF4F8, 0x0FFFFFFFFFFFF518) = 0

My environment is as follows:

Ubuntu 12.04
Ansible 1.8.2 (installed from the Ansible PPA)
AIX 7.1 (have reproduced for sure on TL2SP4, and TL1SP0)
python 2.7.5

OK, I ‘think’ I’ve gotten to the root of this particular problem, and it doesn’t appear to have anything to do with python or binaries to generate checksums.

I was able to reproduce the same hanging behavior just by performing an SSH connection with the same arguments Ansible uses. After a process of elimination, I was able to reasonably conclude that my issue is with pseudo-tty allocation on AIX.
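
Concretely, a by-hand session along these lines (the command here is just a placeholder, not the exact reproducer) would intermittently hang in the same way, while the -T variant has not hung in my testing:

ssh -tt aix14.mgmt.loc /bin/sh -c 'echo hello'    # forced pseudo-tty, as Ansible passes it: hangs intermittently
ssh -T aix14.mgmt.loc /bin/sh -c 'echo hello'     # pseudo-tty allocation disabled: no hangs observed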

In Ansible, part of the SSH arguments that get passed is '-tt'. Looking at the ssh man page for '-t':

Forces pseudo-tty allocation. This can be used to execute arbitrary screen-based programs on a remote machine, which can be very useful, e.g. when implementing menu services. Multiple -t options force tty allocation, even if ssh has no local tty.

I’m guessing there is something hokey with pseudo-tty allocation on AIX, or perhaps with how IBM's build of the OpenSSH client handles it (the OpenSSH version from IBM as of AIX 7.1.2.4 is 6.0p1). I did some comparisons of debug output from SSH sessions that hung and ones that didn't, but it was hard to pinpoint the issue from that alone.

It appears that pseudo-ttys are only required for an interactive shell, and I have not been able to find the reason Ansible 'needs' a pseudo-tty. There are some vague references online claiming that a default-configured sudoers requires a tty (requiretty), when in fact sudo's upstream DEFAULT is to NOT require one; that default appears to have been changed by RHEL around RHEL 5. In fact, the comments in Ansible's ssh.py lead me to believe they DON'T want to use a tty at all (a tty causes issues for the modules because Python's interactive mode inserts empty lines).
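
If in doubt about a particular host, it is easy enough to check whether requiretty is actually configured there (illustrative; sudoers layout varies by system):

grep -ri requiretty /etc/sudoers /etc/sudoers.d/ 2>/dev/null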

Testing my 'fix' involved changing the line in ssh.py that adds '-tt' to the ssh_cmd array so that it passes '-T' instead (which disables pseudo-tty allocation). After this change, I have not been able to reproduce the hangs with copy/template plays, or any others for that matter.
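
For reference, the edit amounts to something like this in the 1.8-era connection plugin (lib/ansible/runner/connection_plugins/ssh.py). The surrounding context is paraphrased as a small standalone function rather than the exact upstream source, so treat it as a sketch of the change, not a diff:

# Sketch of the relevant logic where the ssh argument list is assembled
# (names are paraphrased; the real code lives in exec_command()).
def build_ssh_args(in_data):
    ssh_cmd = ["ssh", "-C"]
    if not in_data:
        # upstream appends a forced pseudo-tty here:
        #     ssh_cmd += ["-tt"]
        # local workaround for the AIX hang: disable pty allocation instead
        ssh_cmd += ["-T"]
    return ssh_cmd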

ttys are needed for privilege escalation (sudo, su, pbrun) when they prompt for passwords. If you use sudo NOPASSWD or log in as root, you are fine without it.

Thanks for the explanation Brian.

From the outset, we had set up a user with NOPASSWD in sudoers, so this never even came up. Just for my own curiosity, I did run an ad-hoc copy command using a different user that has sudo access, but without NOPASSWD. I confirmed in the verbose ansible command output that the '-tt' was changed to '-T', and the command does indeed hang:

ansible -vvvv -m copy -a "src=/some/file dest=/tmp" -u -s -kK

Just as you state, this must be due to not having a tty to interactively pass the password to sudo. I can see from the last entry in the server's auth.log that the user's password was accepted, and a couple of processes are sitting in the process table. The process associated with the original connecting user is in a '_sigsuspend' state when I check it with truss, so I'm guessing it will stay there, waiting on a password.

I’m going to see if I can get an APAR opened with IBM on this issue, since it appears to be broken behavior in AIX. For the time being, though, I'm leaving the '-T' in the code, and I'll just have to document and maintain that for us going forward. Not ideal, but we will always use a service account with NOPASSWD, either for all commands or just the ones needed.
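
For completeness, the kind of sudoers entry our service account relies on looks like this ("ansible" is a placeholder account name; edit with visudo):

ansible ALL=(ALL) NOPASSWD: ALL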

Thanks

Matthew,

Thanks, this is great documentation for any other AIX user that hits this. Keep us posted on the IBM response.

It’s been a few months since this was posted, but I recently got some updated information back from IBM and thought I’d share here. This is specific to using ‘ssh -tt’ to connect to sshd running on AIX.

From IBM support:

Here is a summary of my observations till now:

1. The server child process terminates, since it is the process that gets scheduled immediately after the fork. It sets up the shell, sets other environment variables, goes ahead to exec the command, and completes its execution. Later the server parent process gets executed, which tries to close the slave side of the pseudo terminal (since it will be attached to the ssh client), and it gets stuck there since the server child process has already terminated.

2. In case some scheduler switching happens between these two processes and the child process is still present when the parent tries to close the slave side of the pseudo terminal, then the parent goes ahead and sends the CHANNEL_WINDOW_ADJUST packet to the ssh client program, followed by the SSH2_MSG_CHANNEL_SUCCESS / FAILURE message, and we don't see any issue. ← This is the working case.

3. My further steps will be to see if this behaviour is also seen on a Linux machine. If yes, then I will report it to the community and get a fix for the same; else I will work on a solution that is specific to AIX (maybe through some locks or other methods which can introduce synchronisation between these processes).

This is as of 12-14-15. I was able to see this behavior connecting from Linux (various flavors and versions) to AIX, from AIX to AIX (both 6.1 and 7.1), as well as from AIX connecting to sshd on localhost. This is with different versions of OpenSSH distributed with AIX (depends on AIX level and any additional patched versions of OpenSSH installed).

Thank you very, very much Matthew, we have the same problem on AIX clients and it is a blocker issue. I confirm that changing ssh.py to use the -T option fixes the problem.

There is no need to change ssh.py: you can set -T in the ssh arguments (globally or for the group/hosts) and it will override the -tt, as user options get appended after Ansible's own.
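
For example, something like this in ansible.cfg (assuming, as described above, that your Ansible version appends these args after its own -tt, so the trailing -T wins):

[ssh_connection]
ssh_args = -C -o ControlMaster=auto -o ControlPersist=60s -T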

usetty = False seems to do the trick as well.

Turning the setting to False removes the -tt option from the ssh command-line arguments.
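
On Ansible releases where the ssh connection plugin exposes the usetty option, the ansible.cfg form looks like this:

[ssh_connection]
usetty = False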