Problems with the sudo keyword

This evening, I re-used a very simple playbook that was working a while back (0.8-devel). It didn’t work this time (errors shown below) unless I comment out the sudo keyword.

First, the playbook (simplified somewhat, but I ran it as shown here during my test):

TASK: [Install client ntp on a host running Red Hat alike Linux] *********************
<fedora17-ci> REMOTE_MODULE yum pkg=ntp state=installed
failed: [fedora17-ci] => {"failed": true, "parsed": false}
invalid output was: sudo: unknown user: None

I'll have to check on this to see if I can reproduce it. I suspect
in replacing $SHELL with an explicit /bin/sh last night some of the
quoting that was removed was actually needed.

Exception in thread Thread-4 (most likely raised during interpreter shutdown):
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 552, in __bootstrap_inner
  File "/usr/lib/python2.7/threading.py", line 505, in run
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 298, in _handle_workers
<type 'exceptions.TypeError'>: 'NoneType' object is not callable

I can get around the last part by specifying say -f 10 but it's also new.

Welcome to the development branch. We're tweaking the
multiprocessing usage some.

What was your other --forks value? Doesn't seem to explain anything
unless you had configured it to be 1.

[…]

I’ll have to check on this to see if I can reproduce it. I suspect
in replacing $SHELL with an explicit /bin/sh last night some of the
quoting that was removed was actually needed.

That would be great. Just as a double-check: below I ran the same playbook on an Ubuntu 11.10 host, which still has the old ansible 0.8 devel. The playbook worked just fine as it is:

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 11.10
Release: 11.10
Codename: oneiric
$ ansible-playbook --version
ansible-playbook 0.8

$ ansible-playbook install_ntp.yml --extra-vars "hosts=fedora17-ci user=root sudo=False" -k
SSH password:

PLAY [fedora17-ci] *********************

GATHERING FACTS *********************
ok: [fedora17-ci]

TASK: [Install client ntp on a host running Red Hat alike Linux] *********************
ok: [fedora17-ci]

TASK: [Ensure the service is started and is enabled at boot time] *********************
ok: [fedora17-ci]

PLAY RECAP *********************
fedora17-ci : ok=3 changed=0 unreachable=0 failed=0

[…]

Welcome to the development branch. We’re tweaking the
multiprocessing usage some.

What was your other --forks value?

I used the default. As far as I know, it has been 5 in the past.

Doesn’t seem to explain anything unless you had configured it to be 1.

Nope. Didn’t do that. As shown from the above session log on the Ubuntu 11.10 host, the playbook worked just fine.

Regards,

– Zack

* zperry <zack.perry at sbcglobal.net> [2012/10/26 09:48]:

> [...]
>
> I'll have to check on this to see if I can reproduce it. I suspect
> in replacing $SHELL with an explicit /bin/sh last night some of the
> quoting that was removed was actually needed.
>

That would be great. Just as a double-check: below I ran the same
playbook on an Ubuntu 11.10 host, which still has the old ansible 0.8 devel.

Just a word of caution here -- On most linuxen, /bin/sh is really
bash, and even in POSIX mode it accepts some bashisms; but on other
(non-linux) systems, /bin/sh is a real Bourne shell, on which
bashisms are unsupported. If ansible is going to explicitly call
/bin/sh, there should be no unintentional bashisms in there. We've
been bitten by unintentional bashisms many, many times and they're
very difficult to track down, since they only happen on "obscure"
(read: non-linux) systems.
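
To make the point above concrete, here is a minimal, heuristic sketch of a check for a few well-known bash-only constructs in a command string that is about to be handed to /bin/sh. The pattern list and the `find_bashisms` name are assumptions made for illustration; the list is far from complete and this is not something ansible itself ships.

```python
import re

# A few common bash-only constructs. Illustrative only, not a complete list.
BASHISM_PATTERNS = {
    "[[ ... ]] test": re.compile(r"\[\[\s"),
    "'source' builtin": re.compile(r"(^|[;&|]\s*)source\s"),
    "here-string <<<": re.compile(r"<<<"),
    "&> redirection": re.compile(r"&>"),
}


def find_bashisms(command):
    """Return the sorted names of suspicious constructs found in a command string."""
    return sorted(
        name
        for name, pattern in BASHISM_PATTERNS.items()
        if pattern.search(command)
    )


if __name__ == "__main__":
    # Plain POSIX: nothing is flagged, including the 2>&1 redirection.
    print(find_bashisms("mkdir -p /tmp/x && echo /tmp/x >/dev/null 2>&1"))
    # bash-only double-bracket test: flagged.
    print(find_bashisms("[[ -d /tmp/x ]] && echo ok"))
```

A check like this only catches the obvious cases; running the command under a strict Bourne-style shell remains the real test.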

Thanks for sharing your experience. In our case, it’s bash on both ends: Ubuntu (11.04/11.10) to Fedora 17. All 64-bit too.

Regards,

– Zack

Even with systems that tout a ‘POSIX bourne shell’ you have differences; POSIX leaves many things vague and they get implemented differently. I’ve found annoying differences across the BSDs, and they share a big part of the codebase.

We do almost nothing with bash. Search for "/bin/sh" in the code to
see what we call.

Shouldn't be any problems.

Tonight I'm going to work on moving the multiprocessing stuff back
*closer* to (but better than) the way it originally was. This will fix
0.9 Control-C handling and fix what I'm assuming is a race condition
in the items below.

Fixes to multiprocessing pushed. Sudo seems to work for me, let me
know if you see problems.

I decided to post my test results here. More people may chip in to ensure it’s not an issue only in our environment.

This sudo issue still exists. The following setups were used for my testing:

  1. From SL 6.3 to SL 6.3. Python 2.6.6
  2. From SL 6.3 to Fedora 17. Python 2.6.6; 2.7.3
  3. From Ubuntu 11.04 to Fedora 17. Python 2.7.1+; 2.7.3

All are running the latest 0.9 devel. The multiprocessing error no longer appears, even with the default -f. But with the simple test playbook shown below, all 3 pairs listed above produce identical errors:

I fired up pdb in GNU emacs, and stepped through the command that I used in my post. After setting 3 break points, I got the following:

[…]
In def run()

[…]
192 for play in plays:
193 if not self._run_play(play):
194 break

which produced the following:

PLAY [fedora17-ci] *********************

TASK: [Install client ntp on a host running Red Hat alike Linux] *********************
ESTABLISH CONNECTION FOR USER: root on PORT 22 TO fedora17-ci
EXEC /bin/sh -c 'mkdir -p /var/tmp/ansible-1351303572.75-209527505804614 && echo /var/tmp/ansible-1351303572.75-209527505804614'
REMOTE_MODULE yum pkg=ntp state=installed
PUT /tmp/tmpwRqyRt TO /var/tmp/ansible-1351303572.75-209527505804614/yum
EXEC /bin/sh -c 'chmod a+r /var/tmp/ansible-1351303572.75-209527505804614/yum'
EXEC sudo -k && sudo -p "[sudo via ansible, key=mhfogsjplklfcjihijcaezpzhxzlbpso] password: " -u None /bin/sh -c '/usr/bin/python -tt /var/tmp/ansible-1351303572.75-209527505804614/yum; rm -rf /var/tmp/ansible-1351303572.75-209527505804614/ >/dev/null 2>&1'
failed: [fedora17-ci] => {"failed": true, "parsed": false}
invalid output was: sudo: unknown user: None

sudo: unable to initialize policy plugin

FATAL: all hosts have already failed -- aborting

/usr/lib/pymodules/python2.7/ansible/playbook/__init__.py(194)run()

See the two lines in bold above. They match!

I am not so sure why the ‘root’ user suddenly became None in this case. BTW, are there any hints for debugging ansible somewhere? The official doc “Module Development” is very light in this regard… I would love to learn more and do more…

I hope the above trace is of some help in nailing down the real culprit.
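
As a rough illustration of how a `-u None` could end up in the command (a guess at the mechanism, not ansible's actual code): if the sudo user is still Python's `None` when it gets formatted into the command string, the literal text `None` is what sudo receives. The `build_sudo_cmd` helper and the `/var/tmp/example/yum` path below are hypothetical, and `shlex.quote` is the Python 3 spelling (on Python 2.7 the equivalent is `pipes.quote`).

```python
import shlex


def build_sudo_cmd(cmd, sudo_user=None, executable="/bin/sh"):
    """Build a sudo command line, falling back to root when no user is given.

    Hypothetical helper for illustration; not ansible's real implementation.
    """
    # Formatting None straight into the string would yield "-u None",
    # which is what the failing trace shows.
    user = sudo_user if sudo_user else "root"
    return "sudo -k && sudo -u %s %s -c %s" % (
        shlex.quote(user),
        executable,
        shlex.quote(cmd),  # keep the whole command as one argument to -c
    )


# What happens when None is formatted directly into the command string.
broken = "sudo -u %s /bin/sh -c true" % None

# The same idea with an explicit fallback to root and quoting of the payload.
fixed = build_sudo_cmd("/usr/bin/python -tt /var/tmp/example/yum")

if __name__ == "__main__":
    print(broken)
    print(fixed)
```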

Regards,

– Zack

So, AFAIK, it's *never* been possible to pass in the sudo boolean via
--extra-vars. (Not that it shouldn't be possible).

sudo: True/False explicitly requires a boolean.

It seems you need to set sudo_user to something, though if sudo_user
is not set, it should be root.

You've given me enough to look into what's going on (will look today),
but your usage of passing in the sudo bit via --extra-vars in that way
is undoubtedly why no one else is seeing it -- it's unlikely anyone
else is doing that. Not saying it's wrong, it's probably just
unique.
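
One plausible reason a command-line value behaves differently from a literal `sudo: False` in the playbook, sketched under assumptions rather than taken from ansible's code: key=value pairs given on the command line arrive as strings, and the non-empty string "False" is truthy in Python. The `parse_extra_vars` and `to_bool` helpers are hypothetical names used only for this illustration.

```python
import shlex


def parse_extra_vars(raw):
    """Split a 'k1=v1 k2=v2' string into a dict of strings (hypothetical helper)."""
    result = {}
    for token in shlex.split(raw):
        key, _, value = token.partition("=")
        result[key] = value
    return result


def to_bool(value):
    """Coerce a command-line string such as 'False' or 'yes' to a real boolean."""
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() in ("1", "true", "yes", "on")


extra = parse_extra_vars("hosts=fedora17-ci user=root sudo=False")

# The value is the string "False", not the boolean False.
naive = bool(extra["sudo"])        # non-empty string, so this is truthy
coerced = to_bool(extra["sudo"])   # explicit coercion gives the intended value

if __name__ == "__main__":
    print(extra, naive, coerced)
```

If something like this is in play, any code that tests the raw value with a plain truthiness check would treat `sudo=False` as "sudo requested".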

Thanks Michael,

Comments and more info on our ansible usage are inline.

So, AFAIK, it’s never been possible to pass in the sudo boolean via
--extra-vars. (Not that it shouldn’t be possible).

On our end, we have been doing so without any problems since 0.5, until the late stage of 0.8.

sudo: True/False explicitly requires a boolean.

It seems you need to set sudo_user to something, though if sudo_user
is not set, it should be root.

You’ve given me enough to look into what’s going on (will look today),
but your usage of passing in the sudo bit via --extra-vars in that way
is undoubtedly why no one else is seeing it – it’s unlikely anyone
else is doing that. Not saying it’s wrong, it’s probably just
unique.

On our various test systems (CentOS, Fedora, SL, Ubuntu), we use sudo a lot and consistently. Developers also get to do things that require ‘root’ privilege, but only two of us sysadmins have the actual root password to each system. Now you can see why we write many of our small playbooks the way we do.

Regards,

– Zack

Fix for passing in sudo via --extra-vars has been pushed.

Many thanks Michael! Fix confirmed.

Regards,

– Zack

* Michael DeHaan <michael.dehaan at gmail.com> [2012/10/26 17:42]:

We do almost nothing with bash. Search for "/bin/sh" in the code
to see what we call.

I didn't mean to imply there *was* a problem, just that it's a thing
to keep in mind. Apologies if I gave the impression that there were
existing problems.