wait_for_connection: doesn't do the job (as expected)

Hi,

I'm starting two EC2 SLES 12 instances with Ansible. After they are up and running, the same playbook should configure the machines.

So I include

- name: Wait for connection to
  wait_for_connection:

after the ec2: module. Ansible waits a few seconds for the connection. That's OK.
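(For reference, wait_for_connection also takes tuning parameters; my task above just uses the defaults. A tuned version could look like this, with made-up values:)

- name: Wait for connection to
  wait_for_connection:
    delay: 10        # seconds to wait before the first check
    sleep: 5         # seconds between checks
    timeout: 300     # give up after 5 minutes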

But the check doesn't work as expected: when the playbook goes on to do something on the EC2 instances, I get a

FAILED! => {"failed": true, "msg": "Timeout (12s) waiting for privilege escalation prompt: "}

OK. Shit happens. But we have a do-until loop.

- name: start config
  setup: gather_timeout=120
  register: result
  ignore_errors: yes
  until: result|success
  retries: 10
  delay: 10

I would expect this loop to keep retrying until both machines have succeeded with the setup task.

But

TASK [Wait for connection to] **********************************************************************************************************************************************************************

ok: [ip-10-104-30-63.eu-central-1.compute.internal]
ok: [ip-10-104-28-82.eu-central-1.compute.internal]

TASK [start config] **********************************************************************************************************************************************************************

ok: [ip-10-104-30-63.eu-central-1.compute.internal]
fatal: [ip-10-104-28-82.eu-central-1.compute.internal]: FAILED! => {"failed": true, "msg": "Timeout (12s) waiting for privilege escalation prompt: "}
…ignoring

If I don't ignore the error, the playbook stops right here.
If I ignore it, no facts are gathered on machine ip-10-104-28-82, so the playbook fails later.

a) Is this a bug in "wait_for_connection:"? (I think yes.)
b) How to write a playbook that is fail-safe?

Thanks,
Reiner

> I'm starting two EC2 SLES 12 instances with Ansible. After they are
> up and running, the same playbook should configure the machines.
>
> So I include
>
> - name: Wait for connection to
>   wait_for_connection:
>
> after the ec2: module. Ansible waits a few seconds for the connection.
> That's OK.

-snip-

> a) Is this a bug in "wait_for_connection:"? (I think yes.)

No, wait_for_connection does a complete end-to-end test by running a ping/win_ping module on the remote end. If it reports 'ok', then the service worked without a doubt.

The timeout waiting for a privilege escalation prompt indicates to me that even though the system comes back and provides a working transport, privilege escalation is not working yet.

If this is the case, we should be looking at making sure that wait_for_connection also uses privilege escalation. That might be a solution, but you have to check. Did you try running it as root (without privilege escalation), or running everything as your user?

Does it fail in that case too?
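For example, something along these lines (a sketch, the task names are mine) would separate the transport check from the escalation check:

- name: End-to-end transport check, no escalation
  wait_for_connection:
  become: false

- name: Trivial escalated task, to exercise sudo itself
  command: /bin/true
  become: true

If the first task passes and the second one still times out, the transport is fine and only the privilege escalation is lagging behind.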

> b) How to write a playbook that is fail-safe?

It appears that somehow on your system the service becomes available and then disappears or is blocked again. That seems to be the problem. If this is a timing issue and you know it settles within 15 seconds, you could add a `pause` task.
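For example (15 seconds is an arbitrary value, you would have to tune it):

- name: Wait until the transport answers
  wait_for_connection:

- name: Give the instance time to settle before escalating privileges
  pause:
    seconds: 15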

But the essence here is: you have to figure out what exactly is happening before you can come up with a working solution.

-snip-

>> a) Is this a bug in "wait_for_connection:"? (I think yes.)
>
> No, wait_for_connection does a complete end-to-end test by running a
> ping/win_ping module on the remote end. If it reports 'ok', then the
> service worked without a doubt.
>
> The timeout waiting for a privilege escalation prompt indicates to me
> that even though the system comes back and provides a working transport,
> privilege escalation is not working yet.
>
> If this is the case, we should be looking at making sure that
> wait_for_connection also uses privilege escalation. That might be a
> solution, but you have to check. Did you try running it as root (without
> privilege escalation), or running everything as your user?
>
> Does it fail in that case too?

How do I test this? With "become: false"?

The first thing I configure on the machine is a swap file. So the playbook
needs root privileges.
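(To give an idea, the swap setup is roughly along these lines; a sketch, not my exact tasks:)

- name: Create the swap file
  command: dd if=/dev/zero of=/swapfile bs=1M count=2048
  args:
    creates: /swapfile    # skip if the file already exists
  become: true

- name: Restrict permissions on the swap file
  file:
    path: /swapfile
    owner: root
    group: root
    mode: "0600"
  become: true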

What I can tell is that it seems to be a SLES 12 problem. In the cloud-init
file of the EC2 instances I have a

runcmd:
  - export HTTPS_PROXY=http://<blabla>:8080
  - /usr/sbin/registercloudguest --force-new

to register SLES with the SUSE repos.

If I disable these three lines, wait_for_connection: works. (I tried it several times.)

But I don't know what registercloudguest does, or how it affects Ansible's privilege escalation.

>> b) How to write a playbook that is fail-safe?

> It appears that somehow on your system the service becomes available
> and then disappears or is blocked again. That seems to be the problem.
> If this is a timing issue and you know it settles within 15 seconds,
> you could add a `pause` task.

Which "service" does ansible use for privilege escalation? I think "sudo"
is not a running service so I can't be blocked.Or?
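One way I could probe this (a sketch; the task is my guess at a diagnostic, not something from my playbook) would be to call sudo non-interactively and retry until it answers:

- name: Probe sudo until it answers non-interactively
  command: sudo -n true    # rc 0 once passwordless sudo works
  become: false
  register: sudo_probe
  until: sudo_probe.rc == 0
  retries: 10
  delay: 10
  ignore_errors: yes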

> But the essence here is: you have to figure out what exactly is
> happening before you can come up with a working solution.

It's difficult to debug because I can't access the machine during boot. And later there is nothing about any error in cloud-init-output.log.

Reiner