EC2 slow cloud-init, Ansible SSH connection fails due to race condition (wait_for is not good enough)

James_Cuzella · January 25, 2016, 9:03pm

Hello,

I believe I’ve found an interesting race condition during EC2 instance creation due to a slow-running cloud-init process. The issue is that cloud-init appears to create the initial login user & installs the public SSH key onto a newly started EC2 instance, then restarts sshd. It takes a while to do this, and creates a race condition where Ansible cannot connect to the host and fails the playbook run. In my playbook, I’m using the ec2 module, followed by add_host, and then wait_for to wait for the SSH port to be open. I have also experimented with using a simple “shell: echo host_is_up” command with a retry / do-until loop. However this also fails because Ansible wants the initial SSH connection to be successful, which it will not in this case. So Ansible does not retry

It appears that due to the user not existing until ~3 minutes after it is booted and sshd is listening on port 22, Ansible cannot connect as the initial login user for the CentOS AMI (“centos”). So the SSH port open check is not good enough to detect and wait for the port to be open AND the login user to exist. The simple echo shell command with retry do/until loop also does not work, because the very first SSH connection Ansible tries to make to run the module fails also.

For some detailed debug info, and a playbook to reproduce the issue, please see this Gist: https://gist.github.com/trinitronx/afd894c89384d413597b

My question is: Has anyone run into a similar issue with EC2 instances being slow to become available causing Ansible to fail to connect, and also found a solution to this?

I realize that a sleep task is one possible solution (and I may be forced to reach for that sledgehammer), but it doesn’t feel like the absolute best solution because we really want to wait for both cloud-init to be finished creating “centos” user on the instance AND SSH to be up. So really, the only other way I can think of is to somehow tell SSH to retry connecting as centos until it succeeds or a surpasses a very long timeout. Is this possible? Are there better ways of handling this?

Jared_Bristow · May 9, 2016, 7:46pm

I am having this same issue. Did you ever figure out a solution?

I have 3 different images I’m testing against: CentOS6, CentOS7, Sles12. The strange thing is that I only seem to have a problem on CentOS7.

Allen_Sanabria1 · May 9, 2016, 8:20pm

This is what I do, to make sure that SSH comes up, but also wait until the user has been created on my instance.

`

- set_fact:
    ec2_ip: "{{ ec2_name | get_instance(aws_region, state='running') }}"

- name: Wait for SSH to come up on instance
  wait_for:
    host: "{{ ec2_ip }}"
    port: 22
    delay: 15
    timeout: 320
    state: started

- name: Wait until the ansible user can log into the host.
  local_action: command ssh -oStrictHostKeyChecking=no ansible@{{ ec2_ip }} exit
  register: ssh_output
  until: ssh_output.rc == 0
  retries: 20
  delay: 10

`

jdelaporte · July 13, 2016, 5:40pm

I just discovered this issue as well, with various random ssh connection or generic authentication/permission failure messages, some occurring after a play successfully passed a task or two. It occurred very consistently with many CentOS 7 t2.nano hosts. A 2-minute pause after waiting for ssh listener resolved it for me. The system logs showed the centos user added about 1 minute into boot time, so I gave it two minutes to be generous:

`

name: wait for instances to listen on port:22
wait_for:
state: started
host: “{{ item }}”
port: 22
name: wait for boot process to finish
pause: minutes=2

`

It also helped to make sure I removed old host keys, even though I have strict checking turned off:

`

name: remove old host keys
known_hosts:
name: “{{item}}”
state: absent
with_items: “{{aws_eips}}”
`

Joanna

PS) Some of the errors I saw caused by this:
“failed to transfer file to /tmp/.ansible”

“Authentication or permission failure. In some cases, you may have been able to authenticate and did not have permissions on the remote directory. Consider changing the remote temp path in ansible.cfg to a path rooted in "/tmp". Failed command was: ( umask 77 && mkdir -p "` echo /tmp/.ansible/…”

j.r.hawkesworth · July 14, 2016, 8:16pm

I’m not an ec2 user but I wonder if it might be possible to adapt the approach used here to wait for a webservice to return:

https://groups.google.com/forum/#!topic/ansible-project/iLjIbsCASWU

In that case he’s using uri with an ‘until’ and using the ‘default’ filter to make sure no results doesn’t mean a task failure.

- name: run test
uri:
url: "[https://0.0.0.0:3030/api/canary](https://0.0.0.0:3030/api/canary)"
validate_certs: no
register: result
until: result['status']|default(0) == 200

Obviously you’d have to replace uri with something that tells you the instance is ready to start working, but it might give you a way to get rid of the need for the arbitrary pause.

Jon

Topic		Replies	Views
ssh failing for a newly created EC2 instances Ansible Project ubuntu	17	0	September 7, 2014
Ansible fails to connect to newly provisioned EC2 on 3rd task after successful running first 2 tasks Ansible Project ubuntu , aws	3	5	December 8, 2017
Reliable way to detect when EC2 instance is ready for login Ansible Project	1	8	June 10, 2016
Playbook can't connect to EC2 instance, but SSH works fine? Ansible Project aws	1	13	October 27, 2016
Ansible AWS EC2 create scripts gives this error:: "Timeout when waiting for 172.31.25.6:22"} Ansible Project aws	2	5	December 18, 2017

EC2 slow cloud-init, Ansible SSH connection fails due to race condition (wait_for is not good enough)

Related topics