When I provision an EC2 instance, I add a user_data script that drops an SSH pubkey so I can log in as root. The problem is that it’s difficult to tell exactly when cloud-init has finished. Even if port 22 is accepting connections, the pubkey may not be in place yet, so SSH logins will fail. I use the following task to try to determine when the instance will accept my connection:
```yaml
- name: ensure instance is ready
  become: no
  raw: printf "success"
  register: result
  until: '"success" in result.stdout_lines'
  retries: 300
  delay: 1
  failed_when: false
```
This task works maybe 75% of the time. A small fraction of the time, I get a FATAL UNREACHABLE error, but if I rerun the playbook immediately afterwards, it works fine. Since this is a FATAL error, there doesn’t appear to be any way to retry it.
Before using this technique, I used the ‘command’ module to call out to SSH directly, which was more reliable because I could do retries. However, I have an additional requirement: I need to be able to change the user I’m connecting as using set_fact or ‘-u’, and this didn’t seem to work with the command module.
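For reference, the earlier command-based approach looked roughly like the sketch below. This is a reconstruction, not the exact task I ran: the variable name `ssh_login_user` is hypothetical, the SSH options are just ones I would consider reasonable for a throwaway instance, and it assumes fact gathering is disabled so nothing tries to connect before this task runs.

```yaml
# Rough sketch of the earlier approach: run the local ssh client from the
# control machine and retry until key-based login actually succeeds.
# `ssh_login_user` is a hypothetical variable, not from my real playbook.
- name: wait until SSH key login succeeds
  delegate_to: localhost
  become: no
  command: >-
    ssh -o BatchMode=yes -o ConnectTimeout=5
    -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null
    {{ ssh_login_user | default('root') }}@{{ ansible_host | default(inventory_hostname) }}
    printf success
  register: ssh_probe
  # A failed ssh attempt is an ordinary task failure here, so until/retries apply.
  until: ssh_probe.rc == 0 and 'success' in ssh_probe.stdout
  retries: 300
  delay: 1
  changed_when: false
```

Because the connection attempt happens inside a local command, a refused or rejected login shows up as a non-zero return code instead of an unreachable host, which is why retries work. The downside is that the login user has to be passed in explicitly, which is where my set_fact / ‘-u’ requirement gets awkward.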
Are there any other good patterns to detect when an EC2 instance has completed the entire boot process?