Ansible fails to connect to newly provisioned EC2 instance on 3rd task after successfully running the first 2 tasks

I have a playbook that creates EC2 instances and adds them to an in-memory group using the add_host module. I can then connect to the hosts in that group and run two tasks successfully before a third fails.
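
For context, the provisioning play looks roughly like this (trimmed down; the AMI, instance type, and result variable name are placeholders for my real values):

- hosts: localhost
  connection: local
  gather_facts: false
  tasks:
    - name: Launch the instances
      ec2:
        key_name: AnsibleTest      # matches AnsibleTest.pem below
        instance_type: t2.micro    # placeholder
        image: ami-xxxxxxxx        # placeholder AMI
        wait: yes                  # block until the instances are running
      register: ec2_result

    - name: Add the new instances to the in-memory group
      add_host:
        name: "{{ item.public_ip }}"
        groups: ec2hosts
      with_items: "{{ ec2_result.instances }}"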

I am seeing this problem just running the same file module to create directories. I have something like this in my main playbook (ec2hosts is the in-memory group created after provisioning):

- hosts: ec2hosts
  user: ubuntu
  gather_facts: false
  name: try the setup
  tasks:
    - name: Get EC2 facts
      ec2_metadata_facts:
      register: ec2_facts

    - name: import configure role
      import_role:
        name: configure
      vars:
        efs_ids: "{{ efs_id }}"

The configure role is very simple:

- name: Make the aws credentials directory
  file:
    state: directory
    path: ~/.aws

- name: Make the hi directory
  file:
    state: directory
    path: ~/.hi

- name: Make a temp directory
  file:
    state: directory
    path: ~/.temp

- name: Make a bar directory
  file:
    state: directory
    path: ~/.bar

And this fails at the "Make a temp directory" task. The failed output with -vvv looks like this:

<35.160.185.188> (0, '', "Warning: Permanently added '35.160.185.188' (ECDSA) to the list of known hosts.\r\n")
<35.160.185.188> ESTABLISH SSH CONNECTION FOR USER: ubuntu
<35.160.185.188> SSH: EXEC ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -i AnsibleTest.pem -o 'IdentityFile="[omitted_full_path]/AnsibleTest.pem"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=ubuntu -o ConnectTimeout=10 -tt 35.160.185.188 '/bin/sh -c '"'"'/usr/bin/python /home/ubuntu/.ansible/tmp/ansible-tmp-1510596698.75-58373657425242/file.py; rm -rf "/home/ubuntu/.ansible/tmp/ansible-tmp-1510596698.75-58373657425242/" > /dev/null 2>&1 && sleep 0'"'"''
<35.160.185.188> (255, '', 'ssh_exchange_identification: read: Connection reset by peer\r\n')
fatal: [35.160.185.188]: UNREACHABLE! => {
    "changed": false,
    "msg": "Failed to connect to the host via ssh: ssh_exchange_identification: read: Connection reset by peer\r\n",
    "unreachable": true
}

I am using the following ssh_args in my ansible.cfg for the playbook:

ssh_args = -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -i "AnsibleTest.pem"

Does anyone know what’s happening here? This seems pretty weird and I’m stuck.
Thanks!

I eventually fixed this by adding a retries parameter to the [ssh_connection] section of my ansible.cfg, so it looks like this:

[ssh_connection]
ssh_args = -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no
retries = 10

Pretty lame! If anyone finds a better solution please let me know…

Perhaps the instances were not actually up and accessible yet. Did you use wait_for or wait_for_connection after creating them, to ensure they are up and accessible before moving on?
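
For example, something like this as the first task against the new group, before any real work (the delay and timeout values are just illustrative):

- hosts: ec2hosts
  user: ubuntu
  gather_facts: false
  tasks:
    - name: Wait until the host is reachable over SSH
      wait_for_connection:
        delay: 10      # give sshd a moment before the first attempt
        timeout: 300   # give up after five minutes

wait_for against port 22 (delegated to localhost) also works, but wait_for_connection goes through the actual connection plugin, which is closer to what your later tasks will do.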

Yes, I use the wait parameter in the ec2 module. And the weirdest thing is that two of the tasks work before the third fails, so the connection comes up, works, and then just stops working. I've seen this when uploading files as well, with messages like "the sftp file transfer mechanism failed", but the retries come to the rescue.

From where I sit, I pretty much don't see a way to do any of this EC2 work without the retries, and I'm surprised no one else has run into this.