EC2 module: state=present always returns 'wait for instances running timeout'

We’ve had a task that creates EC2 instances, and it ran unchanged for weeks. Sometime after Thu Nov 13 04:11:43 UTC 2014, the module began timing out on every request to create new instances. This happens on both Mac OS X and Ubuntu. I’ve updated to the latest ansible, boto, awscli, etc., with no effect. We haven’t changed this code in a very long time.

Was something changed during AWS re:invent? Is there something else going on with the AWS CLI that the EC2 module conflicts with now?

ansible-playbook 1.7.2

Python 2.7.5+

aws-cli/1.6.2 Python/2.7.5+ Linux/3.11.0-12-generic

Not that I’m aware of.

Anyone else seeing similar problems?

Is there any way to get more debugging output than ansible-playbook -vvvv provides? I’d like to know whether it’s a network issue, ssh, or something else, but the instance is terminated before I can test it.

I figured this out, and it makes me wonder whether anyone has already requested that boto debug info be propagated through ansible. If not, is there a clean way this could be included? I had to write a script that dumps boto’s debug output to find the reason for the failure. Being able to propagate this error up to ansible might be beneficial in the future. Additionally, I’m wondering why so many volumes were left orphaned in our AWS account. I’m not blaming the EC2 module, but I haven’t found the cause yet.

We hit a limit in our EBS volumes. For future debugging reference, I’ll include my troubleshooting below.

I created a simple script based on a StackOverflow post (http://stackoverflow.com/a/20658354/1464556):

```
#!/usr/bin/python
# Print boto's debug log and the instances of the first reservation.

import os
from pprint import pprint

import boto

print(boto.Version)

# Send boto's debug logging to the console.
boto.set_stream_logger('boto')

conn = boto.connect_ec2(
    aws_access_key_id=os.environ['AWS_ACCESS_KEY_ID'],
    aws_secret_access_key=os.environ['AWS_SECRET_ACCESS_KEY'])

res = conn.get_all_reservations()

pprint(res[0].instances)
```
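As a possible shortcut for anyone debugging the same symptom: boto 2's `Instance` objects expose a `state_reason` attribute, which, as far as I can tell, is a dict-like value with `code` and `message` keys. The sketch below is only an illustration under that assumption. The helper names `collect_state_reasons` and `fetch_state_reasons` are hypothetical and not part of boto or ansible.

```python
def collect_state_reasons(reservations):
    """Return (instance_id, state, code, message) for instances reporting a state reason."""
    rows = []
    for reservation in reservations:
        for instance in reservation.instances:
            # Assumption: state_reason is None or a dict-like with 'code' and 'message'.
            reason = getattr(instance, "state_reason", None)
            if not reason:
                continue
            rows.append((instance.id, instance.state,
                         reason.get("code"), reason.get("message")))
    return rows


def fetch_state_reasons():
    """Query EC2 and return the state reasons (needs boto 2 and AWS credentials)."""
    # Imported lazily so collect_state_reasons can be used without boto installed.
    import boto
    conn = boto.connect_ec2()  # reads credentials from the environment or boto config
    return collect_state_reasons(conn.get_all_reservations())
```

Calling `fetch_state_reasons()` while the playbook runs should list any instance that AWS shut down on its own, along with the reason code, without having to read through the full debug log.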

I ran the ansible-playbook command and took snapshots of the boto output every 15 seconds while it ran, then diff’ed the first and last one:

```
while true; do echo "running $(date)"; /tmp/test.py &> /tmp/test.py.output.$(date +%s); sleep 15; done
```
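The comparison of the first and last snapshot can also be scripted. This is a rough sketch using Python's standard `difflib` and `glob` modules; the function names are made up for illustration, and the default pattern simply assumes the `/tmp/test.py.output.<epoch>` naming produced by the loop above.

```python
import difflib
import glob


def first_and_last(pattern="/tmp/test.py.output.*"):
    """Return the oldest and newest snapshot paths, ordered by their epoch suffix."""
    paths = sorted(glob.glob(pattern), key=lambda p: int(p.rsplit(".", 1)[1]))
    return paths[0], paths[-1]


def added_lines(first_text, last_text):
    """Return the lines that ndiff marks as present only in the last snapshot."""
    diff = difflib.ndiff(first_text.splitlines(), last_text.splitlines())
    return [line[2:] for line in diff if line.startswith("+ ")]


def compare_snapshots(pattern="/tmp/test.py.output.*"):
    """Read the first and last snapshot files and return the newly added lines."""
    first, last = first_and_last(pattern)
    with open(first) as f_first, open(last) as f_last:
        return added_lines(f_first.read(), f_last.read())
```

Log lines that carry a timestamp will differ between snapshots and show up as additions too, so the result may still need to be filtered by eye or with grep.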

This was the obvious difference:

```
Client.VolumeLimitExceeded Client.VolumeLimitExceeded: Volume limit exceeded
```
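For anyone who hits the same limit, here is a small sketch for checking how many volumes an account holds and how many of them are unattached. It assumes boto 2, where `get_all_volumes()` returns `Volume` objects with `status` and `size` attributes, and it treats the EBS status `available` as "not attached to any instance". The helper names are hypothetical, and this does not explain why the volumes were orphaned in the first place.

```python
def unattached(volumes):
    """Return the volumes whose status is 'available' (not attached to an instance)."""
    return [v for v in volumes if v.status == "available"]


def summarize(volumes):
    """Count all volumes and the unattached ones, with the unattached size in GiB."""
    orphans = unattached(volumes)
    return {
        "total": len(volumes),
        "unattached": len(orphans),
        "unattached_gib": sum(v.size for v in orphans),
    }


def fetch_volumes():
    """Fetch every EBS volume in the default region (needs boto 2 and AWS credentials)."""
    # Imported lazily so the helpers above can be used without boto installed.
    import boto
    return boto.connect_ec2().get_all_volumes()
```

Running `summarize(fetch_volumes())` gives a quick view of how much of the total is made up of leftover volumes before deciding what to delete.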