Powering EC2 instances on/off

Hi folks,

We’re trying to implement a system where we can power environments in AWS on and off when they’re not in use. However, the ec2 inventory module excludes instances that are not in a running state. It seems like adding an option to the ec2 module to include stopped instances would work, but then I guess Ansible would need a corresponding option to call the module with, which seems a bit hacky…

Maybe ansible needs a notion of host state? Any thoughts?

Thx!

-cs

I use this module: https://github.com/ansible/ansible/pull/6349

Full disclosure: Michael believes all inventory should be done via inventory scripts; I respectfully disagree. :slight_smile: I find ec2.py to be very slow (20 seconds to refresh the cache with a small number of instances, for example) and prefer querying inventory directly in the script itself for many use cases.

Regards,
-scott

Thanks!

That’s interesting: your module is the same as ec2_facts, just with filtering. And the ec2_facts module says in its notes that it may add filtering. I think I’d agree with Michael’s point of view, but it looks like we’ve already gone down the path of facts living outside the inventory module, so maybe a pull request against ec2_facts with the filters would get accepted. Long run, it does seem like hosts and modules need to have some idea of state…

Actually, it’s not the same as ec2_facts other than it returns facts about an instance.

ec2_facts only works when run on an actual AWS instance (it calls the Amazon ec2 metadata servers) and it only retrieves the facts for that instance alone.

ec2_instance_facts, on the other hand, can retrieve multiple instance facts at once from anywhere (I use it in a local action). It’s more like ec2.py run for specific instances from within a playbook.
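
For contrast, a bare-bones ec2_facts call looks something like this; it has to run on the EC2 host itself, since the module only talks to that instance’s own metadata service and exposes what it finds as ansible_ec2_* facts:

    - name: Gather EC2 metadata facts for the current host only
      ec2_facts:

    - debug: var=ansible_ec2_instance_id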

Regards,
-scott

Thanks for the clarification, right, the use case and implementation are a bit different. Seems like they could be combined however.

"so maybe a pull request against ec2_facts with the filters would get accepted. Long run it does seem like hosts and modules need to have some idea of state … "

Anything applying to more than one host definitely shouldn’t be done by the facts module.

So, I’m curious, for the case where you want to start “stopped” EC2 instances, what’s the current recommended approach?

I’ve kind of ignored this task for now and have been managing it by hand (it’s just our dev env, but it’s still a couple of dozen instances at least). I’m just about to pull Scott’s branch in locally, since it looks so much better than manual management.
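
In the meantime, one thing I’ve been meaning to try is whether the plain ec2 module can at least handle the start-by-tag part, something along these lines (untested; the tag and region are just placeholders for our setup):

    - name: Start any stopped dev instances identified by tag
      local_action:
        module: ec2
        state: running
        instance_tags:
          environment: dev
        region: us-east-1
        wait: yes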

Here’s an example in case you do use ec2_instance_facts. This example creates maintenance instances for updating AMIs.

Notes:

  • This is part of a set of scripts that will create an entire load balanced application environment (including DNS, VPC, centralized logging, and RDS) in a bare AWS account in about 20-30 minutes.
  • app_environment is dev, test, stage, or prod. The scripts will create the same setup in each environment with some differences such as RDS size, domain name, and so forth.
  • I use a naming convention for AWS resources of ‘app-environment-type-name’, e.g. foo-stage-ec2-logging or foo-prod-ami-web.

The base image is created from a standard Ubuntu LTS instance. Then, packages common to all of the images (e.g. security, ansible, boto, etc.) are installed and configured.

There’s a separate pull request (also rejected, hi Michael… :wink:) for the ec2_ami_facts module.

    - name: Obtain list of existing AMIs
      local_action:
        module: ec2_ami_facts
        description: "{{ ami_image_name }}"
        tags:
          environment: "{{ app_environment }}"
        region: "{{ vpc_region }}"
        aws_access_key: "{{ aws_access_key }}"
        aws_secret_key: "{{ aws_secret_key }}"
      register: ami_facts
      ignore_errors: yes

If a version of the AMI exists, record this. Otherwise use the base Ubuntu image.

    - set_fact:
        environment_base_image_id: "{{ ami_facts.images[0].id }}"
      when: ami_facts.images|count > 0

    - set_fact:
        environment_base_image_id: "{{ ami_base_image_id }}"
      when: ami_facts.images|count == 0

See if the maintenance image for this image type for this environment is running.

    - name: Obtain list of existing instances
      local_action:
        module: ec2_instance_facts
        name: "{{ ami_maint_instance_name }}"
        # Everything but terminated
        states:
          - pending
          - running
          - shutting-down
          - stopped
          - stopping
        tags:
          environment: "{{ app_environment }}"
        region: "{{ vpc_region }}"
        aws_access_key: "{{ aws_access_key }}"
        aws_secret_key: "{{ aws_secret_key }}"
      register: instance_facts
      ignore_errors: yes

    - set_fact:
        environment_maint_instance: "{{ instance_facts.instances_by_name.get(ami_maint_instance_name) }}"
      when: instance_facts.instances|count > 0

If there is no such instance, create one.

    - name: Create an instance for managing the AMI creation
      local_action:
        module: ec2
        state: present
        image: "{{ environment_base_image_id }}"
        instance_type: t1.micro
        group: "{{ environment_public_ssh_security_group }}"
        instance_tags:
          Name: "{{ ami_maint_instance_name }}"
          environment: "{{ app_environment }}"
        key_name: "{{ environment_public_ssh_key_name }}"
        vpc_subnet_id: "{{ environment_vpc_public_subnet_az1_id }}"
        assign_public_ip: yes
        wait: yes
        wait_timeout: 600
        region: "{{ vpc_region }}"
        aws_access_key: "{{ aws_access_key }}"
        aws_secret_key: "{{ aws_secret_key }}"
      register: maint_instance
      when: environment_maint_instance is not defined

    - set_fact:
        environment_maint_instance: "{{ maint_instance.instances[0] }}"
      when: maint_instance is defined and maint_instance.instances|count > 0

    - name: Ensure instance is running
      local_action:
        module: ec2
        state: running
        instance_ids: "{{ environment_maint_instance.id }}"
        wait: yes
        wait_timeout: 600
        region: "{{ vpc_region }}"
        aws_access_key: "{{ aws_access_key }}"
        aws_secret_key: "{{ aws_secret_key }}"
      register: maint_instance
      when: environment_maint_instance is defined

If we had to start the instance, then the public IP will not have been defined when we gathered facts above, so get it again.

    - name: Obtain public IP of newly running instance
      local_action:
        module: ec2_instance_facts
        name: "{{ ami_maint_instance_name }}"
        states:
          - running
        tags:
          environment: "{{ app_environment }}"
        region: "{{ vpc_region }}"
        aws_access_key: "{{ aws_access_key }}"
        aws_secret_key: "{{ aws_secret_key }}"
      register: instance_facts
      when: maint_instance|changed

    - set_fact:
        environment_maint_instance: "{{ instance_facts.instances_by_name.get(ami_maint_instance_name) }}"
      when: maint_instance|changed

Pass the collected facts on to the new maintenance image host for configuration by role.

    - name: Add new maintenance instance to host group
      local_action:
        module: add_host
        hostname: "{{ environment_maint_instance.public_ip }}"
        groupname: maint_instance
        app_environment: "{{ app_environment }}"
        # This passes the new/existing private key file to Ansible for use in
        # contacting the hosts. Better way to do this?
        ansible_ssh_private_key_file: "{{ environment_public_ssh_private_key_file }}"
        environment_maint_instance: "{{ environment_maint_instance }}"

    - name: Wait for SSH on maintenance host
      local_action:
        module: wait_for
        host: "{{ environment_maint_instance.public_ip }}"
        port: 22
        # This is annoying as Hades. Sometimes the delay works, sometimes it's
        # not enough. The check fails if the port is open but the ssh daemon
        # isn't yet ready to accept actual traffic, right after the maintenance
        # instance is started.
        #delay: 10
        timeout: 320
        state: started

    # TODO: fix the hardcoded user too
    - name: Really wait for SSH on maintenance host
      local_action: command ssh -o StrictHostKeyChecking=no -i {{ environment_public_ssh_private_key_file }} ubuntu@{{ environment_maint_instance.public_ip }} echo Rhubarb
      register: result
      until: result.rc == 0
      retries: 20
      delay: 10

Regards,
-scott

I’m fairly new to Ansible. How do I get your code into my Ansible install so I can use it? I run from source.

Thanks!

James

I keep all of my new/modified modules in a library directory under where my playbooks are. Ansible will find the modules there and use them over the ones in the Ansible install.
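
For example, my layout is roughly like this (names are just illustrative):

    playbooks/
      site.yml
      library/
        ec2_instance_facts    # custom or modified modules go here and shadow the installed ones
        ec2_ami_facts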

Regards,
-scott

Using local ./library content is fine, but please don’t run a fork with extra packages added if you are going to ask questions about them – or at least identify that you are when you do.

It can make Q&A very confusing when people ask about things that aren’t merged.

Just as a side note, I was able to get the wait_for module to work for ssh with a bit of fiddling (so you don’t have to wait with 2 tasks):

    - hosts: 127.0.0.1
      connection: local
      gather_facts: false
      vars_files:
        - env.yaml
      tasks:
        - name: Wait for SSH to come up after the reboot
          wait_for: host={{item}} port=22 delay=60 timeout=90 state=started
          with_items: groups.tag_env_{{pod}}_
          ignore_errors: yes
          register: result
          until: result.failed is not defined
          retries: 5

This seems to work for me all the time, but maybe I just got lucky. I create groups based on tags (“class_database”, “class_monitoring”, “env_qa1”, and so on), which I register using add_host.
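
For reference, the registration step looks roughly like this (ec2_result is just whatever name I registered from the provisioning task):

    - name: Register new instances into a tag-based group
      add_host:
        hostname: "{{ item.public_ip }}"
        groupname: "tag_env_{{ pod }}_"
      with_items: ec2_result.instances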

I’m always a bit wary when so many keywords come together. It’s usually the sign something can be simplified and is not “Ansible-like” enough.

    - name: Wait for SSH to come up after the reboot
      wait_for: host={{item}} port=22 delay=60 timeout=90 state=started
      with_items: groups.tag_env_{{pod}}_
      ignore_errors: yes
      register: result
      until: result.failed is not defined
      retries: 5

Can likely be simplified to:

    - hosts: localhost
      tasks:
        - ec2: # provisioning step here with add_host…

    - hosts: groups.tag_env_{{ pod }}_
      tasks:
        - name: Wait for SSH to come up after the reboot
          local_action: wait_for host={{ inventory_hostname }} port=22 delay=60 timeout=90

A few key concepts:

(A) Using the host loop is clearer than doing a “with_items” across the group

(B) You should only need to do one wait_for. Consider increasing the timeout rather than looping over a retry

(C) You should not need to register the result of the retry since there is no loop

(D) You won’t need to ignore errors because we’re running wait_for off localhost, which we know we can connect to.

Then I can consider this a bug report: without retries, wait_for fails for every EC2 AMI I tried (admittedly, they’re all variations of CentOS).

Things I’ve seen:

  • it reports port open, then refuses to connect
  • it reports a timeout even though I was able to manually log in prior to the timeout
  • it fails with ssh errors while checking the port (this one is a bit rare)

This combination is less than ideal, but it seemed to work for all my cases. Also, a minor thing: you have an ec2 task and then you start using the groups.tag_xxx; is it implied you have an add_host there? Because my ec2 instances won’t appear unless I add that.

Nvm, saw the add_host in the comment.