extra scaling events with ec2_asg module

For Ansible 1.9-develop, pull request 601 had the fix for Issue 383, which affects our production ASG about every two weeks or so. We use the ec2_asg module to refresh our ASG instances three times a day.

I was eager to test. In doing so, I noticed that the replace_all_instances and replace_instances options cause an extra set of scaling events. Has anyone else who uses either replace_ option seen this happen? See below for the screenshot that demonstrates the behavior.

We have one instance in each of two Availability Zones, so we use a batch size of two (actually a formula based on the length of the ASG's availability_zones list; see replace_batch_size in the playbook below).

Interesting… I just tested with a batch size of 1. The extra set of scaling events was also 1, i.e. one new instance launched and one new instance terminated.
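(For reference, that test was just the same replacement task as in the playbook below, with the availability-zone formula swapped out for a hard-coded batch size; a minimal sketch:)

```yaml
- name: Replace current instances one at a time (batch size test)
  local_action:
    module: ec2_asg
    name: "{{ asg_name }}"
    state: present
    replace_all_instances: yes
    lc_check: no
    replace_batch_size: 1
```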

The batch_size logic is broken. I am going to open an Issue in ansible-modules-core, but I welcome others to note their experience here. I'll update this topic with a link to the Issue, too.

```yaml
- name: Retrieve Auto Scaling Group properties
  local_action:
    module: ec2_asg
    name: "{{ asg_name }}"
    state: present
    health_check_type: ELB
  register: result_asg

- name: Auto Scaling Group properties
  debug: var=result_asg

- name: Replace current instances with fresh instances
  local_action:
    module: ec2_asg
    name: "{{ asg_name }}"
    state: present
    min_size: "{{ result_asg.min_size }}"
    max_size: "{{ result_asg.max_size }}"
    desired_capacity: "{{ result_asg.desired_capacity }}"
    health_check_type: "{{ result_asg.health_check_type }}"
    lc_check: no
    replace_all_instances: yes
    replace_batch_size: "{{ result_asg.availability_zones | length }}"
```

In the screenshot, scaling events 1 and 2 are expected; events a through d are extra.

Looking forward to the GitHub issue. Make sure you take a look at the Auto Scaling group and the ELB in the AWS console and see if they give a description of why the instances were terminated. I've seen cases where instances did not come online fast enough, so the ELB marked them as unhealthy and the ASG terminated them.

Thanks,

James

Thanks, James.

All of the instances that were terminated had been marked Unhealthy by terminate_batch().
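(For anyone else digging into this, the termination reasons can also be pulled without the console; a minimal sketch using the command module, assuming the AWS CLI is installed and configured:)

```yaml
# Each activity's Cause field describes why an instance was launched or terminated.
- name: Show recent scaling activities for the ASG
  local_action: command aws autoscaling describe-scaling-activities --auto-scaling-group-name {{ asg_name }} --max-items 10
  register: scaling_activities

- name: Scaling activity causes
  debug: var=scaling_activities.stdout
```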

I am using the changes from this PR: https://github.com/ansible/ansible-modules-core/pull/589, combined with the fixes in PR 601. Rationale: I need lc_check=no to cause all instances to get replaced. With the way the module is currently written, lc_check only replaces an instance if it has a different Launch Config than the one assigned to the ASG. Upon further consideration, I should add a new option instead of overloading the meaning of lc_check.
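(To illustrate the distinction, a minimal sketch of the two behaviors as I read the module; the exact defaults may differ with the PRs applied:)

```yaml
# With lc_check: yes, only instances whose Launch Config differs from the one
# currently assigned to the ASG are considered stale and get replaced.
- name: Replace only instances running an outdated Launch Config
  local_action:
    module: ec2_asg
    name: "{{ asg_name }}"
    state: present
    replace_all_instances: yes
    lc_check: yes

# With lc_check: no, the Launch Config comparison is skipped and every instance
# in the group is cycled, which is what our periodic refresh needs.
- name: Replace every instance regardless of Launch Config
  local_action:
    module: ec2_asg
    name: "{{ asg_name }}"
    state: present
    replace_all_instances: yes
    lc_check: no
```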

I spent some time reworking the algorithm that does the rolling replacement. It is much smarter now, and it shouldn't cause unnecessary scaling events. I've also merged in the functionality of #589. Would you mind giving it a whirl?

https://github.com/ansible/ansible-modules-core/pull/1030