EC2 Rolling Deploy with an ASG

Hello,

I’m trying to use Ansible to do a rolling deploy against an ELB linked to an auto-scaling group (ASG), using a pre-baked AMI. My ideal process would go something like this:

  1. Get the current membership of the ASG
  2. Update the launch configuration for the ASG
  3. For each member:
    3a. Create an instance using the new AMI
    3b. Associate the instance with the ASG
    3c. Terminate the original instance

The other option I was considering was:

  1. Get the current membership of the ASG
  2. Update the launch configuration for the ASG
  3. For each member:
    3a. Terminate the instance
    3b. Wait until the ASG has noticed and launched a new instance before continuing

For the former, I don’t see a way using the built-in EC2 modules to associate an instance with an ASG. For the latter, I’m not clear how I’d wait until the ASG has launched a new instance to catch up with the one I terminated.

Any suggestions on how to do either one, or, if neither is possible, what the best practice would be for what I’m trying to do?

Thanks,

Dan

Hi,

James Martin is working on a 2-3 part blog post on exactly this subject, which shows a couple of ways to do it; I believe we’re going to be posting it this week.

I’ve included him on this mailing list thread if he wants to share some cliff-notes.

–Michael

Dan,

I’ve been tinkering with this process for quite a while and have made a pull request to ansible core that I believe does what you are looking for:

https://github.com/ansible/ansible/pull/8901

As Michael stated, we will be releasing a blog post that goes into more depth, describing a few different ways to perform updates to ASGs that use pre-baked AMIs (this module approach being one of them).

Of course, I’d appreciate any feedback/testing you can provide on that pull request. The documentation is inline in the module source.

Thanks,

  • James

Hi James,

Thanks!

Reading the PR’s examples section, I’m curious why we would show Option 1 if Option 2 is much cleaner, and I’d be interested in the details.

Also, quick question - it’s replacing all instances, but what’s it replacing them with?

Perhaps this is something we should show as well, where we indicate how to specify what the new instance IDs would be.

Can you help me grok the additions?

Thanks again!

+## Option 2

+This does everything that Option 1 does, but is contained inside the module. It’s more opaque,
+but the playbooks end up being much clearer.

FWIW I like the cleanliness of Option 2 - would it still support the options like replace_batch_size?

Michael,

The reason for having both was to spur this very discussion. :) Option 1 is a bit more complicated but more transparent; Option 2 is much easier but less transparent. I’m more fond of Option 2, and happy to make it the only one. BTW, are we talking about the docs or the actual feature?

As far as what the instances are being replaced with -- the ASG is going to spin up new instances with the current launch configuration. With option 2, the module starts by building a list of which instances should be replaced. This list is made up of all instances that have not been launched with the current launch configuration. The module then bumps the size of the ASG by replace_batch_size. It then terminates replace_batch_size instances at a time, waits for the ASG to spin up new instances in their place and become healthy, then continues on down the list until there are no more left to replace. Then it sets the ASG size back to its original value.
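For illustration, a play driving option 2 might look roughly like this; replace_all_instances and replace_batch_size are the option names that ended up in the ec2_asg module, and the ASG/launch config names, sizes and region are just placeholders:

  - hosts: localhost
    connection: local
    tasks:
      # replace any instance whose launch config differs from the one named below,
      # two at a time, waiting for the replacements to become healthy
      - name: roll the ASG onto the new launch configuration
        ec2_asg:
          name: myapp-asg
          launch_config_name: myapp-lc-v2
          replace_all_instances: yes
          replace_batch_size: 2
          min_size: 4
          max_size: 8
          desired_capacity: 4
          region: us-east-1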

James

Daniel,

Yep, Option 2 is designed to work with replace_batch_size. Currently it’s mutually exclusive with “replace_instances”, but I think if we decide to go with Option 2, I could make replace_instances work. I think it might be desirable to replace all instances or just a few, but still have the playbook be nice and clean.

  • James

Michael,

The reason for having both was to spur this very discussion. :). Option 1
is a bit more complicated but more transparent, option 2 is much easier but
less transparent. I'm more fond of option 2, and happy to make it the only
one. BTW, are we talking about the docs or the actual feature?

I'm not sure option 1 is more transparent so much as more manual/explicit? I
guess if you mean "less abstracted", yes. I would prefer the one that lets
me forget more about how it works :)

As far as what the instances are being replaced with-- the ASG is going to
spin up new instances with the current launch configuration. With option 2,
the module starts by building a list of which instances should be
replaced. This list is made up of all instances that have not been
launched with the current launch configuration. The module then bumps the
size of the ASG by replace_batch_size. It then terminates
replace_batch_size instances at a time, waits for the ASG to spin up new
instances in their place and become healthy, then continues on down the
list until there are no more left to replace. Then it sets the ASG size
back to its original value.

Ok, so I'm thinking *MAYBE* in the examples, we show a call to ec2_lc to
show the launch config change prior to the invocation, so the user can see
this in context.
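Something like this, maybe, where the AMI id, names and instance type are placeholders:

  - name: create a launch config that points at the freshly baked AMI
    ec2_lc:
      name: myapp-lc-2014-10-01
      image_id: ami-123456
      instance_type: m3.medium
      key_name: deploy
      security_groups: [ 'appserver' ]
      region: us-east-1

  # then hand the new launch config to the ASG and let it replace the instances
  - name: point the ASG at the new launch config and roll the instances
    ec2_asg:
      name: myapp-asg
      launch_config_name: myapp-lc-2014-10-01
      replace_all_instances: yes
      replace_batch_size: 1
      region: us-east-1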

Sidenote to all - our ec2 user guide in the docs is lacking, and I'm open
to having it mostly rewritten. Showing a more end-to-end tutorial, maybe
one simple ec2 one and another using ec2_lc/asg, would be really awesome
IMHO.

Just wanted to note that this code has now been merged into ec2_asg in the Ansible 1.8 devel branch, and the docs have been updated with examples.

Thanks,

James

I have some relatively extensive documentation on ec2 - it might be a little too over the top for the user guide.
http://willthames.github.io/2014/03/17/ansible-layered-configuration-for-aws.html

If you want me to incorporate any or all of it into the user guide, I’d be happy to do so.

I haven’t done enough with ASGs to contribute much (and it seems like James’ docs are pretty good to go anyway).

Will

I definitely would like to see the ec2 guide upgraded to teach more ec2 concepts.

It’s largely a holdover from the very early days, and needs to show some basics like using add_host together with ec2 (as is shown elsewhere)
but also some more idioms.
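For instance, the usual register-the-result-and-add_host idiom, with the image id, key and security group names as placeholders:

  - name: launch a couple of instances
    ec2:
      image: ami-123456
      instance_type: m3.medium
      key_name: deploy
      group: webserver
      count: 2
      wait: yes
      region: us-east-1
    register: ec2

  # put the new hosts into an in-memory group so later plays can configure them
  - name: add the new instances to the 'launched' group
    add_host:
      name: "{{ item.public_ip }}"
      groups: launched
    with_items: "{{ ec2.instances }}"

  - name: wait for SSH to come up on the new instances
    wait_for:
      host: "{{ item.public_ip }}"
      port: 22
      delay: 10
      timeout: 320
    with_items: "{{ ec2.instances }}"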

I’d very much welcome it being mostly rewritten, should you want to take a stab at improving it.

Wow, I wish I’d seen this conversation earlier.

I have a module that does this, using something similar to option 1.

My module respects multi-AZ load balancers and results in a completely transparent deploy, so long as the code in the new AMI can run alongside the old code. There’s the start of two different methods: one replaces a single instance at a time, and the other fires up all the new instances in the proper VPCs, waits for them to initialize, adds them to the ELB and ASG, then terminates the old instances once they’re all stable.

You also have to set up session pinning and draining on the ELB for it to function correctly. Otherwise you can end up with someone getting assets from two different code bases.

There’s actually a more reliable way to do it that involves using intermediary instances, but we haven’t gotten that far yet.

-scott

For comparison:

https://github.com/scottanderson42/ansible/blob/ec2_vol/library/cloud/ec2_asg_cycle

Still a work in progress (as you should be able to tell from the logging statements :-), but we’ve been using it in production for several months and it’s (now) battle tested. The “Slow” method is unimplemented but is intended to be your Option 2.

-scott

Scott,

Neat to see someone else’s approach. The “fast method” you have there could probably be worked into what’s been merged. Another approach (maybe simpler) would be to just stand up a parallel ASG with the new AMI.

I like making the AutoScale Group do the instance provisioning, versus your approach of provisioning the instance and then moving it to an ASG. From what I can tell, your module doesn’t seem to be idempotent – so if it’s run, it’s always going to act. The feature I added only updates instances if they have a launch config that is different from what’s currently assigned to the ASG. So it’s safe to run again (or continue a run that failed for some reason), without having to cycle through all the instances again.

We will be publishing an article on some different approaches that we’ve worked through for doing this “immutablish” deploy stuff sometime next week.

Scott,

Neat to see someone else’s approach. The “fast method” you have there could probably be worked into what’s been merged. Another approach (maybe simpler) would be to just stand up a parallel ASG with the new AMI.

The general problem with this approach is that it doesn’t work well for blue-green deployments, nor when the new code can’t coexist with the currently running code. We make that decision before deploy time and put the site in maintenance mode if we determine there’s an incompatibility between the two versions.

I think we’re probably going to move to a system that uses a tier of proxies and two ELBs. That way we can update the idle ELB, change out the AMIs, and bring the updated ELB up behind an alternate domain for the blue-green testing. Then when everything checks out, switch the proxies to the updated ELB and take down the remaining, now idle ELB.

Amazon would suggest using Route53 to point to the new ELB, but there’s too great a chance of faulty DNS caching breaking a switch to a new ELB. Plus there’s a 60s TTL to start with regardless, even in the absence of caching.

I like making the AutoScale Group do the instance provisioning, versus your approach of provisioning the instance and then moving it to an ASG. From what I can tell, your module doesn’t seem to be idempotent – so if it’s run, it’s always going to act. The feature I added only updates instances if they have a launch config that is different from what’s currently assigned to the ASG. So it’s safe to run again (or continue a run that failed for some reason), without having to cycle through all the instances again.

You may have missed the “cycle_all” parameter. If False, only instances that don’t match the new AMI are cycled.

Using the ASG to do the provisioning might be preferable if it’s reliable. At first I went that route, but I was having problems with the ASG’s provisioning being non-deterministic. Manually creating the instances seems to ensure that things happen in a particular order and with predictable speed. As mentioned, the manual method definitely works every time, although I need to add some more timeout and error checking (like what happens if I ask for 3 new instances and only get 2).

I have a separate task that cleans up the old AMIs and LCs, incidentally. I keep the most recent around as a backup for quick rollbacks.

We will be publishing an article on some different approaches that we’ve worked through for doing this “immutablish” deploy stuff sometime next week.

I’m looking forward to reading it for sure.

Regards,
-scott

The general problem with this approach is that it doesn’t work well for
blue-green deployments, nor if the new code can’t coexist with the
currently running code.

Yep, understood.

I think we’re probably going to move to a system that uses a tier of
proxies and two ELBs. That way we can update the idle ELB, change out the
AMIs, and bring the updated ELB up behind an alternate domain for the
blue-green testing. Then when everything checks out, switch the proxies to
the updated ELB and take down the remaining, now idle ELB.

Not following this exactly -- what's your tier of proxies? You have a
group of proxies (haproxy, nginx) behind a load balancer that point to your
application?

Amazon would suggest using Route53 to point to the new ELB, but there’s
too great a chance of faulty DNS caching breaking a switch to a new ELB.
Plus there’s a 60s TTL to start with regardless, even in the absence of
caching.

Quite right. There are some interesting things you can do with tools you
could run on the hosts that would redirect traffic from blue hosts to the
green LB, socat being one. After you notice no more traffic coming to
blue, you can terminate it.
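e.g. once the old app on a blue host has been stopped, even something as blunt as a socat forwarder would do; the green ELB hostname here is made up:

  # fire-and-forget: let socat relay any stragglers to the green ELB for up to an hour
  - name: forward remaining traffic from blue hosts to the green ELB
    command: socat TCP-LISTEN:80,fork,reuseaddr TCP:green-elb.example.com:80
    async: 3600
    poll: 0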

You may have missed the “cycle_all” parameter. If False, only instances
that don’t match the new AMI are cycled.

You're right, I did miss that. By checking the AMI, you're only updating
the instance if the AMI changes. If you check the launch config, you are
updating the instances if any component of the launch config has changed
-- AMI, instance type, address type, etc.

Using the ASG to do the provisioning might be preferable if it’s reliable.
At first I went that route, but I was having problems with the ASG’s
provisioning being non-deterministic. Manually creating the instances seems
to ensure that things happen in a particular order and with predictable
speed. As mentioned, the manual method definitely works every time,
although I need to add some more timeout and error checking (like what
happens if I ask for 3 new instances and only get 2).

I didn't have any issues with the ASG doing the provisioning, but I would
say nothing is predictable with AWS :).

I have a separate task that cleans up the old AMIs and LCs, incidentally.
I keep the most recent around as a backup for quick rollbacks.

That's cool, care to share?

Yes, nginx or some other HA-ish thing. If it’s nginx then you can maintain a brochure site even if something horrible happens to the application.

That’s an interesting idea, but it fails if people are behind a caching DNS and they visit after you’ve terminated the blue traffic but before their caching DNS lets go of the record.

That’s true, but if I’m changing instance types I’ll generally just cycle_all. Because of the connection draining and the parallelism of the instance creation, it’s just as quick to do all of them as just the ones that need changing. That said, it’s an obvious optimization for sure.

Very true. Over the past few months I’ve had several working processes just fail with no warning. The most recent is AWS sometimes refusing to return the current list of AMIs. Prior to that it was the Available status on an AMI not really meaning available. Now I check the list of returned AMIs in a loop until the one I’m looking for shows up, Available status notwithstanding. Very frustrating. Things could be worse, however: the API could be run by Facebook…
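FWIW, that sort of retry can also be done straight from a play with an until loop; a rough sketch, with the AMI id variable and retry counts as placeholders:

  - name: wait until the new AMI is actually visible to describe-images
    command: aws ec2 describe-images --image-ids {{ new_ami_id }}
    register: ami_check
    until: ami_check.rc == 0
    retries: 30
    delay: 10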

I think I’ve posted it before, but here’s the important bit. After deleting everything but the oldest backup AMI (determined by naming convention or tags), delete any LC that doesn’t have an associated AMI:

def delete_launch_configs(asg_connection, ec2_connection, module):
    changed = False

    launch_configs = asg_connection.get_all_launch_configurations()

    for config in launch_configs:
        # if the launch config's AMI no longer exists, the config is useless: delete it
        image_id = config.image_id
        images = ec2_connection.get_all_images(image_ids=[image_id])

        if not images:
            config.delete()
            changed = True

    module.exit_json(changed=changed)

-scott

Hi all,

Sorry for resurrecting an old thread, but wanted to mention my experience thus far using ec2_asg & ec2_lc for code deploys.

I’m more or less following the methods described in this helpful repo:

https://github.com/ansible/immutablish-deploys

I believe the dual_asg role is accepted as the more reliable method for deployments: because a deployment uses two ASGs, it’s possible to just delete the new ASG and everything goes back to normal. This is the “Netflix” manner of releasing updates.

The thing I’m finding though is that instances become “viable” well before they’re actually InService in the ELB. From the ec2_asg code and by running ansible in verbose mode it’s clear that ansible considers an instance viable once AWS indicates that instances are Healthy and InService. Checking via the AWS CLI tool, I can see that the ASG shows instances as Healthy and InService, but the ELB shows OutOfService.

The AWS docs are clear about the behavior of autoscale instances with health check type ELB: “For each call, if the Elastic Load Balancing action returns any state other than InService, the instance is marked as unhealthy.” But this is not actually the case.

Has anyone else encountered this? Any suggested workarounds or fixes?

Thanks,
Ben

Ben,

Thanks for the question. Considering this: http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/as-add-elb-healthcheck.html, “Auto Scaling marks an instance unhealthy if the calls to the Amazon EC2 action DescribeInstanceStatus return any state other than running, the system status shows impaired, or the calls to Elastic Load Balancing action DescribeInstanceHealth returns OutOfService in the instance state field.”

For determining the instance health status, we are fetching an ASG object in boto and checking the health_status attribute for each instance in the ASG, which is equal to either “healthy” or “unhealthy”. Are you using an instance grace period option for the ELB? See HealthCheckGracePeriod in http://docs.aws.amazon.com/AutoScaling/latest/APIReference/API_CreateAutoScalingGroup.html. This option is configurable with the health_check_period setting found in the ec2_asg module. By default it is 500, which would prematurely report an instance as healthy, since the ASG marks any instance as healthy for those 500 seconds.
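For reference, both settings live on the ec2_asg call itself; a rough sketch, with the names, ELB and numbers purely for illustration:

  - ec2_asg:
      name: myapp-asg
      launch_config_name: myapp-lc-v2
      load_balancers: [ 'myapp-elb' ]
      health_check_type: ELB
      # how long the ASG treats a fresh instance as healthy regardless of what the ELB reports
      health_check_period: 120
      replace_all_instances: yes
      region: us-east-1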

  • James

Hi James,

Thanks for your reply.

Interesting point about the HealthCheckGracePeriod option. I wasn’t aware of its role here. I am indeed using it; in fact, according to the docs, it is a required option for ELB health checks. I had it set to 180, and I just tried lower values of 10 seconds and 1 second. In both cases the behavior is the same: the autoscale group considers the instances healthy (because of the grace period, even at the lower value), and as a result ansible moves on before the instances are InService in the ELB. Even with HealthCheckGracePeriod at the lowest possible value of 1 second, a race exists between the module’s health check and the ELB grace period.

I’ve worked around this for now with a script that does the following:

  • Find the instances in the ASG
  • Check the ELB to determine if they are healthy or not
  • Exit 1 if not, 0 if yes

Then I use an ansible task with an “until” loop to check the return code. The script is here:

https://gist.github.com/anonymous/05e99828848ee565ed33
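Roughly, the driving task looks like this; the script name, arguments and retry counts are placeholders for whatever the gist actually uses:

  - name: wait until every ASG instance is InService in the ELB
    command: ./asg_elb_health.py --asg-name {{ asg_name }}
    register: elb_health
    until: elb_health.rc == 0
    retries: 60
    delay: 10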

Happy to work this into an ansible module if you think this is useful. Or did I misunderstand the point about the health check grace period?

Thanks,
Ben