Scott,
Neat to see someone else’s approach. The “fast method” you have there could probably be worked into what’s been merged. Another approach (maybe simpler) would be to just stand up a parallel ASG with the new AMI.
The general problem with this approach is that it doesn’t work well for blue-green deployments, nor if the new code can’t coexist with the currently running code. We make that decision before deploy time and put the site in maintenance mode if we determine there’s an incompatibility between the two versions.
I think we’re probably going to move to a system that uses a tier of proxies and two ELBs. That way we can update the idle ELB, change out the AMIs, and bring the updated ELB up behind an alternate domain for the blue-green testing. Then, when everything checks out, we switch the proxies to the updated ELB and take down the remaining, now-idle ELB.
Amazon would suggest using Route53 to point to the new ELB, but there’s too great a chance of faulty DNS caching breaking a switch to a new ELB. Plus there’s a 60s TTL to start with regardless, even in the absence of caching.
I like making the AutoScale Group do the instance provisioning, versus your approach of provisioning the instance and then moving it to an ASG. From what I can tell, your module doesn’t seem to be idempotent – so if it’s run, it’s always going to act. The feature I added only updates instances if they have a launch config that is different from what’s currently assigned to the ASG. So it’s safe to run again (or continue a run that failed for some reason), without having to cycle through all the instances again.
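For illustration, here is a minimal sketch of that kind of check. It is not the merged module's actual code. It assumes the ASG is described by a plain dict shaped like one group from a boto3 describe_auto_scaling_groups response, and the function name and sample data are hypothetical.

```python
def instances_to_replace(asg):
    """Return IDs of instances whose launch config differs from the ASG's.

    `asg` is assumed to be a dict with a "LaunchConfigurationName" key and an
    "Instances" list, where each instance has "InstanceId" and (optionally)
    "LaunchConfigurationName". An instance with no launch config recorded is
    treated as out of date.
    """
    current = asg["LaunchConfigurationName"]
    return [
        inst["InstanceId"]
        for inst in asg["Instances"]
        if inst.get("LaunchConfigurationName") != current
    ]


# Hypothetical sample data: one instance still on the old launch config.
example_asg = {
    "LaunchConfigurationName": "web-lc-v2",
    "Instances": [
        {"InstanceId": "i-aaa", "LaunchConfigurationName": "web-lc-v1"},
        {"InstanceId": "i-bbb", "LaunchConfigurationName": "web-lc-v2"},
    ],
}

# Once every instance matches, the list is empty and a re-run has nothing to do.
updated_asg = {
    "LaunchConfigurationName": "web-lc-v2",
    "Instances": [
        {"InstanceId": "i-bbb", "LaunchConfigurationName": "web-lc-v2"},
        {"InstanceId": "i-ccc", "LaunchConfigurationName": "web-lc-v2"},
    ],
}
```

Because the decision is derived from current state rather than from "was this task run", a failed run can simply be started again and it will pick up only the instances that still need replacing.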
You may have missed the “cycle_all” parameter. If False, only instances that don’t match the new AMI are cycled.
Using the ASG to do the provisioning might be preferable if it’s reliable. At first I went that route, but I was having problems with the ASG’s provisioning being non-deterministic. Manually creating the instances seems to ensure that things happen in a particular order and with predictable speed. As mentioned, the manual method definitely works every time, although I need to add some more timeout and error checking (like what happens if I ask for 3 new instances and only get 2).
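The timeout and error checking mentioned above might look roughly like the sketch below. The names are hypothetical: count_ready stands in for whatever call reports how many of the requested instances are actually up, and the default timeout and polling interval are arbitrary assumptions.

```python
import time


def wait_for_instances(count_ready, expected, timeout=300.0, interval=5.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll until at least `expected` instances are ready, or fail loudly.

    `count_ready` is a caller-supplied function returning the number of
    instances currently ready (hypothetical; e.g. a wrapper around an EC2
    status query). Raises TimeoutError if the count is still short when
    the deadline passes, so "asked for 3, got 2" is not silently accepted.
    """
    deadline = clock() + timeout
    while True:
        ready = count_ready()
        if ready >= expected:
            return ready
        if clock() >= deadline:
            raise TimeoutError(
                f"only {ready} of {expected} instances ready after {timeout}s"
            )
        sleep(interval)
```

Passing sleep and clock as parameters is only there to make the function easy to exercise without real waiting.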
I have a separate task that cleans up the old AMIs and LCs, incidentally. I keep the most recent around as a backup for quick rollbacks.
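As a rough sketch of that retention rule (not the actual cleanup task), assuming each image is a dict with "ImageId" and an ISO 8601 "CreationDate" string as EC2 reports it; the keep parameter and the sample data are hypothetical:

```python
def stale_images(images, keep=1):
    """Return the IDs of images to delete, keeping the `keep` newest.

    ISO 8601 timestamps in the same format and timezone sort correctly
    as plain strings, so no date parsing is needed here.
    """
    ordered = sorted(images, key=lambda img: img["CreationDate"], reverse=True)
    return [img["ImageId"] for img in ordered[keep:]]


# Hypothetical sample data.
example_images = [
    {"ImageId": "ami-old", "CreationDate": "2015-01-05T10:00:00.000Z"},
    {"ImageId": "ami-prev", "CreationDate": "2015-02-10T10:00:00.000Z"},
    {"ImageId": "ami-live", "CreationDate": "2015-03-01T10:00:00.000Z"},
]
```

With keep=2 this retains the live image plus one backup for rollback; the same ordering idea would apply to launch configurations by their creation time.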
We will be publishing an article on some different approaches that we’ve worked through for doing this “immutablish” deploy stuff sometime next week.
I’m looking forward to reading it for sure.
Regards,
-scott