Immutable servers using rax module


We were considering taking an “immutable server” approach to deployments (we may not be able to, because a third party whitelists our IPs). Is that something that would be possible using the rax module? A deployment would consist of spinning up and provisioning x new servers, adding them to the LB, waiting to check for errors, and then removing the old servers. I couldn’t think of a way to do so while using the name & exact_count approach.



I believe this to be possible. I don’t think I can architect a full solution for you over email, but we can go over some key points. There are likely to be a number of “moving” parts here.

It sounds like you want to perform something similar to a blue-green deployment while keeping the original LB.

Here are some of my thoughts.

  1. Run a task using rax_clb and register the result. This will contain a list of the current nodes.
  2. Run your provisioning task using the rax module with exact_count, but with a way to double the exact_count (e.g. 20 instead of the normal 10) so that new servers are provisioned. Register this result too.
  3. The rax module returns some important top-level keys; in this case ‘success’ will hold a list of all of the machines that were just created.
    a. You could also first run the rax module with the existing exact_count, which would return the current servers in something like rax.instance_ids.instances.
  4. Add these new servers to a host group using add_host.
  5. Configure the new servers.
  6. Run rax_clb_nodes to add the new servers to the LB.
  7. Test somehow.
  8. Using the node information from #1, use rax_clb_nodes to mark the old nodes as draining.
  9. Same as #8, but now remove them.
  10. Run a rax module task again and drop the exact_count to what you expect (e.g. 10 instead of 20). In this scenario the rax module deletes the oldest servers, which should be the ones you expect.
    a. If you did 3a, you could instead use a rax task with state=absent and pass rax.instance_ids.instances to the ‘instance_ids’ attribute to delete the old servers.

Without having tried it, those are roughly the steps I would try to perform.
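To make the ordering of steps 1 and 6 to 9 concrete, here is a rough in-memory sketch in Python. It does not call Ansible or the Rackspace API: the load balancer is just a dict of address to condition, the function name roll_over and the healthy callback are made up for illustration, and backing the new nodes out when the test fails is my own assumption rather than one of the steps above.

```python
def roll_over(lb_nodes, new_servers, healthy):
    """Simulate the LB side of the rollover on a dict of address -> condition.

    lb_nodes is mutated in place. healthy is a hypothetical callback standing
    in for step 7 ("test somehow"). Returns True if the cutover completed.
    """
    old = list(lb_nodes)                 # step 1: remember the current nodes
    for addr in new_servers:             # step 6: add the new servers
        lb_nodes[addr] = "ENABLED"
    if not healthy(new_servers):         # step 7: test
        for addr in new_servers:         # assumption: back out on failure
            del lb_nodes[addr]
        return False
    for addr in old:                     # step 8: drain the old nodes
        lb_nodes[addr] = "DRAINING"
    for addr in old:                     # step 9: remove the old nodes
        del lb_nodes[addr]
    return True


# Hypothetical addresses, for illustration only.
lb = {"10.0.0.1": "ENABLED", "10.0.0.2": "ENABLED"}
done = roll_over(lb, ["10.0.0.3", "10.0.0.4"], healthy=lambda servers: True)
print(done, sorted(lb))
```

In a real playbook each loop would be a rax_clb_nodes task, and you would wait between draining and removing so that open connections can finish.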

You might also want to look at

Sorry, I forgot to tick the box that notifies me of a reply.

Yes, thanks. We’re already doing the zero-downtime bit and removing servers from the LB before updating them. Your point #10 is the interesting bit: the fact that the module removes the oldest servers first is the piece of information I needed. I’d probably start by adding one new server with the new code and keep an eye out for errors. Then, if all was well, I’d add the rest and remove the old ones. Or, if there were problems, bin the new one.
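One thing worth checking before relying on this for the one-server trial: if the oldest-first behaviour described in #10 holds, lowering exact_count again would remove an old server, not the new one. The Python sketch below only models that rule on a list of IDs; removed_by_count and the IDs are hypothetical and this is not the module’s actual code.

```python
def removed_by_count(ids_oldest_first, exact_count):
    """Model of oldest-first shrinking: the IDs dropped to reach exact_count."""
    excess = max(len(ids_oldest_first) - exact_count, 0)
    return ids_oldest_first[:excess]


# Hypothetical fleet: three old servers plus one trial server, oldest first.
fleet = ["old-1", "old-2", "old-3", "canary"]

# Dropping the count back to 3 would pick the oldest server under this model.
by_count = removed_by_count(fleet, 3)

# Binning the trial server would need an explicit ID list instead,
# along the lines of the state=absent / instance_ids route in 10a.
by_id = ["canary"]

print(by_count, by_id)
```

So under this model, “bin the new one” is a job for the instance_ids route, while the count-based route fits the final cleanup of the old servers.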

It certainly sounds like it would be possible. Hopefully I’ll get a chance to give it a try. If I have any more specific questions, I’ll get back to you. Thanks!

I’m currently working on doing this as well. The process I am deploying uses autoscaling groups behind a common CLB. Basically, what we are doing is:

  • create the CLB if it doesn’t already exist. The CLB is identified by a specific naming pattern.
  • create a group containing the IDs of all pre-existing servers attached to the CLB
  • create a new autoscaling group with the new code version, registering them with the CLB and waiting until min_entities are active
  • create a group containing the IDs of all of the NEW servers we just created which are attached to the CLB
  • wait for the new nodes to be ENABLED in the LB (this ensures that their LB health check is passing)
  • drain traffic from the new nodes
  • disable the new nodes but don’t terminate them or their AS groups in case we want to roll back

This is working fine in test but hasn’t been rolled out to the real world yet. My next step is to identify the autoscaling groups containing the old servers so I can automate terminating them after the service is healthy for a week or so.

I’ll share what I can in a blog post once it’s working and rolled out.

The load balancer for the service, which the autoscaling groups will be placed behind, is managed as its own entity, meaning it is not created and destroyed as a normal part of upgrading the service.

That should have read:

  • wait for the new nodes to be ENABLED in the LB (this ensures that their LB healthcheck is passing)
  • drain traffic from the OLD nodes
  • disable the OLD nodes but don’t terminate them or their AS groups in case we want to roll back
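As a rough illustration of the grouping and the “wait, then drain the OLD nodes” gate, here is a small Python sketch. It works on plain sets and dicts rather than the Rackspace API; plan_cutover, the server IDs and the state strings are assumptions made for the example.

```python
def plan_cutover(ids_before, states_after):
    """Decide whether the old nodes can be drained yet.

    ids_before: IDs of servers attached to the CLB before the deploy.
    states_after: dict of every currently attached server ID -> LB state.
    Returns ("wait", []) or ("drain_old", [old IDs still attached]).
    """
    old = set(ids_before)
    new = set(states_after) - old        # whatever was not there before is new
    if not new or any(states_after[n] != "ENABLED" for n in new):
        return ("wait", [])              # nothing new yet, or not all healthy
    return ("drain_old", sorted(old & set(states_after)))


# Hypothetical IDs and states, for illustration only.
before = ["srv-a", "srv-b"]
after = {"srv-a": "ENABLED", "srv-b": "ENABLED",
         "srv-c": "ENABLED", "srv-d": "ENABLED"}
print(plan_cutover(before, after))
```

The old nodes are only returned for draining, never for deletion, which keeps the rollback option described above.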