I’ve searched quite a bit and I don’t think anyone has requested this yet, so here goes.
I’ve got a fairly large server fleet with a suitably diverse matrix of roles, locations and so on. We use the slicing functionality to do rolling deployments (e.g. upgrade 20% of the servers first, see if anything goes boom, and then roll out the rest) and it works quite well.
Except for one thing - the serial keyword makes it difficult to reuse a specific play across multiple environments (the production host group may have 500 nodes, the staging group 200, the UAT group 50, etc.), because it is a raw attribute of the play, assumed to be an integer, and set in Play.__init__().
I’m in total agreement with part of this post: https://groups.google.com/d/msg/ansible-project/nWtHpwJ9eCI/YktzldxJEmcJ - “if the node fails, it fails whilst it is out of rotation, and your rolling update slice is smaller than your total node pool count, so you never hit outage”.
But if your serial is set to, say, 20 because your prod cluster has 50 nodes, then you run into trouble when you do a rolling deployment on a host group that only has 10 nodes (because it’s the secondary test cluster or whatever): the whole group fits into a single batch, so your deployment script takes all the nodes out of the load balancer at once. This happened to me today and was the impetus for this post.
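To make the arithmetic concrete, here is a minimal Python sketch of fixed-size batching. It is not Ansible's actual code; the function name `batches` and the host names are made up for illustration, and I am assuming that a serial of 0 means "all hosts in one batch".

```python
def batches(hosts, serial):
    """Split hosts into consecutive batches of at most `serial` hosts.

    Illustrative only: a serial of 0 (or less) is assumed to mean
    "no slicing", i.e. every host in a single batch.
    """
    if serial <= 0:
        return [list(hosts)]
    return [hosts[i:i + serial] for i in range(0, len(hosts), serial)]


# Hypothetical host groups matching the sizes in the example above.
prod = ["prod%02d" % i for i in range(50)]
test = ["test%02d" % i for i in range(10)]

prod_sizes = [len(b) for b in batches(prod, 20)]
test_sizes = [len(b) for b in batches(test, 20)]

print(prod_sizes)  # [20, 20, 10]
print(test_sizes)  # [10] - the whole test group goes out in one batch
```

With serial at 20, the 50-node group is rolled in three batches, but the 10-node group ends up in a single batch, which is exactly the "everything out of the load balancer at once" case.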
I’ve been eyeballing the codebase and I’m wondering what the implications would be of having the serial attribute accept a variable instead of a raw integer (and maybe the max_fail_pct attribute too, whilst I’m in the rolling-update zone). This would let you specify serial at the host group level - so as you add more nodes to e.g. your webserver host group, you could grow the serial batch size along with it.
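Roughly what I have in mind, as a hedged Python sketch rather than a patch against the real codebase: the helper `resolve_serial`, the variable name `rolling_batch` and the per-group dictionaries are all hypothetical, and the `{{ name }}` handling is a crude stand-in for proper templating.

```python
def resolve_serial(raw, group_vars):
    """Return an integer batch size from a raw `serial` value.

    `raw` may be an int, a numeric string, or a variable reference such
    as "{{ rolling_batch }}" that is looked up in `group_vars`.
    Hypothetical helper: real templating would be more involved.
    """
    if isinstance(raw, int):
        return raw
    text = str(raw).strip("{} ")  # crude: drop surrounding braces and spaces
    if text.isdigit():
        return int(text)
    return int(group_vars[text])  # KeyError if the variable is undefined


# Hypothetical per-group variables, e.g. as they might come from inventory.
prod_vars = {"rolling_batch": 20}
test_vars = {"rolling_batch": 2}

prod_serial = resolve_serial("{{ rolling_batch }}", prod_vars)
test_serial = resolve_serial("{{ rolling_batch }}", test_vars)
print(prod_serial, test_serial)  # 20 2
```

The same play could then be pointed at prod or at the small test cluster, with each host group supplying its own batch size, while a plain integer keeps working as it does today.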
Cheers
-Howard