General question on implementing a gradual rollout service

Hi there,

I have a general design question about implementing a maintenance deployment service. For this service, I have the following goals:

  1. Instead of a simple one-shot deployment, I would like to design an abstraction layer that gradually rolls out the deployment to all nodes until they reach the target state. The rollout process itself will watch the state of each host and use Ansible playbooks as the tool to apply the changes.
  2. The rollout process involves smart scheduling, including awareness of the different node pools inside k8s (we have divided the k8s nodes into groups for different teams, and our current consensus is to perform maintenance on at most one node per group at a time to avoid disruption).
  3. The rollout process will also need to be smart enough to detect the nodes to be impacted, check whether there is any client traffic running on them, prepare the nodes for maintenance (cordon, drain, etc.), and perform custom tests after a node finishes maintenance.
  4. The rollout process can also be tracked with clear progress and an audit trail of the actions taken.
  5. The rollout config should be managed per individual maintenance plan, with approval capability (a rough sketch of what such a plan could look like is below).
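To illustrate goal 5, a purely hypothetical per-maintenance-plan config might look like the sketch below; none of these field names come from Ansible, AWX, or any existing tool, they just show the kind of information the rollout service would need to track:

```yaml
# Purely hypothetical maintenance-plan config, only to illustrate the
# goals above; every field name here is invented for illustration.
plan: kernel-param-update            # placeholder plan name
playbook: maintenance/kernel_params.yml
max_in_flight_per_group: 1           # goal 2: at most one node per team pool
pre_checks:
  - no_client_traffic                # goal 3: only touch idle nodes
prepare:
  - cordon
  - drain
post_checks:
  - team_smoke_test                  # goal 3: custom test after maintenance
approval_required: true              # goal 5
audit_log: true                      # goal 4
```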

Our internal stack relies heavily on Ansible for infrastructure automation, so I’m evaluating how extensible Ansible AWX is and whether it can be integrated into my design. From what I’ve seen of the current implementation, it seems well suited for simple deployment tracking and management. However, how easy would it be to build the design above on top of Ansible AWX to achieve my goals?

Thanks in advance.

Jianan.

It wasn’t clear to me what you meant by this part until I read the rest of your post, which seems to indicate you’re predominantly interested in performing maintenance on k8s cluster nodes. Is that correct? When you talk about deployments, I assume you mean app deployments, but again my interpretation from the rest of your post is you’re mainly interested in performing k8s cluster activities in an ordered way.

Much of this part can be answered with “it depends”. It depends what you mean by “smart enough to detect the nodes to be impacted”. Who is initiating the maintenance? Are they able to trigger it on a whim? When they perform maintenance on a Kubernetes node, what are they inputting to make it happen? What is your source of truth for being able to “detect the […] impact[..]”? Is it a CMDB? I have never worked at a company so far where IT didn’t despise the CMDB and derisively laugh at how inaccurate it was (despite the techs never making the effort to improve the accuracy, but that’s a digression and a tale as old as time). For determining current traffic, it depends on how you’re front-ending your traffic. If there’s a load balancer in front of your Kubernetes cluster, you’d need to check what API integrations exist, with or without an Ansible module. You can definitely cordon/drain k8s nodes; a quick Google search turns up the kubernetes.core.k8s_drain module (“Drain, Cordon, or Uncordon node in k8s cluster” in the Ansible Community Documentation), for example. You can perform tests after a node is done being maintained; it’s up to you to define the smoke testing…
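As a rough illustration (not from the original posts), a couple of tasks using that module to cordon and then drain a node could look like the sketch below; the `target_node` variable and the drain options are placeholders, so double-check the module documentation for the exact option names:

```yaml
# Minimal sketch: cordon and drain a node ahead of maintenance with
# kubernetes.core.k8s_drain. "target_node" is a placeholder variable.
- name: Prepare node for maintenance
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Cordon the node so nothing new gets scheduled on it
      kubernetes.core.k8s_drain:
        name: "{{ target_node }}"
        state: cordon

    - name: Drain the node, evicting its pods
      kubernetes.core.k8s_drain:
        name: "{{ target_node }}"
        state: drain
        delete_options:
          ignore_daemonsets: true
          delete_emptydir_data: true
```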

It depends on what you mean by tracking with “clear progress”. You mentioned an abstraction layer, so I don’t really know what you’re thinking from an architectural standpoint. You can have Ansible report what node you’re currently taking down before performing the operations, but Ansible’s output is a bit ugly for non-techies to look at. So who is your audience and what is their expectation on visuals?

AWX has workflows with approval nodes (as is my understanding; I never leveraged them myself), or you could perhaps integrate your approval mechanism (ServiceNow?) with the AWX API to launch your maintenance playbook. Again, it’s difficult to be anything but very general without more specific info.
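For the API route, a very rough sketch of launching an AWX job template over REST (for example from a playbook triggered by the approval tool) could look like this; the URL, template ID, token variable, and extra_vars are all placeholders:

```yaml
# Sketch only: launch an AWX job template via its REST API once an
# external approval (e.g. ServiceNow) is granted. URL, template ID,
# token, and extra_vars are placeholders, not values from this thread.
- name: Launch the maintenance job template in AWX
  ansible.builtin.uri:
    url: "https://awx.example.com/api/v2/job_templates/42/launch/"
    method: POST
    headers:
      Authorization: "Bearer {{ awx_api_token }}"
    body_format: json
    body:
      extra_vars:
        target_node: "{{ target_node }}"
    status_code: 201   # AWX returns 201 Created on a successful launch
```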

Ansible has a lot of integrations, and for things it can’t integrate with, its uri module exists and does the job well enough for REST API operations. You can build your own modules, and most AI assistant tools are good at creating them, if you prompt them well enough and are able to interpret their results without going down the route of “vibe coding”.

Hi @mcen1, thanks for your quick response and the detailed write-up! Let me provide you with more context:

  1. The design I’m targeting is not just for k8s cluster nodes but for all services running on the hosts, although I have to admit about 60% of our cluster’s nodes run k8s, so I may have overfit my initial statement of the question to k8s, lol.
  2. Current background: I’m working on infrastructure for both on-prem and cloud. In our current practice, it is a pretty painful, human-centric operation to deploy changes to all existing nodes after a change (kernel parameters, GPU upgrades, etc.) has been implemented and tested on one node. There could be traffic on the nodes, and we need to talk to each individual team to understand their availability, select nodes, plan maintenance, etc.
  3. Proposed design: I would like to apply an operator-like idea to this process (see the playbook sketch after this list):
     (1) When a new change is planned to be rolled out, it is categorized as a maintenance that can be recorded programmatically, with maintenance scripts (mostly Ansible) and post-maintenance verifications (hooks provided by each team).
     (2) Infra engineers perform the initial verification of the change first; it is then handed over to the gradual rollout service for deployment to the rest of the nodes.
     (3) Each team nominates canary nodes from their node pool, and the rollout service tests the maintenance by proceeding on these nodes one node at a time. After each node, the basic node functionality is verified (both infra-level and per-team verification methods) before proceeding to other nodes.
     (4) After all the canary nodes for each team are upgraded, the schedule pauses for a certain period of time, then proceeds with the rest of the nodes once an admin approves.
     (5) When a node has an issue after maintenance, the whole maintenance schedule is paused until the issue is resolved.
     (6) The rollout process performs the deployment on a node only when no traffic is running on it, explicitly isolates the node (cordon for k8s, LB/app-based traffic isolation otherwise), and then proceeds with the maintenance.
  4. Because it runs continuously to apply upgrades, the scheduler will try to pick nodes based on the design above and determine how to proceed based on the current status. From an infrastructure engineer’s or client’s perspective, the service should provide a good sense of tracking: which nodes have a given deployment done, the status of the feature deployment (percentage) across the cluster, and whether any error needs intervention.
  5. Our current source of truth for config management is primarily Ansible in GitHub, and inventory management is through NetBox. Runtime status is normally collected through monitoring services and compared against the intended config in Ansible for tracking purposes.
  6. For approval, I think it is primarily for deployment owners and the client team to determine whether we should (1) start a maintenance deployment, (2) proceed from an error state, and (3) proceed after the canary deployment stage.
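To make the canary, one-node-at-a-time part of item 3 concrete, here is a minimal sketch of how a plain Ansible play could express it; the group name, task files, and `team_name` variable are purely illustrative and not from any existing setup:

```yaml
# Hedged sketch of a one-node-at-a-time rollout that stops as soon as a
# node fails, roughly matching steps (3) and (5) above. "canary_nodes",
# the task files, and "team_name" are placeholders.
- name: Canary maintenance rollout
  hosts: canary_nodes
  serial: 1                # only one node in flight at a time
  any_errors_fatal: true   # a failure on any node aborts the remaining batches
  tasks:
    - name: Apply the maintenance change (placeholder task file)
      ansible.builtin.include_tasks: maintenance_tasks.yml

    - name: Run the per-team post-maintenance verification hook (placeholder)
      ansible.builtin.include_tasks: "verify_{{ team_name }}.yml"
```

The pause-for-approval step between the canaries and the rest of the fleet would then live outside the play, e.g. in an AWX workflow approval node or in the rollout service itself.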

We are actually vibe-coding a lot of stuff, so I believe the implementation should not be a big issue. However, I would like to discuss and understand AWX thoroughly before the actual implementation, since I’m still new to Ansible AWX.

My man, if you’re envisioning tackling a project of that scope with vibe coding, I really do not envy the poor soul who’s tasked with supporting it after turnover. Good luck!

lol, vibe coding is just for the initial draft; seriously, the code will get scrutinized for modularity and future maintainability before going into production :slight_smile:

That said, I know this is a pretty big scope and needs careful design. Do you have more suggestions from your end on it? Thanks.