This allows servers to be placed in keyed groups using the VMware dynamic inventory plugin. So that’s all working fantastically.
My problem now is the best way to go about making sure the tasks that need to be run on the servers (for example, a reboot) are executed precisely in the order of that group. Of course, I could specify that manually, but at this scale (think thousands of servers with dozens of groups) that is not practical.
Does anyone have recommendations on the best way to approach this? Ideally it would scale to multiple customers with different grouping requirements without my having to specify the order manually.
Thanks in advance and let me know if you need any additional information.
I went with that particular structure mostly because of some previous attempts at this setup; the plan is one customer per file. The customer_name portion could certainly be omitted.
Each *_server entry was intended to represent a single server in this example, not a group.
The other reason I included the customer_name portion at the top was as a simple way to differentiate which customer the server belongs to (since I am using dynamic inventory). I also have a tag and a matching keyed group, so I could certainly use that instead.
So based on your recommendations and to avoid confusion:
The order is unfortunately not as simple as groups of servers; rather, each server individually has a specific “power up” or “power down” group. It’s an unfortunate holdover from an archaic piece of software, which also prevents me from simply using a reboot or win_reboot command to handle this more cleanly.
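For anyone following along, a per-customer file with this per-server layout might look roughly like the sketch below. The server names, group numbers, vCenter hostnames, and the `vcenter` key are all made up for illustration; only `down_group` and `up_group` are attribute names that come up later in the thread.

```yaml
# customers/example.yml (hypothetical path and contents)
# One entry per server, not per group. Lower numbers act first.
file_server:
  down_group: 1                       # first to be shut down
  up_group: 3                         # last to be powered back on
  vcenter: vcenter01.example.com      # assumed key for the vCenter host
db_server:
  down_group: 2
  up_group: 2
  vcenter: vcenter01.example.com
web_server:
  down_group: 3
  up_group: 1
  vcenter: vcenter02.example.com
```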
That clarifies things a bit, for me at least. Before we get back to the original question, where’s the Source of Truth? Are these YAML files generated from the tags in vcenter, or are the vcenter tags set based on these files? Or [gulp] are they kept in sync manually?
Help me understand your original question, which was, “My problem now is the best way to go about making sure the tasks that need to be run on the servers (for example, a reboot) are executed precisely in the order of that group.” Can you expand on that scenario? For example, if instead of having Ansible you had an admin sitting by a phone. The phone rings and customer “example” requests a reboot. What exactly does that request include? Based on that request and access to the file(s) like the one above, how does your admin decide what steps to take and in what order? Don’t hesitate to include obvious details; they’re only obvious to you! Explain it like you would to a new intern who’s only here because his Play-Doh dried out. I can’t automate what I can’t explain, and right now I can’t explain your process. Feel free to throw in other use cases / scenarios. I’ve got scroll bars and I’m not afraid to use them.
I would create one vars file per customer and use a variable that sources the customer’s vars file at run time. You could also have a folder per customer that holds all customer-specific items.
This loads your customer-specific YAML file as a dictionary into server_groups. It gives you enormous flexibility to add/remove customers over time without changing your playbook.
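A minimal sketch of what that load step could look like, assuming one file per customer under a `customers/` directory and a `customer_name` variable passed in at run time (both the directory layout and the variable name are assumptions, not something confirmed in this thread):

```yaml
- name: Load the ordering file for the requested customer
  ansible.builtin.include_vars:
    # Hypothetical layout: customers/<customer_name>.yml
    file: "customers/{{ customer_name }}.yml"
    # Everything in the file ends up under this one variable
    name: server_groups
```

With that in place, something like `ansible-playbook reboot.yml -e customer_name=example` would select the customer (`reboot.yml` is a placeholder playbook name).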
Adding to the playbook I already showed … you can now sort this list by the ‘down_group’ (halt) or ‘up_group’ (boot) attribute to get your ordered lists.
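As a rough illustration of the sorting step: the sketch below assumes `server_groups` is a dictionary keyed by server name whose values carry `down_group` and `up_group`. If your data is already a list, drop `dict2items` and sort on the attribute directly. The fact names `halt_order` and `boot_order` are invented for this example.

```yaml
- name: Build ordered halt and boot lists from the customer data
  ansible.builtin.set_fact:
    # dict2items turns {name: {...}} into a list of {key: name, value: {...}}
    halt_order: "{{ server_groups | dict2items | sort(attribute='value.down_group') }}"
    boot_order: "{{ server_groups | dict2items | sort(attribute='value.up_group') }}"

- name: Show the resulting shutdown order
  ansible.builtin.debug:
    msg: "{{ halt_order | map(attribute='key') | list }}"
```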
omitted details, but applies tags based on requirements
So that’s what that looks like as far as tag management. If you think there’s a better way I’m absolutely open to suggestions.
The reboot order of the servers is dictated by the vendor and the unique combination of third-party applications each customer runs. So, for instance, one customer may need a shutdown order of file servers, database servers, batch servers, management servers, web servers; another may need a completely different order.
That’s why I chose the YAML as the source of truth: the order can change as the vendor makes updates to the application, so it needs to be fairly flexible. As far as the customer is concerned, this is considered just a generic “downtime” for patching, application updates, whatever is needed.
So, to answer using your phone analogy:
riiiiiinnggggg
Customer: Hi, application needs to be updated. Can we schedule a downtime for this date and time?
Admin: Why certainly.
–Date and Time Arrives–
Vendor will perform work needed while users are locked out of the system.
Admin will then shut down the servers (perhaps applying patches at the same time) and then bring the servers back up.
Vendor will then unlock the application to end users and go on their way.
Right now, the reboot order is managed via a CSV spreadsheet and a nightmare-ish PowerShell script. The CSV “source of truth” holds the same information I have above in the YAML file: server name, shutdown group, powerup group, and vcenter host. It’s effective, but very difficult to maintain or extend with new functionality without a complete rewrite.
So the steps and their order are predetermined by whatever the source of truth is: CSV, YAML, etc.
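To make that concrete, here is one possible sketch of an ordered shutdown driven entirely by the data. It assumes `server_groups` has already been loaded as a dictionary keyed by VM name, that each entry carries `down_group` plus a `vcenter` key (an assumed name), and that powering guests off through vCenter with the `community.vmware` collection is acceptable in place of `reboot`/`win_reboot`. The credential variables are placeholders.

```yaml
- name: Shut down servers in down_group order (sketch)
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Shut down each guest and wait for power-off before the next one
      community.vmware.vmware_guest_powerstate:
        hostname: "{{ item.value.vcenter }}"   # assumed key holding the vCenter host
        username: "{{ vcenter_username }}"     # placeholder credential variables
        password: "{{ vcenter_password }}"
        name: "{{ item.key }}"                 # assumes the dict key is the VM name
        state: shutdown-guest
        state_change_timeout: 600              # wait up to 600 s for the powered-off state
      # Items are processed one at a time, in ascending down_group order
      loop: "{{ server_groups | dict2items | sort(attribute='value.down_group') }}"
      loop_control:
        label: "{{ item.key }}"
```

Note that this handles servers strictly one after another, even within the same group. Sorting on `value.up_group` with `state: powered-on` would give the matching startup pass.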
I hope that gives some more context. I’m happy to answer any additional questions.