Dynamic/complex inventory - Specific reboot order

So this is a pretty specific case but I am pretty lost on the best way to approach it. Hopefully the hive mind can give me some guidance here.

I work for an MSP and each customer has specific reboot order requirements due to the specific applications we host for them.

For instance;

I have the above .yml variable file that defines the groups for each customer. It then applies those groups as tags on each server in vCenter.

This allows servers to be placed in keyed groups using the VMware dynamic inventory plugin. So that’s all working fantastic.

My problem now is finding the best way to make sure the tasks that need to run on the servers (for example, a reboot) are executed precisely in the order of those groups. Of course, I could specify the order manually, but at this scale (think thousands of servers with dozens of groups) that is not practical.

Does anyone have recommendations on the best way to approach this? Ideally it would scale to multiple customers with different grouping requirements without my having to specify the order manually.

Thanks in advance and let me know if you need any additional information.

First, let’s get some text instead of pixels. You’ve got

Hey Todd,

Thanks for the reply. To answer your questions:

  1. I went with that particular structure mostly because of some previous attempts at this setup; the plan would be one customer per file. The customer_name portion could certainly be omitted.
  2. Each *_server entry was intended to represent a single server in this example, not a group.
  3. The other reason I included the customer_name portion at the top was as a simple way to differentiate which customer the server belongs to (since I am using dynamic inventory). I also have a tag and matching keyed group, so I could certainly use that instead.

So based on your recommendations and to avoid confusion:

example.yml

database_server_01:
  down_group: 3
  up_group: 1
  vcenter: vcenter.local
  exclude: false
database_server_02:
  down_group: 3
  up_group: 2
  vcenter: vcenter.local
  exclude: false
file_server_01:
  down_group: 2
  up_group: 3
  vcenter: vcenter.local
  exclude: false
print_server_01:
  down_group: 1
  up_group: 2
  vcenter: vcenter.local
  exclude: false

The order is unfortunately not as simple as groups of servers; rather, each server individually has a specific “power up” or “power down” group. This is an unfortunate holdover from an archaic piece of software that also prevents me from simply using a reboot or win_reboot command to handle this more cleanly.

That clarifies things a bit, for me at least. Before we get back to the original question, where’s the Source of Truth? Are these YAML files generated from the tags in vcenter, or are the vcenter tags set based on these files? Or [gulp] are they kept in sync manually?

Help me understand your original question, which was, “My problem now is the best way to go about making sure the tasks that need to be run on the servers (for example, a reboot), are executed precisely in the order of that group.” Can you expand on that scenario? For example, if instead of having Ansible you had an admin sitting by a phone. The phone rings and customer “example” requests a reboot. What exactly does that request include? Based on that request and access to the file(s) like the one above, how does your admin decide what steps to take and in what order? Don’t hesitate to include obvious details; they’re only obvious to you! :slight_smile: Explain it like you would to a new intern who’s only here because his Play-Doh dried out. I can’t automate what I can’t explain, and right now I can’t explain your process. Feel free to throw in other use cases / scenarios. I’ve got scroll bars and I’m not afraid to use them.

I would create one vars file per customer and use a variable that sources the customer’s vars file at run time. You also could have a folder per customer that holds all customer specific items.

% ansible-playbook -e customer_name='customer_A'

In the playbook, source customer_A’s vars file.

vars:
  server_groups: "{{ lookup('file', 'path/to/vars/' + customer_name + '.yml') | from_yaml }}"

OR

vars:
  server_groups: "{{ lookup('file', 'path/to/vars/' + customer_name + '/server_groups.yml') | from_yaml }}"

This loads your customer-specific YAML file as a dictionary into server_groups. It gives you enormous flexibility to add/remove customers over time without changing your playbook.
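As an alternative sketch of the same idea, the file can also be loaded with `ansible.builtin.include_vars`, which reads a YAML file into a named variable. The play below is illustrative only: the `path/to/vars/` location mirrors the placeholder path above, and the play assumes `customer_name` is passed with `-e` on the command line.

```yaml
# Sketch: load one customer's file into server_groups and print it.
# Assumes customer_name is supplied via -e and that the file lives at
# path/to/vars/<customer_name>.yml (placeholder path from this thread).
- name: Load per-customer reboot order
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Read the customer's server definitions into server_groups
      ansible.builtin.include_vars:
        file: "path/to/vars/{{ customer_name }}.yml"
        name: server_groups

    - name: Show what was loaded
      ansible.builtin.debug:
        var: server_groups
```

Either approach leaves the playbook itself unchanged when a customer is added; only a new vars file is needed.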

Adding to the playbook I already showed …, you can now sort this list by the ‘down_group’ (halt) or ‘up_group’ (boot) attribute to get your ordered lists.

  - debug: msg="{{ server_order | sort(attribute='halt') }}"
  - debug: msg="{{ server_order | sort(attribute='boot') }}"

I think that gets you to where you want.
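For readers working directly from the example.yml layout earlier in the thread, here is a minimal, self-contained sketch of the same sort. It uses the `down_group` and `up_group` field names from that file rather than the `halt` and `boot` attributes, and it inlines a trimmed copy of the data so the play runs on its own. `dict2items` turns the dictionary into a list of `key`/`value` pairs that `sort` can order by a nested attribute.

```yaml
# Sketch: order server names by down_group and by up_group.
# The inline server_groups data is a trimmed copy of example.yml.
- name: Show servers in shutdown and power-up order
  hosts: localhost
  gather_facts: false
  vars:
    server_groups:
      database_server_01: {down_group: 3, up_group: 1}
      file_server_01: {down_group: 2, up_group: 3}
      print_server_01: {down_group: 1, up_group: 2}
  tasks:
    - name: Server names ordered by down_group (lowest first)
      ansible.builtin.debug:
        msg: "{{ server_groups | dict2items | sort(attribute='value.down_group') | map(attribute='key') | list }}"

    - name: Server names ordered by up_group (lowest first)
      ansible.builtin.debug:
        msg: "{{ server_groups | dict2items | sort(attribute='value.up_group') | map(attribute='key') | list }}"
```

With this data the first task lists print_server_01, file_server_01, database_server_01, and the second lists database_server_01, print_server_01, file_server_01. Whether the lowest number should go first is an assumption; add `reverse=true` to `sort` if your groups count the other way.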

Just to reinforce what I already provided …

… and you can add the exclude attribute and filter the list so that only items where exclude is false remain …
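A rough sketch of that filter, assuming `exclude: true` marks a server that should be left out of the run and that every entry defines `exclude` (a missing key would likely raise an undefined-variable error). The `exclude: true` on file_server_01 is hypothetical and only there to show the effect.

```yaml
# Sketch: drop excluded servers, then order the rest by down_group.
# Assumes exclude: true means "skip this server".
- name: Skip excluded servers before ordering
  hosts: localhost
  gather_facts: false
  vars:
    server_groups:
      database_server_01: {down_group: 3, up_group: 1, exclude: false}
      file_server_01: {down_group: 2, up_group: 3, exclude: true}  # hypothetical
      print_server_01: {down_group: 1, up_group: 2, exclude: false}
  tasks:
    - name: Shutdown order without excluded servers
      ansible.builtin.debug:
        msg: >-
          {{ server_groups | dict2items
             | rejectattr('value.exclude')
             | sort(attribute='value.down_group')
             | map(attribute='key') | list }}
```

Here `rejectattr('value.exclude')` removes every item whose `exclude` value is truthy, so the result contains print_server_01 followed by database_server_01.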

I managed to generate the compressed list in a single task …

You are a lifesaver. Thank you so much! This gives me a fantastic place to start.

So right now the yaml files are serving as the source of truth.

  1. Create YAML file with information per customer.
  2. Run Terraform to create the categories/tags in vCenter.
  3. Run Ansible community.vmware.vmware_tag_manager to apply the tags to the appropriate servers.

  - name: Create tags
    community.general.terraform:
      project_path: …/…/scripts/terraform
    register: tag_output

  - name: Terraform output
    ansible.builtin.debug:
      msg: "{{ tag_output }}"

  - name: Terraform output
    ansible.builtin.debug:
      msg: "{{ tag_output.stderr_lines }}"
    when: tag_output.failed

  - name: Add tags
    community.vmware.vmware_tag_manager:
      # omitted details, but applies tags based on requirements

So that’s what that looks like as far as tag management. If you think there’s a better way I’m absolutely open to suggestions.

The reboot order of the servers is dictated by the vendor and the unique combination of third-party applications each customer runs. For instance, one customer may need a shutdown order of file servers, database servers, batch servers, management servers, web servers; another may need a completely different order.

That’s why I chose YAML as the source of truth: the order can change as the vendor updates the application, so it needs to be fairly flexible. As far as the customer is concerned, this is just a generic “downtime” for patching, application updates, or whatever is needed.

So to use your phone analogy:

riiiiiinnggggg
Customer: Hi, application needs to be updated. Can we schedule a downtime for this date and time?
Admin: Why certainly.
–Date and Time Arrives–
Vendor will perform work needed while users are locked out of the system.
Admin will then shut down the servers (perhaps applying patches at the same time) and then bring the servers back up.
Vendor will then unlock the application to end users and go on their way.

Right now, the reboot order is managed via a CSV spreadsheet and a nightmare-ish PowerShell script. The CSV “source of truth” holds the same information I have above in the YAML file: server name, shutdown group, powerup group, and vcenter host. Effective but very difficult to maintain or add new functionality without a complete rewrite.

So the steps and their order are predetermined by whatever the source of truth is: CSV, YAML, etc.
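To make that concrete, here is a rough sketch of how the shutdown and power-up passes might be driven straight from the per-customer YAML file with `community.vmware.vmware_guest_powerstate`. Several things here are assumptions rather than anything confirmed in this thread: that each key in the file matches the VM name in vCenter, that lower group numbers go first, that the `exclude` flag can be ignored for now, and that credentials live in hypothetical `vcenter_username` and `vcenter_password` variables.

```yaml
# Sketch: shut guests down in down_group order, then power them on in
# up_group order. Assumes VM names equal the keys in the vars file and
# that lower group numbers run first.
- name: Ordered shutdown and power-up for one customer
  hosts: localhost
  gather_facts: false
  vars:
    servers: "{{ lookup('file', 'path/to/vars/' + customer_name + '.yml') | from_yaml | dict2items }}"
  tasks:
    - name: Shut down guests in ascending down_group order
      community.vmware.vmware_guest_powerstate:
        hostname: "{{ item.value.vcenter }}"
        username: "{{ vcenter_username }}"  # hypothetical credential variable
        password: "{{ vcenter_password }}"  # hypothetical credential variable
        name: "{{ item.key }}"
        state: shutdown-guest
        state_change_timeout: 600  # wait for the guest to power off before moving on
      loop: "{{ servers | sort(attribute='value.down_group') }}"
      loop_control:
        label: "{{ item.key }}"

    - name: Power guests on in ascending up_group order
      community.vmware.vmware_guest_powerstate:
        hostname: "{{ item.value.vcenter }}"
        username: "{{ vcenter_username }}"
        password: "{{ vcenter_password }}"
        name: "{{ item.key }}"
        state: powered-on
      loop: "{{ servers | sort(attribute='value.up_group') }}"
      loop_control:
        label: "{{ item.key }}"
```

Because a loop runs one item at a time, servers that share a group number are also handled serially here, which is slower than strictly necessary but keeps the order exact. Note that `powered-on` only confirms the VM is powered on, not that its services are ready, so a real run would probably need a wait or health check between power-up groups.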

I hope that gives some more context and happy to answer any additional questions.