The following playbook is what I use at a few customers; in one case we patch about 2700 servers each month. Before we could do this on a monthly basis, we had quite a bit to clean up and standardize.
The very first time, we used smaller batches so that we could make sure all init-scripts were present and communicate any problems to the various (internal) customers. Once all systems are aligned to the same baseline, things become a lot easier: the set of updates is very tangible, and we do batches of 50 systems and execute multiple runs in parallel.
In summary, we do the following (a rough sketch of these steps follows the list):
- Check if redhat-lsb is installed
- Clean up stale repository metadata (optional, we needed to remove leftover Satellite channel data)
- Check free space in /var/cache/yum and /usr (optional; it prevents failures that would otherwise require logging in to find out what went wrong)
- Update all packages using yum
- Propose to reboot the systems that have had updates
- Check that the system comes back correctly (we also plan to check the uptime; a pull request for that is in the queue)
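
To make the flow concrete, here is a minimal sketch of the steps above as an Ansible play. It is not our actual playbook: the module choices, the 500 MB threshold, the timeouts and the batch size are illustrative, and it assumes /var and /usr are separate mount points.

```yaml
# Illustrative sketch only; values and module choices are assumptions.
- hosts: all
  become: yes
  serial: 50          # patch in batches of 50 hosts per run
  tasks:
    - name: Make sure redhat-lsb is installed
      yum:
        name: redhat-lsb
        state: present

    - name: Clean up stale repository metadata
      command: yum clean all

    # Assumes /var and /usr are separate mount points; adjust to your layout.
    - name: Refuse to continue when /var or /usr is low on space
      assert:
        that:
          - item.size_available > 500 * 1024 * 1024   # example threshold
        msg: "Not enough free space on {{ item.mount }}"
      when: item.mount in ['/var', '/usr']
      with_items: "{{ ansible_mounts }}"

    - name: Apply all available updates
      yum:
        name: '*'
        state: latest
      register: yum_result

    - name: Reboot the host if anything was updated
      reboot:
        reboot_timeout: 600
      when: yum_result is changed

    - name: Simple "did it come back" check
      command: uptime
      changed_when: false
```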
All our systems are connected to the same frozen channels in a Satellite, which makes them a lot easier to manage. Every month we start by updating the frozen channel with the latest updates; we then test the process and the updates on about 150 internal systems (some of these are crucial infrastructure, so they get the security updates earlier).
The next day we have a meeting with Change Management, Security Governance and Linux Operations, and we go through the list of updates (we have a custom tool that compiles the list of updates and shows how each update is distributed over our 2700 Linux servers). Based on this list and the discussion, we decide whether patching is useful and whether rebooting is necessary.
We then spread the patching of all systems over 4 days (2 non-prod days in the first week and 2 prod days in the second week), in about 12 different timeframes. This ensures that systems in a complex setup are not patched/rebooted at the same time, and in case of issues it limits the impact and gives us enough time to troubleshoot and resolve them. Each "wave" takes about 20 minutes, so in essence we patch 2700 servers in roughly 5 hours.
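
For illustration only, the waves can be modelled as inventory groups, so each timeframe is just a separate run limited to one group (for example with ansible-playbook's --limit option); all group and host names below are made up.

```yaml
# Hypothetical inventory layout: one group per wave/timeframe.
all:
  children:
    wave_nonprod_01:
      hosts:
        app-acc-01.example.com:
        db-acc-01.example.com:
    wave_prod_01:
      hosts:
        app-prd-01.example.com:
        db-prd-01.example.com:
```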
It is essential that all services are properly handled by init-scripts, that clean shutdowns work well, and that everything starts correctly on boot. For MySQL, for example, this may mean tuning the timeout of the init-script.
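
As a hedged illustration (the service names are placeholders), a task like the one below can run after the reboot to make sure the services the init-scripts are supposed to bring up are actually running, starting them if they are not:

```yaml
# Placeholder service names: use whatever the init-scripts on the host
# are supposed to bring up; state=started starts a service that is not
# already running.
- name: Make sure critical services came back after the reboot
  service:
    name: "{{ item }}"
    state: started
  with_items:
    - mysqld
    - httpd
```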
It is also essential to get your customers involved in the process and to give them control over which systems are part of which wave, whether they handle the reboots themselves, and so on. The key is not to allow any exceptions, but to look for solutions together. We had very little opposition, and once we had proven that the mechanism worked, only small changes were made in later iterations.
We plan to integrate our firmware-patching playbook into this one as well, to run twice a year. That coincides with minor OS updates, and in those cases patching takes longer than 20 minutes anyway.