I have a bunch of scripts I inherited that do a variety of things, primarily for setting up new environments in AWS.
Under v2.4, everything worked just fine. As of v2.5, modules are applied to the currently running host instead of the target instance. Needless to say, this caused a big, confusing mess until we figured out what had happened and forced everything to stay at v2.4.
For example, we have a playbook like:

- hosts: localhost
  gather_facts: no
  roles:
    - { role: create_web_tier, tags: 'web' }
The role includes a 'create_ec2_instance' task that launches the instance and then applies a number of changes to it, such as yum updates, etc.
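For context, the task in question looks roughly like this (a sketch only; the AMI, key name, security group, and region are made-up placeholders, not the original script):

```yaml
# Illustrative sketch of the kind of task the role contains.
# All instance details here are hypothetical.
- name: create_ec2_instance
  ec2:
    key_name: deploy-key            # hypothetical key pair
    instance_type: t2.medium
    image: ami-0123456789abcdef0    # hypothetical AMI id
    group: web-tier                 # hypothetical security group
    region: us-east-1
    wait: yes
    count: 1
  register: ec2_result

- name: Add the new instance to an in-memory group for later plays
  add_host:
    hostname: "{{ item.public_ip }}"
    groups: launched_web
  with_items: "{{ ec2_result.instances }}"
```

Subsequent plays then target the launched_web group to do the yum updates and other configuration, which is exactly the part that started landing on the wrong host.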
But as soon as Ansible is updated to 2.5+, this behaviour breaks: all of those server settings get applied to the local host running the playbook instead.
So far our only solution has been to block updates and to keep Ansible at 2.4, which is far from ideal.
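Concretely, "blocking updates" here means pinning the package on the control node. A minimal sketch, assuming a RHEL/CentOS control node where /etc/yum.conf is the main yum configuration (yum-versionlock is another common way to do the same thing):

```yaml
# Sketch: stop yum from ever upgrading the ansible package on the
# control node, so a routine 'yum update' cannot pull in 2.5+.
- hosts: localhost
  gather_facts: no
  tasks:
    - name: Exclude ansible from all future yum updates
      lineinfile:
        path: /etc/yum.conf
        regexp: '^exclude='
        line: 'exclude=ansible*'
      become: yes
```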
Does anyone have any insight as to why Ansible’s behaviour would change so fundamentally? This is a catastrophic disruption that has seriously shaken our confidence in Ansible.
I’ll skip the story since it’s long, convoluted, and frustrating, but it boils down to the fact that the ‘upgrade’ was unintentional and cost us three weeks of head scratching.
Thanks for the info. Now that I know how capricious Ansible is, we will need to reconsider how it is used, and how heavily. This kind of fundamental instability makes Ansible a very high-risk product to use.
In three weeks of head-scratching you didn’t realise the version of Ansible had changed, meaning you allow uncontrolled upgrades to your production systems?
Or in three weeks of head-scratching, knowing that Ansible had upgraded, it didn’t occur to you to read the release notes?
Your issue was in “but it boils down to the fact that the ‘upgrade’ was unintentional and cost us three weeks of head scratching.” Ansible’s docs, changelogs, and release announcements describe all the changes that happened. If it’s something that happened in your environment, then it’s not really Ansible’s fault.
The issue was that the person who originally wrote the scripts quit, and we were parachuted in as emergency resources to try to get everything going. Ansible just happened to put out the 2.5 update in the same time frame, so when we did ‘yum install ansible’, we had no way of knowing that the version we were installing was different from the one he had been using.
But you’re right, it was completely our fault. We should have known (apparently by osmosis) that Ansible changes its APIs capriciously with every single version, that backward compatibility is for losers, and that all Ansible packages are treated as minor in-place updates, so if you haven’t locked down the version ahead of time, a routine yum update will break your entire setup.
I mean, heaven forbid that someone might expect a critical operations tool like Ansible to have a modicum of stability between minor releases.
I think you misunderstood our comment and are taking it a little far. There are going to be changes to software; all software changes. But, as mentioned before, we have docs and notices for this exact reason.
My point is that that’s not good enough. It’s not acceptable to make wide-ranging fundamental changes to the system, and then say “It’s in the docs. You need to read the docs.”
If Ansible is going to make changes so significant that they are guaranteed to break existing playbooks, then it cannot follow the traditional upgrade path where you simply bump the version number of the rpm/deb/whatever and toss it into the pile for upgrading. Each release needs its own versioning as part of the package name (e.g. ansible2.5, ansible2.6, etc.) so that people can be assured that when they do a yum update, their stuff will still work afterwards. Apache did this when it went from 2.2 to 2.4: people had plenty of issues during the transition, but surprise upgrades that broke their infrastructure weren’t among them.
I’ve been doing a lot of reading, and have come across a very large number of people who share the same frustrations. Ansible is supposed to make sysadmins’ lives easier. Instead, people have had to resort to version-locking Ansible (as we have done), or to implementing entire continuous-delivery paths just for Ansible. Or they’ve given up on it entirely and reverted to bash scripts, because bash scripts don’t break every 3–6 months.
IMO that is absolutely absurd. Infrastructure is not software. It’s infrastructure. It has to be solid, because everything else depends on it. If Ansible is to be a key element in managing infrastructure, it too has to be solid. Why should I use something that adds to my workload instead of reducing it?
It’s almost fundamental that you treat all of your deployment and systems tools the same way: they are flaky and will introduce drastic changes that will almost always cause system outages on upgrade (much like public cloud infrastructure, which fails a LOT). Every tool, playbook and inventory (and Ansible role), API reference, software package, EVERYTHING needs to be versioned. Blind updates are going to wipe you out, and if other people (and business revenue) rely on the system, it’s even more critical to lock everything down to a version.
You will experience this time and time again with various tools and OSes. A lot of Ansible users already have, and this is one reason they use Ansible in the first place: to enforce the Ansible version itself.
This is along the lines of what you can implement to ensure Ansible is at a specific version. You can use this outage as a reason to implement such a check across all of your playbooks. It’s a hard lesson to learn, and a lot of other infrastructure tools take the same approach of enforcing a version, especially those still considered beta, e.g. Terraform.
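For example, a pre-flight check along these lines (a sketch, not taken from any particular playbook) fails fast if the control node isn't on the version the playbooks were written against, using Ansible's built-in ansible_version magic variable:

```yaml
# Sketch: abort immediately unless the control node is running the
# Ansible 2.4.x these playbooks were tested with.
- hosts: localhost
  gather_facts: no
  tasks:
    - name: Enforce the tested Ansible version
      assert:
        that:
          - ansible_version.major == 2
          - ansible_version.minor == 4
        msg: "Expected Ansible 2.4.x, found {{ ansible_version.full }}"
```

Putting this play first in every top-level playbook turns a silent behaviour change into an immediate, explicit failure.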
Part of the problem is that important new functionality and modules come out, but aren’t backported, so if you want to be able to do everything you need to do, you either need to do extra work to backport it yourself (when possible), or deal with the upgrades. It’d be nice if non-breaking updates were brought to stable long-term releases by default.
I don’t think general flakiness is a good excuse for the instability. Ansible could be the one tool that doesn’t cause problems in that case; it’d be a sign of the tool’s maturity.
I don’t mind submitting a proposal and advocating for it, where do I sign?