variable precedence seems hopelessly broken

I really like Ansible and have built a large infrastructure around it, but I'm finding it untrustworthy to the point of being unusable.

In the last 9 months, I've reported 4 variable precedence bugs:

https://github.com/ansible/ansible/issues?utf8=✓&q=is%3Aissue+author%3Adstillman+

The first two were marked as P1 and fixed, and the third was confirmed as P2 in December but remains open. The last one, which I reported today, occurs in 1.9.3 but is fixed on devel for 2.0. Yet devel appears to break one of the P1 bugs (#9498) again, even though I included a test case with the original report (as I've done for all of them). The other P1 bug also disappeared and reappeared a couple of times during 1.8 development as other variable bugs were fixed, which seems to be the general pattern for these bugs.

If it's not clear, these are incredibly dangerous bugs in production environments, because they can cause services to be silently rolled out in the wrong location or with the wrong configuration. (I noticed this because a service had been deployed to a directory with the name of another service, resulting in two copies of the service trying to run, though fortunately this was on a dev machine.) The safest solution I've found is to configure different roles on the systems separately using tags, but that somewhat defeats the purpose of a central configuration management tool. It also doesn't avoid the P1 bug that's broken again on devel, so I guess I should say the safest solution is not to use variables at all.

It's possible I'm using variables somewhat differently than most people using Ansible — the bugs I've reported all depend on include_vars within a role, which I use extensively — but there seem to be quite a few reports of variable bugs, and none of the issues I've reported have been marked as invalid.

I don't want to abandon Ansible, but I can't keep using it if I can't trust it to deploy services correctly. I also shouldn't have to keep my own set of tests that I run whenever I try a new version just to make sure dangerous bugs that I've reported previously — with those same tests — haven't regressed.

If the current variable precedence system is salvageable (and I'm not convinced it is or should be), it seems like many more integration test cases are needed, all run in separate processes and — needless to say — with new ones added whenever variable bugs are found.

(I think a contributing factor here may actually be the layout of the integration test suite. Most of the test cases I've submitted require multiple roles, but adding those to the current suite would get messy quickly, since there's just a single root directory and single roles directory for all integration tests. I think it'd be much cleaner to use a subdirectory for each integration test, with a top-level playbook in each, to keep all test files grouped together and avoid accidental interactions with other files. That would also make it much simpler to add people's test contributions.)
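As a rough illustration of the per-test layout suggested above, here is a minimal Python sketch of a runner that treats every subdirectory containing a top-level playbook as one test case and runs each in its own process. The file name `playbook.yml`, the per-test `inventory` file, and the helper functions are hypothetical conventions chosen for the example; none of this is part of Ansible's actual test suite.

```python
import subprocess
from pathlib import Path

# Hypothetical convention: each test case directory holds one top-level playbook.
PLAYBOOK_NAME = "playbook.yml"


def discover_test_cases(root):
    """Return every immediate subdirectory of root that contains a top-level playbook."""
    root = Path(root)
    return sorted(
        d for d in root.iterdir()
        if d.is_dir() and (d / PLAYBOOK_NAME).is_file()
    )


def build_command():
    """Command for one test case; assumes an 'inventory' file inside the test directory."""
    return ["ansible-playbook", "-i", "inventory", PLAYBOOK_NAME]


def run_all(root, runner=subprocess.run):
    """Run each test case in a separate process, using its own directory as cwd.

    Returns a dict mapping test directory name to True (exit code 0) or False.
    """
    results = {}
    for test_dir in discover_test_cases(root):
        completed = runner(build_command(), cwd=str(test_dir))
        results[test_dir.name] = completed.returncode == 0
    return results
```

Because each case runs with its own directory as the working directory, its roles and vars files cannot accidentally interact with those of another test, and adding a contributed test case amounts to dropping in one more directory.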

Anyway, I hope something can be done. As it stands now, I'm nervous every time Ansible runs.

I haven't received a response on this, but since I posted it, the various variable precedence bugs I've reported have continued to reappear and disappear through successive commits. Usually when one is fixed, another is broken. One issue [1] was even closed as a "misunderstanding" (despite the supposedly correct behavior making very little sense) before being acknowledged as a bug (and then being fixed, and then regressing again), suggesting that Ansible developers aren't even clear on how variables _should_ work. Currently, a number of the bugs (including a P1 bug that was previously fixed [2]) are present in devel.

I've provided simple test cases with every bug report and suggested a reorganization of the test suite that would allow them to be easily incorporated. I don't see any point in continuing to report these — or, honestly, in continuing to try to use Ansible — if no effort is made to ensure that these dangerous bugs stay fixed.

[1] https://github.com/ansible/ansible/issues/11996
[2] https://github.com/ansible/ansible/issues/9497

We definitely recognize the concerns about variable precedence. Most
users don't have much to worry about, but some users with very complex
playbook structures can run into challenging corner cases with
variable precedence -- corner cases that have become more acute with
Ansible's rapid and somewhat organic growth.

One of the main goals of Ansible 2.0 is to solve this exact class of
problem. In the particular case of variable precedence, we're pursuing
the following design goals:

1. Limit variable precedence handling to a single section of the
codebase. That makes it harder for weird assignment changes to sneak
in. You can find that code in the VariableManager class [1].

2. Ensure that variable precedence is documented in great detail. In
the past, some details of precedence have been less clear than we
would have liked, so we're firming that up [2]. We will continue to
iterate over this definition until we're satisfied that it's correct,
and then we will document it officially.

3. Ensure that variable precedence is rigorously tested. Remember that
2.0 is still in alpha, and regressions should be temporary, so long as
you help us by reporting them. We do have some unit and integration
tests, which we are working on cleaning up, and we welcome new tests
for cases the existing ones don't cover.

4. Ensure compatibility moving forward. Once we have proper
documentation and testing for variable precedence rules, we will be
able to introduce changes with a strong guarantee that those changes
will not break compatibility.
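
To make goal 1 concrete, here is a small Python sketch of what resolving precedence in one place can look like: every variable source is merged by a single function in a fixed order. The layer names and their ordering are assumptions made up for this example; they are not Ansible's documented precedence, and this is not the VariableManager implementation.

```python
# Illustrative layer names, lowest precedence first. This ordering is an
# assumption for the example, NOT Ansible's documented precedence.
PRECEDENCE = ("role_defaults", "inventory_vars", "play_vars", "include_vars", "extra_vars")


def resolve_vars(sources):
    """Merge variable sources into one dict; later layers in PRECEDENCE win."""
    unknown = set(sources) - set(PRECEDENCE)
    if unknown:
        raise ValueError(f"unknown variable source(s): {sorted(unknown)}")
    merged = {}
    for layer in PRECEDENCE:
        merged.update(sources.get(layer, {}))
    return merged


example = resolve_vars({
    "role_defaults": {"deploy_dir": "/srv/default", "port": 8080},
    "include_vars": {"deploy_dir": "/srv/service_a"},
})
print(example)  # {'deploy_dir': '/srv/service_a', 'port': 8080}
```

With a single merge point like this, a regression test only needs to assert on the output of one function for a given set of sources, which is the property the goals above are aiming for.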

There is one problem that we will not be able to solve for everyone:
in past versions of Ansible, variable precedence has been subtly
different in some cases from release to release. For the vast majority
of users, those differences won't be a problem -- but in setting the
proper precedence behavior, once and for all, we may end up biting
users who settled on a previous version of Ansible with different
variable precedence behaviors. This is why documenting the proper
behavior, and sticking to it, is such a high priority for us -- we
want to ensure that anyone who has to pay a cost for fixing these
issues will only have to pay that cost once.

If you see variable precedence breakage in the 2.0 codebase, please
report it on GitHub! We can't guarantee that we will have a fix for
your particular breakage in 2.0, but we can guarantee that we will be
able to tell you why it's broken, and what the proper behavior will be
moving forward.

We know that our continued success depends upon providing a dependable
and transparent codebase that can be useful for everyone from the
novice to the power user. That's what the push to Ansible 2.0 is all
about. Thanks for sticking with us as we cover the last remaining
ground.

[1] https://github.com/ansible/ansible/blob/aeff960d028644c19dd845e51ced14a9bd3709c5/lib/ansible/vars/__init__.py#L46
[2] https://github.com/bcoca/ansible/commit/06969d92b6c9e429defa9295ce78487df8a7d084

--g

Thanks for the response, Greg. This is the thing, though — I've provided those tests, repeatedly, and they've never been used, even recently with 2.0. Developers have even asked me on GitHub whether bugs were still present, which is kind of absurd when I've provided simple test cases. I shouldn't have to run my own private test cases every time I try a new version so that I can let the developers know if bugs have reappeared.

Here's an example of the kind of simple test cases I've provided:

https://github.com/ansible/ansible/issues/9498

That never should have been able to regress without developers noticing (and given that the fix for the regression, two weeks ago, doesn't appear to have been accompanied by any tests, I'm not particularly confident that it won't again).

The existing variable precedence tests are clearly inadequate, and I don't think there's any way they can be sufficient in the existing test layout, with one variable precedence playbook and a handful of roles mixed in haphazardly with all the other test files. I'd appreciate it if someone could comment on my suggestion for reorganizing the test suite into separate directories, with individual test cases and all their related files grouped together (which is how I'm testing these locally, after all). It would make it trivial to integrate all the test cases I've provided (I could even provide pull requests) and would help with keeping track of separate issues and documenting the expected behavior. I don't really see another solution here.

- Dan