CfgMgmtCamp 2026 discussion (8/12): Instant Ansible-test target updates without announcements

As a collection developer/maintainer

Instant Ansible-test target updates without announcements

Latest developments

None that I know of. This wasn’t discussed at CfgMgmtCamp 2026.

Original text

When you use ansible-test integration for integration tests, as many collections do,
you often run the tests in container images or VMs (VMs are limited to some collections using AZP, since they require special access). If you stick to the default container image, everything is fine.
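For context, a few illustrative invocations show the difference between the default container and OS-specific targets. The target names here are examples only and rotate over time, which is exactly the problem discussed below:

```shell
# Illustrative ansible-test invocations (run from a collection checkout).
# Target names are examples and change between ansible-core releases.
ansible-test integration --docker default    # default container image: stable across releases
ansible-test integration --docker fedora43   # OS-specific container image
ansible-test integration --remote rhel/10.1  # VM target (requires special access, e.g. AZP)
```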

But if you’re using OS-specific container images, or VMs, you’ll notice that from time to time these are replaced. Until almost two years ago (announcement, discussion), changes were announced, and for a certain amount of time both the old and the new container image/VM could be used. This gave collections a few days to add the new images to their CI matrix and remove the old ones.

But now this no longer happens. If images are replaced, there is no announcement, and the old image is removed at the same time the new one is added. So from one moment to the next your CI matrix turns red, and it stays red until you update it. Adding new targets can take some time, especially if you maintain collections in your spare time and can’t allocate many hours in a row to it.

The main argument has been that this mainly affects devel, and you shouldn’t test against that anyway if you don’t like random surprises. You could use milestone instead. But that has similar problems when milestone is bumped: you know the dates in advance, but then suddenly your CI is red and you have to deal with even more changes at the same time.

And that’s not all. The VM images are sometimes also changed for stable branches, in the same manner: no announcement, no overlap between availability of the old and new image. This happened both for RHEL 9.x → RHEL 9.7 and RHEL 10.0 → RHEL 10.1 in the first two weeks of this year.

As an example, when I woke up on January 8th, I noticed the following replacements since the previous evening:

  • devel branch:

    • fedora42 VM and container image → fedora43
    • rhel/10.0 VM → rhel/10.1
    • freebsd/13.5 VM → freebsd/15.0
    • alpine3.22 VM and container image → alpine3.23
  • stable-2.20 branch:

    • rhel/10.0 VM → rhel/10.1
  • stable-2.19 branch:

    • rhel/10.0 VM → rhel/10.1
  • stable-2.18 branch:

    • rhel/10.0 VM → rhel/10.1

(The same happened for stable-2.16, but I don’t have that one in CI anymore with VMs.)

It would be great if these changes could be announced and done in two steps (first step: add the new platform and use it; second step: remove the old platform). This really isn’t much extra work, and it greatly improves the lives of collection maintainers who use these images in CI.


I don’t see us moving back to a two-step process. I do understand the point that it is inconvenient for those consuming them, but a two-step process in which we add the new image, wait some amount of time, and then remove the old one, along with an announcement, has a few issues:

  1. It is disruptive to the core team actually getting work accomplished
  2. There is a high chance of failing to complete the process
  3. There are also simply times when the image being removed has become non-functional and has to be removed immediately

However, maybe there is an opportunity for adding image “aliases” like rhel/10 that are not tied to a specific dot release.

Seriously, there are ways to prevent the process from failing to complete. This is basic project management: using issues, labels, and shared calendars (when the issue tracker lacks scheduling support) to track things. So I really don’t see 2 as a valid argument against this.

Regarding 3, if a platform completely stops working, removing it right away isn’t a problem. In my experience, though, this is almost never the case. So it shouldn’t count as an argument against doing this for other platforms.

For 1, keeping an entry for 1–2 weeks in a text file that does not affect core at all doesn’t really feel disruptive to work getting done. I don’t know how much work is involved in the internal testing infrastructure (to my knowledge it’s private and not visible). In most cases, my guess (from similar things in other projects and at $JOB) is that there is no real extra work in simply keeping the old platforms around a bit longer without touching them. (If they break in that time, well, bad luck for the users of that platform, but that really shouldn’t affect the core team.)

It feels to me that these are just excuses not to invest a small amount of time to make the lives of folks outside the core team easier. Maybe that isn’t true, but I’m mainly describing how this feels. (And if it were just this single issue, I think it would be less of a problem. But there are other things as well that contribute to this feeling.)

I definitely see both sides of this one. TL;DR: there might be a middle ground that could lean on things we’re already doing…

The sheer number of tiny little tasks that go into a given major core release is probably not obvious to the casual observer; we’ve tried different ways over the years to address them with some combination of:

  • Eliminate (yay!)
  • Make declarative/single source of truth
  • Make atomic
  • Provide CI accountability

Those last two are important for keeping things from falling through the cracks. A great example of this is the relatively recent work that went into new sanity tests for deprecations and standardized deferred deprecations.

Before the recent-ish changes, they were horribly messy: deprecated features would often linger for several major releases beyond their scheduled demise because they were non-standard or someone forgot to go grep for them. Hunks of dead code got left lying around, or we’d miss starting a deprecation clock on something at the earliest opportunity. Deprecations are intrinsically non-atomic, since resolving one occurs multiple releases later, but we were able to compensate for that by adding CI accountability, which has worked out fabulously.

The previous way we were doing CI target removals was non-atomic for the convenience of collection maintainers, and IIRC there were several times where we forgot to close the loop on the removal. We switched to atomic removal to avoid that, which, agreed, sucks for collection maintainers when things suddenly start going red with no other changes. Our (perhaps flawed) reasoning at the time was that most collection maintainers were unlikely to be watching for the warnings anyway, so even with a warning period, most collections would still end up in the same state.

Going back to non-atomic target removals and using something like out-of-band GH issues/labels for the followup kinda works against the “declarative/single-source-of-truth” element.

Maybe there’s a lightweight (waves hands) way to add deprecation metadata to a test target that would both issue runtime warnings when used, and be consumed by a simple sanity test that would automatically start failing after, e.g., beta1. We lose atomicity, but at least it’s back in our faces when it’s time to actually zap the entries. This could probably be done consistently with the way we handle deprecations when a devel version bumps, and might even be able to use the same underlying mechanisms.


@felixfontein Based on a suggestion from @sivel, I created ansible-test - Add managed test environment aliases by mattclay · Pull Request #86592 · ansible/ansible · GitHub to implement aliases for managed test environments. These aliases can be used to minimize, and in many cases eliminate, the changes required when test environments are updated.

For example, alpine can be used instead of alpine323 in the test matrix to refer to the Alpine Linux container. Likewise, rhel/10 can be used instead of rhel/10.1 for a RHEL 10 VM.

While this won’t eliminate the need to make changes when tests break due to running on an updated test environment, it will eliminate the need for most version bumps just to continue running the tests.
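To illustrate the idea (this is not the actual implementation in the PR), such an alias can be thought of as a version-less prefix that resolves to the newest matching dot release among the currently available targets. A minimal sketch in shell, with invented target names:

```shell
# Hypothetical sketch of alias resolution: "rhel/10" picks the newest
# available rhel/10.x target. The target list below is invented for
# illustration; it is not the real completion data shipped with ansible-test.
available="rhel/9.7
rhel/10.0
rhel/10.1
alpine323"
alias_prefix="rhel/10"

# Keep only targets matching the alias prefix, then version-sort and
# take the highest dot release.
resolved=$(printf '%s\n' "$available" | grep "^${alias_prefix}\." | sort -V | tail -n 1)
echo "$resolved"   # → rhel/10.1
```

With this scheme, a CI matrix entry like rhel/10 keeps working across a rhel/10.0 → rhel/10.1 swap without any change to the collection’s configuration.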
