In the Ansible community and partner engineering team, we would like to ask the community for advice to help us solve the following issue.
Currently, Red Hat partners who joined the program to get their collections certified and available on Automation Hub often have the collection tarballs they upload rejected because of errors reported in the Galaxy-importer log. This causes a lot of friction in the process, which we’d like to minimize by giving partners, and the community in general, a solution that helps them catch and fix those errors on their end before uploading collections to Automation Hub or Galaxy.
On Automation Hub, Galaxy-importer performs:
Collection building and basic checks, such as metadata validation.
Running Ansible Lint with the production profile.
Running Ansible sanity tests.
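For anyone who wants to try the three checks above locally before uploading, here is a minimal Python sketch. It is only an approximation: the exact options Galaxy-importer uses on Automation Hub are not stated in this post, so the `--force` and `--docker` options, the function names, and the idea of wrapping the tools in a script are assumptions made for illustration.

```python
import shlex
import subprocess
from typing import Dict, List


def certification_commands() -> List[List[str]]:
    """Commands that roughly mirror the three checks listed above.

    The flags are assumptions for a local run, not the exact
    invocation used on Automation Hub.
    """
    return [
        # 1. Build the collection tarball (validates galaxy.yml metadata).
        ["ansible-galaxy", "collection", "build", "--force"],
        # 2. Ansible Lint with the production profile.
        ["ansible-lint", "--profile", "production"],
        # 3. Sanity tests; ansible-test must be run from inside
        #    ansible_collections/<namespace>/<name>.
        ["ansible-test", "sanity", "--docker", "-v"],
    ]


def format_commands(commands: List[List[str]]) -> List[str]:
    """Return each command as a shell-ready string (useful for a dry run)."""
    return [shlex.join(cmd) for cmd in commands]


def run_all(commands: List[List[str]]) -> Dict[str, int]:
    """Run every command in the current directory and collect exit codes."""
    results = {}
    for cmd in commands:
        completed = subprocess.run(cmd, check=False)
        results[shlex.join(cmd)] = completed.returncode
    return results


if __name__ == "__main__":
    # Dry run: only print what would be executed.
    for line in format_commands(certification_commands()):
        print(line)
```

Calling `run_all(certification_commands())` from the collection root would execute the checks and return a mapping of command to exit code; any non-zero value points at a check worth fixing before upload.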
Our initial vision on how to solve this
We’ve created the ansible-collections/partner-certification-checker repository for collection certification onboarding which, among a few other things such as a README template, will contain a GitHub workflow we want to encourage partners to use in their repositories.
This solution has the following properties:
The repository is public.
It is a separate repository.
It is kept minimalist: it contains only necessary items (such as jobs in the workflow) for the purpose of content certification.
Let’s now discuss the properties
The repo is public under the ansible-collections org because the larger community can also benefit from it to improve the quality of their content before uploading it to Galaxy. Community contributions are welcome there. Many partners are also active community members and have their collections on Galaxy and included in the Ansible community package.
It is a separate repository. However, we have a feeling that there might be some overlap with what we have in the collection_template repo used as a template for initializing new collections repos on GitHub. We also refer to it from our collection inclusion requirements as a source of templates for testing (contains the ansible-test workflow), execution-environment.yml, README, LICENSE, etc. We have considered merging them, but I personally think that keeping the content from ansible-collections/certification/ in its dedicated repo will be less confusing.
The content is intentionally kept as minimal as possible. There are a lot of good and useful GitHub workflows and actions we could recommend (e.g., for releasing, for running integration and unit tests), but we intentionally decided to have only one certification.yml workflow that contains only the checks from Automation Hub. However, from there, we could refer to other community resources such as the collection_template repo if maintainers want to get a workflow for unit and integration tests or consult the community package inclusion requirements to learn community best practices.
We decided to minimize potential points of failure and not to refer to any other reusable workflows/actions except the ansible-community/ansible-test-gh-action@release/v1 one. The Galaxy-importer and Lint checks are pretty straightforward, so we don’t want to depend, say, on any of the ansible/ansible-content-actions workflows, some of which, in turn, use other tooling such as tox-ansible under the hood. This approach could be reconsidered, though, if the responsible teams make a strong commitment to ensure their stability.
We decided not to include unit, integration or any other unnecessary checks to keep things simple.
There’s also a test module we run the workflow against on a scheduled basis to make sure everything works.
What do you think about this effort and the implementation?
We’d love to hear from you in the comments!
Thanks for this discussion, and the thoughts so far.
I think this needs to state early on why it’s different from collection_template. I’m still unsure what specific things could (or MUST) be different. I know you are running tests in a different way, though wouldn’t calling ansible-lint, etc. be useful for all collections?
I think it could be useful to include links to how this could be performed in the GHA.
If ansible-collections/certification is meant to be used as a template, then maybe we need separate files for the collection’s README and for the documentation on how to use this repo.
First, the goal of the workflow is to make it as easy as possible for partners to apply it, i.e. just copy with no modifications.
Second, Lint runs with a very AH-specific “production” profile, which was developed for AH certification purposes. In general, how we run it here is very AH-specific. It wouldn’t hurt other collections to run it that way, but it’s quite restrictive compared to the default.
This is the plan, but out of the scope of this forum post:)
Described only the most problematic things (the most common reasons for rejection) in the onboarding process (though there’s a reference to the partner-facing docs)
Added an “Optional” section listing references to resources that are generally great for collection development but unnecessary for certification
The workflow contains only checks that run on AH: they run separately (not by galaxy-importer because of its limitations). Should be enough to catch most of the problems on the partners’ side
It’s ready to use without any modification: just copy-paste to a repo
There’s a test collection that the checks run against in GitHub Actions, both on a schedule and on every PR against the repo
Rely as little as possible on external stuff maintained by other teams:
We use only one external action for sanity checks: the rest is simple enough to maintain on our own, so we don’t depend on others and we reduce potential points of failure.
certification/README.md at main · ansible-collections/certification · GitHub is incorrect. The EOL dates for core have to come from the product, not the upstream docs. Certified collections have to continue supporting and testing against the default core release included in the relevant AAP version. This table is probably the better one to link to.
It’s probably worth even mentioning in the README that the certified collection owner should not track the EOL dates in the core matrix on docs.ansible.com as that is a common point of confusion.
I’d also suggest removing all the checks listed in the readme in this section. We would end up having to update that information in multiple places. It should all be in the certification workflow guide as the single source of truth.
Sorry to come in with dribs and drabs on the comments but I think we should remove the README_TEMPLATE.md as well because once again, we would have to update it in multiple places and will invariably forget. We want a single source of truth.
I’d also suggest adding to the readme that the repo contains a test collection etc so that people aren’t cloning it to create their first collection.
@samccann thanks for the feedback! Your suggestions from the first 2 comments were implemented by @oranod and I’ve just merged them, thanks
I’m not sure galaxy-importer uses -x sanity on Automation Hub.
We run sanity tests as a separate job, so it’d be extra to run them by galaxy-importer too.
needs to include core 2.15 and 2.16.
I think galaxy-importer on automation hub is using core 2.16. Dunno if that changes things here as that means the ansible-test is from core 2.16.
As we’ve discussed earlier, the goal of the workflow is not to run things exactly the same way as on AH, because of some known limitations and the complications that would bring. The way the checks run in the proposed workflow is good enough to catch most of the errors on the partners’ side.
At the moment it is not at all clear which versions of the tools are used.
Just one example: if a new ansible-lint version is released, do partners have to follow all of its recommendations immediately? The workflow run will install the new version and report all the failures.
Please specify the exact version of each and every tool used in the AAH check workflow.
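To illustrate the version concern, here is a small Python sketch that separates exactly pinned tools from unpinned ones in a requirements-style list. The helper name `split_pins` and the version numbers in the usage example are hypothetical placeholders; they are not the versions Automation Hub uses, which this thread does not state.

```python
import re
from typing import Dict, List, Tuple

# Matches an exact pin such as "some-tool==1.2.3" and nothing looser.
PIN = re.compile(r"^([A-Za-z0-9._-]+)\s*==\s*([A-Za-z0-9.*+!_-]+)$")


def split_pins(requirements: List[str]) -> Tuple[Dict[str, str], List[str]]:
    """Split requirement lines into exact pins and everything else.

    Returns (pinned, unpinned), where pinned maps tool name to version
    and unpinned lists the lines that would float to the latest release.
    """
    pinned: Dict[str, str] = {}
    unpinned: List[str] = []
    for raw in requirements:
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        match = PIN.match(line)
        if match:
            pinned[match.group(1)] = match.group(2)
        else:
            unpinned.append(line)
    return pinned, unpinned


if __name__ == "__main__":
    # Placeholder versions for illustration only.
    example = [
        "ansible-lint==1.2.3",
        "galaxy-importer",
        "ansible-core>=2.16  # floats to newer releases",
    ]
    pinned, unpinned = split_pins(example)
    print("pinned:", pinned)
    print("unpinned:", unpinned)
```

Anything that lands in the unpinned list will pick up a new release on the next workflow run, which is exactly the situation where new lint rules could start failing without any change on the partner's side.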