Ansible Galaxy not functional?

Not at all! That was my response (embock) in the User Help and Social; I’m the PM for the Red Hat Galaxy team :slight_smile:

2 Likes

Is there a workaround we can follow?

We have production deployments which are failing as we cannot sync collections.

As a workaround, collections can be installed from Git instead of Galaxy: Installing collections — Ansible Community Documentation

Same for roles: Galaxy User Guide — Ansible Community Documentation

You’ll need to find out the URL for the collections and roles but that’s how it works.

For example, for the collections included in the Ansible community package you can find the URLs and tags/versions here: ansible-build-data/9/ansible-9.0.0-tags.yaml at main · ansible-community/ansible-build-data · GitHub
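
For instance, a requirements.yml along these lines works (a minimal sketch: the amazon.aws tag below is just an illustration, and the role repository is a made-up placeholder, so double-check the real URLs and tags yourself):

```yaml
collections:
  # Pin to a git tag, branch, or commit instead of pulling from Galaxy.
  - name: https://github.com/ansible-collections/amazon.aws.git
    type: git
    version: 7.0.0

roles:
  # Hypothetical role repository, shown only for the syntax.
  - name: my_namespace.my_role
    src: https://github.com/example-org/ansible-role-my-role.git
    scm: git
    version: main
```

Install with ansible-galaxy collection install -r requirements.yml and ansible-galaxy role install -r requirements.yml as usual.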

Edit: This could be an issue if GitHub goes down as well – which it does, from time to time. However, there is nothing preventing you from mirroring the git repos to your local/internal infrastructure and then pulling from there instead.
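
A rough sketch of that mirroring idea, assuming an internal Git server at git.internal.example.com (host and paths are placeholders):

```bash
# One-time mirror of the upstream repository
git clone --mirror https://github.com/ansible-collections/amazon.aws.git
cd amazon.aws.git
git push --mirror https://git.internal.example.com/mirrors/amazon.aws.git

# Periodically refresh the mirror (e.g. from cron)
git remote update --prune
git push --mirror https://git.internal.example.com/mirrors/amazon.aws.git
```

Your requirements.yml would then point at the internal URLs instead of GitHub.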

10 Likes

I tried using scm: git but I received this error:

ERROR! Collection artifact at ‘/var/lib/awx/.ansible/tmp/ansible-local-254A7paan/tmpnu6ByW/amazon.awsK8qSDe.git’ is not a valid tar file.

EDIT: I’ve realised I made a typo - thank you.

same for me

Thank you @rfc2549, and yes, that’s a good interim workaround. Thank you @cybette as well; we are indeed aware of this issue, and all of your feedback on how it’s affecting you and your workflows is helpful for triangulating the issue, so thank you to you all as well!

As cybette said above, we do have some possible solutions outlined and are working on getting them implemented. Bear with us: as it is the weekend we may be a bit slower than usual, but we will keep updating this thread as we progress towards a fix.

Also a heads-up that this seems to be a partial outage, so even if requests go through in some cases, that does not necessarily mean a full resolution is in place.

2 Likes

Another update here - the Red Hat team has a possible fix we’re working on testing.

3 Likes

I can’t overstate how much we wish you luck! (I just refreshed and the Galaxy site came right up; I’ll stay off it until y’all say you’re ready for traffic, though.)

I also expect to learn a lot from the post-event synopsis. This should be interesting.

5 Likes

I think this is quite a common problem :slight_smile: There are soooo many CI pipelines that depend on various online services, like PyPI, Ansible Galaxy, NPM, GitHub, Docker Hub, quay.io, Maven Central, various other centralized language package repos, various other OS package repos, various other container registries, etc. Once one of them goes down, a lot of folks are impacted.

The situation would be a lot better if we had more pullthrough caches (which keep serving the versions they cached while the service was still up, even when it is down), but setting them up and operating them often only becomes pressing in situations like this one, when something goes down.

I’m also not sure how many pullthrough caches / proxies actually exist that support Galaxy. I’m only aware of @briantist’s Galactory, which uses Artifactory as its backend.

And of course you can host your own galaxy_ng instance, though I’m not aware that you can use it as a pullthrough cache. There’s also @sivel’s amanda, which serves a directory of collection artifacts with the Galaxy v2 API; it requires you to pull the collections manually before they can be served, though.
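
If you do run your own instance or mirror, clients can prefer it and fall back to galaxy.ansible.com. A minimal ansible.cfg sketch (the internal URL is a placeholder and assumes a Galaxy-compatible API):

```ini
[galaxy]
server_list = internal_mirror, release_galaxy

[galaxy_server.internal_mirror]
# Placeholder URL for a self-hosted Galaxy-compatible server
url = https://galaxy.internal.example.com/api/galaxy/
# token = ...   # only if the instance requires authentication

[galaxy_server.release_galaxy]
url = https://galaxy.ansible.com/
```

ansible-galaxy tries the servers in the listed order when installing collections.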

Does anyone know of other solutions?

6 Likes

For production servers, it is always good to keep a local backup.

Since the introduction of galaxy-ng, it has had so many issues that I switched from using collections from Galaxy to cloning the Git projects and using them directly.

For this weekend’s issue, that turned out to be a good decision.

But now that my CI is failing when uploading collection and role updates, I have to reconsider whether to use Galaxy at all. :roll_eyes:

I’m using AWX and this gave me the motivation to build my own execution environment with all the collections baked in. Maybe a similar approach could be used in other pipelines that don’t use AWX?
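
For anyone who wants to try the same thing, a minimal execution-environment.yml sketch for ansible-builder might look like this (v1 schema for brevity; newer schema versions differ slightly, and the referenced files are assumptions):

```yaml
# execution-environment.yml
version: 1
dependencies:
  galaxy: requirements.yml   # collections/roles to bake into the image
  python: requirements.txt   # optional extra Python dependencies
  system: bindep.txt         # optional extra system packages
```

Build it with ansible-builder build -t my-ee:latest and point AWX (or your pipeline) at the resulting image.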

1 Like

I was thinking about this earlier today (b/c unexpected free time leads to such things I guess). How’s this for a fantasy? What if galaxy-ng had an API call that would return a container configured to act as a pullthrough cache, already set up to use that galaxy-ng instance as its immediate upstream? Not only that, but that container would include the API to pull further downstream containers from it, each configured with the one above as its upstream, and so on.

Thank you all for your input here - we’re still testing our current fix, and I don’t have an ETA for it in production just yet. I will update this thread again with our current status and progress toward resolution at 8am EST on Sunday at the latest.

I intend for part of the retrospective of this event to be a process for addressing outages in the future, including plans for more reliable communication. More on that upon resolution of the issue at hand though :slight_smile:

4 Likes

I had toyed around with the idea of pluggable backends for galactory to decouple it from Artifactory. I was also thinking of separating the Galaxy proxying functionality into its own package, so that it could be used more easily in other projects (and then adding it to something else like amanda might be little more than importing that package).

More recently I had the idea to maybe extend flask-caching for use in galactory, which could also bring it closer to a CI-type use.

All of these will probably involve a healthy amount of refactoring, something that’s difficult right now in the project because there are so few tests (writing tests for this is itself difficult). But this CI use case was one of the original design goals, and using it for CI in the absence of Artifactory has also been on my mind.

Funny thing is, I hadn’t noticed Galaxy being down at all: in my collection CI I use a custom GitHub Action I created for installing collections directly from git as a checkout (anyone could use this, but I don’t version it and don’t guarantee any sort of stability against changes), and in my internal uses I use galactory, where the upstream proxying and caching have done their job.
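
Not the action mentioned above, but a rough sketch of the same general idea for anyone curious: check a collection repository out into a path ansible-core will search (the repository below is only an illustration):

```yaml
- uses: actions/checkout@v4
  with:
    repository: ansible-collections/community.general
    path: ansible_collections/community/general   # the <namespace>/<name> layout matters
# Make ansible-core look in the workspace for the checked-out collection
- run: echo "ANSIBLE_COLLECTIONS_PATH=$GITHUB_WORKSPACE" >> "$GITHUB_ENV"
```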

3 Likes

After facing this issue for a couple of days, I looked at possible alternatives and found this project: GitHub - jctanner/galaxy-mirror: caching mirror for galaxy.ansible.com api.

It’s not particularly fancy and would probably deserve some polishing before being production-ready, but it definitely does the job, and the developer provides two versions (Python and Go), each with its own ready-to-use Dockerfile.

I deployed it directly in an LXC container and it works pretty well so far in a single-user lab environment.

2 Likes

It doesn’t look like you can use it as a pullthrough cache, but the documentation says collections can be synced from galaxy.ansible.com.

It’s not as convenient as a pullthrough cache, but for some people it might still be an option, especially if you don’t specify a version (that is, sync/download all versions) and there’s a way to run the sync job automatically, say once per day or so.
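
A trivial cron sketch of that idea; the sync command is a placeholder, so use whatever invocation the mirror project actually provides:

```
# crontab entry: refresh the mirror every night at 03:00
# /opt/galaxy-mirror/sync.sh is a placeholder, not the project's real command
0 3 * * * /opt/galaxy-mirror/sync.sh >> /var/log/galaxy-mirror-sync.log 2>&1
```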

2 Likes

Hi,
Api Root – Pulp 3 seemed to work for a short time and broke down again two hours ago.
Maybe I was not patient enough to wait for a final fix.

Best regards and thanks for the support
Andreas

1 Like

Hello all, no new updates at the moment - our proposed solution is still in testing. I’ll update here again at noon EST.

2 Likes

We have successfully completed testing, and now have a fix in production. We will continue to monitor it for stability, so please let us know here if you’re still seeing any issues.

12 Likes

fwiw, I also wrote up an example using GitHub actions to cache galaxy deps:
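
That write-up isn’t reproduced here, but the general shape is roughly this (a sketch assuming actions/cache and a requirements.yml, not the linked example itself):

```yaml
- id: galaxy-cache
  uses: actions/cache@v4
  with:
    path: ~/.ansible/collections
    key: galaxy-${{ hashFiles('requirements.yml') }}
# Only reach out to Galaxy when the cache missed
- if: steps.galaxy-cache.outputs.cache-hit != 'true'
  run: ansible-galaxy collection install -r requirements.yml
```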

7 Likes