[Vote ended on 2024-12-02] Saving space on PyPI

We ran into a problem when we tried to release 11.0.0rc1:

Project size too large. Limit for project ‘ansible’ total size is 10 GB.

@felixfontein requested to increase the limit, but this might take some time. In order to be able to release 11.0.0rc1, @gundalow deleted some yanked releases.

In the long run, I think having a higher limit would be the best solution. However, we should also discuss what we can do to save space. One suggestion is to remove old pre-releases. Deleting pre-releases prior to 2.10, however, should probably be discussed with core first.

Thanks for starting the discussion
Earlier today, I deleted the following releases which had been yanked previously:

  • 10.0.0: contains extra files which shouldn’t be included
  • 9.6.0: contains extra files which shouldn’t be included
  • 9.5.0: Accidentally contains a breaking change
  • 9.0.0: (no reason given)

I count ~35 pre-releases from 3.0.0b1 (Feb 2021) through 10.0.0rc1 (May 2024) which I think would be safe to delete (though maybe we yank them first?). That should give us some more space back while we wait for the request to be processed.

I’m not a fan of deleting existing releases. Deleting yanked and pre-releases is IMO OK as a last resort (as in the current case), but I would avoid even deleting pre-releases if possible.


@felixfontein Could you please outline your concerns? There may well be an impact of deleting pre-releases that I haven’t considered.

Someone noticed that 9.0.0 is gone: Ansible 9.0.0 not available anymore from the pip repo? · Issue #500 · ansible-community/ansible-build-data · GitHub

@gundalow: it’s mainly a personal preference, IMO releases are immutable and should stay there forever, if there aren’t very good reasons for not keeping them. (Like having malicious code in them, or some legal reasons.) The pre-releases are part of the release history like every other release.

But I’d definitely still prefer deleting them over not being able to publish new releases :slight_smile:


Agreed that deleting releases shouldn’t be taken lightly, but judging from prior experience, we have no idea when someone might get around to evaluating the PyPI quota increase request. If hard decisions have to be made, I’d suggest starting with the oldest alphas, then the oldest betas, and so on, and only as needed (i.e., not pre-emptively killing off all old pre-releases).


I like @nitzmahone’s proposal. The oldest pre-releases are for Ansible 2.5, so 2.5.0a1 would be the first release to be deleted once we need more space (but not before that).

Does anyone have better/other suggestions?

To figure out the size of the ansible PyPI repository without having admin access to it, you can use this Python snippet:

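import requests

# Request the JSON form of PyPI's Simple API index for the project; each file entry carries its size in bytes.
r = requests.get('https://pypi.org/simple/ansible/', headers={'Accept': 'application/vnd.pypi.simple.v1+json'}).json()
# Sum the sizes of all files of all releases and print the total in GiB.
size = sum(file['size'] for file in r['files'])
print(f"{size / 1024**3:0.4g} GiB")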

The current size is 9.731 GiB.

(Thanks to @nitzmahone for figuring most of this out, I just hacked a script together that summed up the numbers :wink: )


Precisely, we have 3 problems at hand:

  1. Monitor the space in PyPI and request increase when needed
  2. Come up with the rules of deletion of released packages (if/when needed and which one to be deleted, the process to be followed before and after the deletion)
  3. How and where to archive the deleted packages.
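For item 1, a rough monitoring check could look like the following sketch. It works on the `files` list returned by the PyPI Simple API JSON response. The function name `usage_report`, the 90% warning threshold, and reading the 10 GB limit as 10 GiB are assumptions made for illustration, not anything PyPI or the release tooling defines.

```python
GIB = 1024**3


def usage_report(files, limit_bytes=10 * GIB, warn_fraction=0.9):
    """Summarize project usage against an assumed size limit.

    `files` is a list of dicts with a 'size' key in bytes, shaped like
    the 'files' list of the Simple API JSON response.
    """
    used = sum(f["size"] for f in files)
    return {
        "used_gib": used / GIB,
        "free_gib": (limit_bytes - used) / GIB,
        # Hypothetical rule: warn once the assumed threshold is crossed.
        "warn": used >= warn_fraction * limit_bytes,
    }
```

Run periodically (for example before each release), a `warn` value of `True` would be the cue to file a limit increase request early.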

As for part of the rules, I agree with @felixfontein’s first comment on deleting existing releases.

I also wanted to spell out how people are able to hit yanked releases: pip install ansible never considers them during dependency resolution. Only if somebody has pinned the exact release will it be installed.
So the people affected by fully removing such releases are going to be those who favor reproducible deployments and pin ansible in their requirements files and scripts. We’ve seen one case, evidently, but there may be more people who have such pins and only run their automation periodically. JFTR.
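To see which files are currently yanked (and therefore only reachable through exact pins), the Simple API JSON response can be inspected. This is a minimal sketch, assuming each file entry has an optional `yanked` key that is either a boolean or a reason string; the helper name `yanked_files` and the sample data are made up for illustration.

```python
def yanked_files(files):
    """Map each yanked filename to its yank reason ('' if none was given)."""
    result = {}
    for f in files:
        yanked = f.get("yanked", False)
        if yanked:
            # A string value is the reason; a bare True means no reason given.
            result[f["filename"]] = yanked if isinstance(yanked, str) else ""
    return result


# Hypothetical sample entries, shaped like the 'files' list of the response.
sample = [
    {"filename": "ansible-9.5.0.tar.gz", "yanked": "breaking change"},
    {"filename": "ansible-9.0.0.tar.gz", "yanked": True},
    {"filename": "ansible-9.1.0.tar.gz", "yanked": False},
]
print(yanked_files(sample))
```

Note that releases that have already been deleted no longer appear in the response at all, so this only helps before deletion.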

Additionally, PyPI stats are available via BigQuery: Statistics · PyPI. We should be able to inspect it somehow and verify that the releases being removed have low downloads.

I’m not sure if I understand your proposal, so just to clarify:

  1. 2.5.0a1, 2.6.0a1, 2.6.0a2… 11.0.0a2, 2.5.0b1… or
  2. 2.5.0a1, 2.5.0b1, 2.5.0b2, 2.5.0rc1, 2.5.0rc2, 2.5.0rc3, 2.6.0a1…

So alphas from oldest to newest, then betas from oldest to newest and then rcs from oldest to newest or generally pre-releases from oldest to newest? I tend to the latter.

I fully agree. Let’s not do this generally, only when (as you put it) hard decisions have to be made in order to be able to do a new release.

The proposal is the former: first delete all a1 pre-releases (from oldest to newest), then all a2 pre-releases, etc.

I hope it won’t come to this, but this could mean we might have to delete the current 11.0.0a1 and 11.0.0a2 releases while keeping 2.5.0b1 which is 6 1/2 years old.
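The difference between the two orderings can be made concrete with a small sketch. It assumes version strings of the form `X.Y.Z` followed by `aN`, `bN` or `rcN`, and uses version order as a stand-in for age; the function names are made up for illustration.

```python
import re

_PRE = re.compile(r"^(\d+(?:\.\d+)*)(a|b|rc)(\d+)$")
_PHASE_RANK = {"a": 0, "b": 1, "rc": 2}


def parse_pre(version):
    """Split e.g. '2.5.0rc1' into ((2, 5, 0), 2, 1)."""
    m = _PRE.match(version)
    if m is None:
        raise ValueError(f"not a pre-release version: {version!r}")
    release = tuple(int(part) for part in m.group(1).split("."))
    return release, _PHASE_RANK[m.group(2)], int(m.group(3))


def order_by_phase(versions):
    """All a1 (oldest to newest), then all a2, ..., then b1, ..., then rc1, ..."""
    def key(version):
        release, phase, number = parse_pre(version)
        return (phase, number, release)
    return sorted(versions, key=key)


def order_by_version(versions):
    """Every pre-release of the oldest version first, then the next version."""
    return sorted(versions, key=parse_pre)


versions = ["11.0.0a1", "2.5.0b1", "2.6.0a2", "2.5.0a1", "11.0.0a2", "2.6.0a1"]
print(order_by_phase(versions))
print(order_by_version(versions))
```

With `order_by_phase`, `11.0.0a1` and `11.0.0a2` are queued before `2.5.0b1`; with `order_by_version`, `2.5.0b1` goes second, right after `2.5.0a1`.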

Why is it more important to keep old betas than current / pretty new alphas? I’m open to both, I just want to understand. I would have said the other way round makes more sense.

I don’t see the point in keeping old alpha/beta versions at all - they served their purpose back in their days, but IMO it is extremely unlikely any new value will be obtained from those releases at all.

And, of course, if there is something to be achieved, we could/should copy these files over to a simple httpd server or file server anywhere and remove them from PyPI. PyPI is not a storage solution.

I somehow agree with @felixfontein. I also have a bad feeling about deleting old packages / versions. I think we should keep them, at least for historical reasons. Even if they’re only pre-releases.

On the other hand, I agree with @russoz that PyPI might not be the right place to do this.

I’ve been searching for ways to archive PyPI packages in case there’s already a solution, and stumbled upon Software Heritage. It looks like they have a way to archive PyPI packages. Would this be a way to a) delete old packages from PyPI and b) not lose them completely?

There’s also archive.org.

@SteeringCommittee What do you think?


Errr… why not use GitHub itself for storing the released artifacts? Their platform supports it.

GitHub also has size limits. I don’t know how they work and what exceeding them results in; in the worst case (I can think of right now), it could happen that we exceed the size limit of the ansible-community organization, and suddenly we can no longer add commits to any of the repositories in there until we start deleting things. I don’t think size limits would/should work that way, but :person_shrugging:

The current size seems to be 9.814 GiB. I think this is too much: we won’t be able to do the three releases (9.13.0, 10.7.0 and 11.1.0) planned for December 3rd.

Could also (for the newer stuff anyway) favor killing wheels first for prereleases before the sdists…

That’s a great idea. Deleting all wheels of pre-releases should free up some hundreds of megabytes, while the source dists are still there for historical purposes.

So maybe the refined proposal:

  1. First delete wheels of pre-releases, starting with the oldest ones.
  2. Once all wheels of pre-releases are gone and we need more space, start with alpha 1 pre-releases, from oldest to newest.
  3. After that, the alpha 2 releases from oldest to newest.
  4. After that, alpha 3; then beta 1; then beta 2; then beta 3 (did we ever do that?); then rc1; then rc2.
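The refined proposal can be sketched as a deletion queue over the `files` list from the Simple API JSON response. This is only an illustration under assumptions: version order stands in for age, versions are read from standard wheel and `.tar.gz` filenames, and `deletion_queue` / `files_to_free` are hypothetical names. Nothing here deletes anything; it only computes which files would go first.

```python
import re

_PRE = re.compile(r"^(\d+(?:\.\d+)*)(a|b|rc)(\d+)$")
_PHASE_RANK = {"a": 0, "b": 1, "rc": 2}


def _version_of(filename):
    """Extract the version from a wheel or .tar.gz sdist filename."""
    if filename.endswith(".whl"):
        return filename.split("-")[1]
    if filename.endswith(".tar.gz"):
        return filename[: -len(".tar.gz")].rsplit("-", 1)[1]
    return None


def _pre_parts(filename):
    """Return (release, phase, number) for pre-release files, else None."""
    version = _version_of(filename)
    m = _PRE.match(version) if version else None
    if m is None:
        return None
    release = tuple(int(part) for part in m.group(1).split("."))
    return release, _PHASE_RANK[m.group(2)], int(m.group(3))


def deletion_queue(files):
    """Pre-release wheels first (oldest first), then a1, a2, ..., b1, ..., rc1, ..."""
    wheels, others = [], []
    for f in files:
        parts = _pre_parts(f["filename"])
        if parts is None:
            continue  # final releases are never queued
        release, phase, number = parts
        if f["filename"].endswith(".whl"):
            wheels.append(((release, phase, number), f))
        else:
            others.append(((phase, number, release), f))
    wheels.sort(key=lambda item: item[0])
    others.sort(key=lambda item: item[0])
    return [f for _, f in wheels] + [f for _, f in others]


def files_to_free(files, needed_bytes):
    """Take files from the front of the queue only until enough space is freed."""
    chosen, freed = [], 0
    for f in deletion_queue(files):
        if freed >= needed_bytes:
            break
        chosen.append(f["filename"])
        freed += f["size"]
    return chosen
```

`files_to_free` reflects the "only as needed" part of the discussion: it stops as soon as the requested number of bytes is covered instead of queuing every pre-release.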