Surface Deletions Better #11841

Open
dstufft opened this issue Jul 14, 2022 · 3 comments
Labels
feature request needs discussion a product management/policy issue maintainers and users should discuss

Comments

@dstufft
Member

dstufft commented Jul 14, 2022

There's currently a discussion going on about whether we want to change when things can be deleted from PyPI. It's not clear how that will turn out, but there will almost certainly still be situations in which things can get deleted from PyPI.

Currently when something gets deleted from PyPI there's no longer any record of it outside of the journals/audit logs. This missing information can make debugging harder for users of PyPI when some file goes missing that used to exist.

It might be worthwhile to expose these deletions in some way, possibly even giving people a way to add a note explaining why something was deleted.

For Python-level tools, if the version specifier allows other files besides the deleted one, the tool will just silently grab another version. This can paper over a lot of the obvious problems that happen with deletions (but not all of them, since there may not be other acceptable files), but it can actually make subtle bugs more frustrating to discover or debug, since you may end up with different versions. Pinning is The(TM) solution to that, but pinning makes it more likely that a deletion turns into a hard error, leaving people scratching their heads.
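
To make the trade-off concrete, here is a minimal sketch of the two behaviours. It is not how pip or any real resolver works: `parse` and `resolve` are hypothetical helpers, versions are plain dotted integers rather than full PEP 440 versions, and the version lists are made up.

```python
def parse(version):
    # Simplified assumption: dotted integers only; real tools follow PEP 440.
    return tuple(int(part) for part in version.split("."))


def resolve(available, minimum=None, pinned=None):
    """Pick the newest available version that satisfies the requirement."""
    if pinned is not None:
        candidates = [v for v in available if v == pinned]
    else:
        candidates = [v for v in available if parse(v) >= parse(minimum)]
    if not candidates:
        # A pin on a deleted version ends up here: a hard error.
        raise LookupError("no matching version found")
    return max(candidates, key=parse)


before = ["1.0.0", "1.1.0", "1.2.0"]
after = ["1.0.0", "1.1.0"]  # hypothetical: 1.2.0 has been deleted

# A range requirement silently resolves to a different version.
range_before = resolve(before, minimum="1.0.0")
range_after = resolve(after, minimum="1.0.0")
```

With a range, the result quietly changes from 1.2.0 to 1.1.0. With `pinned="1.2.0"`, the same deletion raises `LookupError`, and nothing says the version used to exist.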

For non-Python-level tooling, a lot of tools pin to a specific URL (or a set of URLs to allow for mirroring) and bake that into their downstream build systems.

In some cases, deletions probably go unnoticed by these systems because, as an implementation detail of PyPI, deletions don't actually delete the underlying file from our blob storage, and files.pythonhosted.org doesn't consult the database; it just goes directly to the blob storage. That means that if you know the full URL with the hash in it and are pinned to it, you're currently safe from deletions affecting you, BUT that is an implementation detail of PyPI and is subject to change at any time.

In other cases, downstream wants to be able to construct the URL from nothing but the package name and version, without having to bake in our long URL structure. Those downstreams are relying on a redirect powered by Conveyor, which hits the JSON API to fetch the real underlying URL and redirects to it. In those cases, when Conveyor tries to generate the redirect, it gets no information other than that the file doesn't exist in the JSON API, which it turns into a 404 with no additional details.
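
Roughly, the lookup described above behaves like the sketch below. This is not Conveyor's actual code: `find_file_url` and `redirect_for` are hypothetical names, the payload is a hand-written dict that only mimics the `urls` list (with `filename` and `url` keys) of a JSON API release response, and the URL in it is a placeholder.

```python
def find_file_url(release_payload, filename):
    """Return the real file URL listed for `filename`, or None if absent."""
    for entry in release_payload.get("urls", []):
        if entry.get("filename") == filename:
            return entry.get("url")
    return None


def redirect_for(release_payload, filename):
    """Build a (status, headers) pair the way a redirecting proxy might."""
    url = find_file_url(release_payload, filename)
    if url is None:
        # The file never existed, or it was deleted: the payload cannot tell
        # these apart, so the only thing left to say is a bare 404.
        return 404, {}
    return 302, {"Location": url}


# Hypothetical payload and placeholder URL, for illustration only.
payload = {
    "urls": [
        {
            "filename": "example-1.0.0.tar.gz",
            "url": "https://files.example.invalid/packages/aa/bb/cc/example-1.0.0.tar.gz",
        }
    ]
}

found = redirect_for(payload, "example-1.0.0.tar.gz")
missing = redirect_for(payload, "example-0.9.0.tar.gz")
```

The `missing` case is the problem: a deleted file and a mistyped filename produce exactly the same response.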

We could try to surface this situation in a better way, possibly providing details in the 404, or replacing the 404 with a 410 or something like that.
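
As a rough illustration of that idea, the sketch below picks between 404 and 410 depending on whether a deletion record exists. Everything in it is assumed: PyPI does not expose deletion records today, and `missing_file_response`, the `deletions` mapping, and its `deleted_on` and `note` fields are invented for the example.

```python
from http import HTTPStatus


def missing_file_response(filename, deletions):
    """Choose a response for a file that the index no longer lists.

    `deletions` is a hypothetical mapping of filename -> deletion record.
    """
    record = deletions.get(filename)
    if record is None:
        # No record of the file at all: a plain 404 is still accurate.
        return HTTPStatus.NOT_FOUND, {"message": "no such file"}
    body = {"message": "file was deleted", "deleted_on": record["deleted_on"]}
    if record.get("note"):
        # Optional free-form note explaining why the file was deleted.
        body["note"] = record["note"]
    return HTTPStatus.GONE, body


# Made-up deletion record, for illustration only.
deletions = {
    "example-1.2.0.tar.gz": {
        "deleted_on": "2022-07-14",
        "note": "uploaded by mistake",
    }
}

gone_status, gone_body = missing_file_response("example-1.2.0.tar.gz", deletions)
unknown_status, unknown_body = missing_file_response("example-9.9.9.tar.gz", deletions)
```

A downstream build system could then tell "this was deleted on purpose" apart from "this URL is simply wrong".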

I don't really have any specific ideas here, and it's possible that the discussions around restricting deletions end up making this an edge case that isn't really worth worrying about. I just wanted to get it down as something that we might want to do.

@dstufft dstufft added feature request needs discussion a product management/policy issue maintainers and users should discuss labels Jul 14, 2022
@mareeduvihari498

If we are talking about the case where packages can be deleted from PyPI and what the outcome would be, then I don't think it needs much attention. To improve the user experience, we could show when the package was deleted, but the rest of the problem need not be considered, as the chance of it happening is quite low and can be ignored.

@StevenMaude

StevenMaude commented Nov 18, 2022

As a related question, does this behaviour mean that sensitive data persists via pythonhosted URLs, when that data has been published in a PyPI package?

For example, see this PR where someone requested removal of a package entry from the pypi-data repository, after they also deleted the associated package which contained AWS access keys. [1]

The PR contains pythonhosted URLs. These URLs are still accessible to me. That behaviour is in line with @dstufft's comment here:

deletions don't actually delete the underlying file from our blob storage, and files.pythonhosted.org doesn't consult the database, it just goes direct to the blob storage.

It's surprising behaviour to me that deleting the package from PyPI does not delete the underlying data, and I'm not sure if it's documented anywhere that users might read. [2]

Footnotes

  1. The blog post by the repository owner gives more context. That repository owner deactivated the AWS key themselves, so in this particular case, the details should no longer be sensitive, although they do remain an embarrassment 😳

  2. It's not mentioned in the prompt that users see when deleting a package. If there are real legal issues for deletion, then it might also be necessary to make that data entirely inaccessible. (Deletion is often a tricky problem!)

@dstufft
Member Author

dstufft commented Nov 18, 2022

Yes.

It's not actually possible to wholly delete sensitive data once you've released it on PyPI. Even if we deleted things from the underlying storage, there's a large mirror network that mirrors near-instantly and is often configured not to respect deletions.

This means that once it's out there, it's out there. No take backs.
