Unclear: should registries permit auto-deletion of manifests when last tag is deleted? #180
Comments
I have this question too. If we delete a tag that is associated with other images, then of course we wouldn't want to delete the image. However, if we delete the only tag linked to an image, I would expect the image to be deleted too, because otherwise you would have an image without a tag (it would still have a version, but no tag; is that allowed?). Finally, if there is a request to delete a manifest, I'd also assume that this cascades to deleting the associated image and blobs. |
Are others implementing registries that share blobs to save space? E.g., if blobs are unique by digest and a second image is uploaded with a blob that has the same digest, it could reuse the existing blob. Then, on deletion of an image (via tag or manifest), a blob would be deleted only if it is not linked to any other image. A simpler (but redundant) approach is to link each blob explicitly to one image, so that when the image is deleted via a tag/manifest, we can be sure the blob isn't needed by another one.
So, a follow-up question: given the above, if we get a request to upload a blob whose digest and content type already exist for some other image, do we do the upload again? Sharing blobs seems risky because, for a chunked upload where the digest is sent at the end, the server would need to calculate the digest of the entire upload to validate it, which is likely more intensive than hashing the body of a single request (a single POST, for example). I'm wavering on whether my implementation should allow shared blobs (and what a safe way to allow that would be). |
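For illustration, the shared-blob idea described above could be sketched roughly like this. It is a minimal in-memory sketch, not how any particular registry does it; `BlobStore`, `put`, and `delete_manifest` are hypothetical names, and a real implementation would persist the link table and handle concurrency.

```python
import hashlib


class BlobStore:
    """Hypothetical in-memory, content-addressed store shared by all manifests."""

    def __init__(self):
        self.blobs = {}       # digest -> bytes
        self.references = {}  # digest -> set of manifest ids linking to the blob

    def put(self, data, manifest_id):
        digest = "sha256:" + hashlib.sha256(data).hexdigest()
        # Store the bytes only once; a second upload of the same content
        # just adds another link to the existing blob.
        self.blobs.setdefault(digest, data)
        self.references.setdefault(digest, set()).add(manifest_id)
        return digest

    def delete_manifest(self, manifest_id):
        """Unlink a manifest and remove blobs that nothing else references."""
        removed = []
        for digest, owners in list(self.references.items()):
            owners.discard(manifest_id)
            if not owners:
                del self.references[digest]
                del self.blobs[digest]
                removed.append(digest)
        return removed


store = BlobStore()
shared = store.put(b"base layer", "image-a")
store.put(b"base layer", "image-b")       # same digest, no second copy
only_a = store.put(b"layer only in a", "image-a")
removed = store.delete_manifest("image-a")
```

After the delete, `removed` holds only the blob unique to `image-a`; the shared blob survives because `image-b` still links to it.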
An explicit deletion API makes this more apparent, but this situation happens all the time in normal operation. Since tags are mutable, whenever you push an image to a pre-existing tag, the previous image becomes tag-less. This happens a lot with "latest". Since you can pull images by digest, it's also pretty common for people to be referencing untagged images. If you assume that untagged images are okay to delete, this puts those users in a bad spot.
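A toy sketch of that situation, assuming a hypothetical in-memory `Repository` class (none of these names come from the spec): pushing to an existing tag leaves the old manifest untagged but still pullable by digest.

```python
class Repository:
    """Hypothetical in-memory repository: tags are mutable pointers to digests."""

    def __init__(self):
        self.manifests = {}  # digest -> manifest body
        self.tags = {}       # tag -> digest

    def push(self, tag, digest, manifest):
        self.manifests[digest] = manifest
        # Re-pointing the tag does not remove the manifest it used to point at.
        self.tags[tag] = digest

    def pull(self, reference):
        # A reference is either a tag or a digest.
        digest = self.tags.get(reference, reference)
        return self.manifests[digest]

    def untagged(self):
        return set(self.manifests) - set(self.tags.values())


repo = Repository()
repo.push("latest", "sha256:aaa", {"layers": ["sha256:111"]})
repo.push("latest", "sha256:bbb", {"layers": ["sha256:222"]})
old = repo.pull("sha256:aaa")  # still works for anyone pinned to the digest
```

If the registry deleted everything in `repo.untagged()` automatically, the last pull would start failing for users who pinned the digest.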
I'm not sure how you're distinguishing the image from the manifest here. From the registry's perspective, the manifest is the image, IMO.
Often, yes, but this is an implementation detail. |
FWIW, it's possible to resume the state of a sha256 hasher. See https://github.com/stevvooe/resumable and https://golang.org/pkg/crypto/sha256/#New, specifically:
I believe your implementation is in python, but I would suspect there is an equivalent library.
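In Python, the standard `hashlib` hashers can at least be fed incrementally, so the digest does not have to be recomputed over the whole upload at the end. A rough sketch, assuming the hasher can stay in memory for the lifetime of the upload session (`ChunkedUpload` is a hypothetical name; as far as I know `hashlib` objects cannot be pickled, so resuming across processes or restarts would need something else):

```python
import hashlib


class ChunkedUpload:
    """Hypothetical upload session that hashes chunks as they arrive."""

    def __init__(self):
        self._hasher = hashlib.sha256()
        self.size = 0

    def write_chunk(self, chunk):
        # Each chunk is hashed once, when it arrives.
        self._hasher.update(chunk)
        self.size += len(chunk)

    def finish(self, expected_digest):
        # Compare against the digest the client sends with the final request.
        actual = "sha256:" + self._hasher.hexdigest()
        return actual == expected_digest


upload = ChunkedUpload()
for chunk in (b"hello ", b"world"):
    upload.write_chunk(chunk)
expected = "sha256:" + hashlib.sha256(b"hello world").hexdigest()
ok = upload.finish(expected)
```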
I believe most registries do, yes. Be careful not to allow your implementation to leak information about blob existence across security boundaries: #18 (comment)
This is certainly a hard problem and how you implement this depends on the features/guarantees you are provided by the underlying storage layer.
I'd probably start with something simple like this and replace it with a more sophisticated strategy later. Also, be careful of race conditions here. It's possible to accidentally delete blobs that are referenced by manifests if you don't do garbage collection transactionally. Some registries do "stop-the-world" GC to avoid issues here, which is probably the easiest approach. |
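A very rough sketch of the "stop-the-world" idea, assuming everything lives in memory behind a single lock (`Registry` and its methods are hypothetical names, and real storage backends will not give you this for free):

```python
import threading


class Registry:
    """Hypothetical registry where writes and GC share one lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self.manifests = {}  # manifest digest -> list of blob digests
        self.blobs = {}      # blob digest -> bytes

    def put_blob(self, digest, data):
        with self._lock:
            self.blobs[digest] = data

    def put_manifest(self, digest, layer_digests):
        with self._lock:
            missing = [d for d in layer_digests if d not in self.blobs]
            if missing:
                raise KeyError(f"unknown blobs: {missing}")
            self.manifests[digest] = list(layer_digests)

    def collect_garbage(self):
        # Holding the lock blocks all writes, so no manifest can start
        # referencing a blob between the mark and the sweep.
        with self._lock:
            live = {d for layers in self.manifests.values() for d in layers}
            dead = [d for d in self.blobs if d not in live]
            for d in dead:
                del self.blobs[d]
            return dead


registry = Registry()
registry.put_blob("sha256:a", b"kept")
registry.put_blob("sha256:b", b"orphan")
registry.put_manifest("sha256:m", ["sha256:a"])
swept = registry.collect_garbage()
```

Note that this sketch would also sweep a blob that was uploaded but whose manifest has not been pushed yet, so a real implementation would probably want a grace period or to track in-progress uploads.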
@jonjohnsonjr if we are sharing blobs across images (manifests) and we don't need to finalize an association until a manifest is provided, why would we need to provide the name of the repository for any kind of POST request, e.g., |
Just to be clear, my answers are based mostly on experience operating a registry and writing a few registry clients -- I wasn't part of the initial spec writing, so I'm not privy to the "why" of most of these choices beyond what I could intuit.
One reason is that a registry implementation may not be sharing blobs, so you would need
I think that's part of it, yes. You want to reject a request early if the client doesn't have permission to upload images to that repository. It's also useful for the registry to know "where" the blobs should be stored, depending on how you partition namespaces. E.g. example.com/foo/bar might go to a different S3 bucket than example.com/foo/baz. |
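Both uses of the repository name could look something like the sketch below. The permission table, bucket names, and `start_upload` are made up for illustration; a real registry would consult its auth service and storage configuration instead.

```python
# Hypothetical tables, for illustration only.
PUSH_PERMISSIONS = {"alice": {"foo/bar"}}
BUCKETS = {"foo/bar": "bucket-one", "foo/baz": "bucket-two"}


class Forbidden(Exception):
    """Raised when a user may not push to the requested repository."""


def start_upload(user, repository):
    # Reject early, before any blob bytes are accepted.
    if repository not in PUSH_PERMISSIONS.get(user, set()):
        raise Forbidden(f"{user} may not push to {repository}")
    # The repository name also decides where the blob will be stored.
    return BUCKETS.get(repository, "default-bucket")


bucket = start_upload("alice", "foo/bar")
```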
So, this is a common discussion. I talked about it a bit here, in the ORAS repo: oras-project/oras#171 (comment). Basically, as time goes on, everyone will have to deal with delete. Look at the latest docker TOS updates. 15 petabytes of data. And, that's not the expensive part. We can't keep data forever, particularly docker images. Sorry, but anyone that says they don't believe in delete isn't being realistic, and I don't think that's what they really mean.
The problem is we've all implemented it with varying approaches. As for layer sharing, that is a registry optimization. Ok, ramblings on a rainy Friday in a pandemic. |
I believe in delete! I'll need to think about the layer/blob sharing. I refactored to use shared blobs this afternoon, because someone implementing a registry with Django is likely a small academic group running on a local filesystem, where storage is more of an issue and users are probably scoped to the group (which can better be trusted not to attempt such an attack). But it's still a concern that I have. Is the detail of how it would happen written anywhere?
Roger that! And I appreciate your rainy Friday, pandemic rantings! I'm having a lot of fun working on this :O) Have a good (hopefully not rainy, definitely not Friday, but probably still pandemic) weekend! |
I think that's roughly the point of the "never delete" crowd -- it doesn't cost that much to store these, so why clean them up?
Why not? GitHub just put a bunch of code in an arctic vault!
I think it's a bit more nuanced than that. As an analogy, within the git ecosystem, it's standard etiquette not to rewrite history because you can end up breaking anyone downstream of you. Git is a little different here because the entire commit history is inextricably linked, but I try to follow a similar etiquette with public images. IMO (barring legal concerns), any image that has ever been publicly discoverable via the registry API (i.e. tagged) should be kept indefinitely, otherwise you risk breaking someone who depends on that image. What that doesn't include:
Yeah I'd like to have a discussion about this, possibly just around standardizing response codes and response bodies. I'd love it if we could return a structured error that said something like I guess I'll stop rambling, but yeah this is hard and there doesn't seem to be an obvious answer to any of these questions that will make everyone happy. |
The proposed 1.0 spec includes deletion of tags via the HTTP API. What remains unclear is whether registries should be allowed to auto-delete a manifest when all of its associated tags have been deleted. This issue was referenced in the comments to PR #178.
@hallyn