Allow registries to reject non-existent subjects in manifests #459
If clients treat the referrers to a manifest the same as other descriptors in the manifest, then it makes sense to push all content for a given manifest, including referrers, before the manifest itself. It's the client equivalent of a registry ensuring all child manifests exist before allowing the index to be pushed. This helps with various client-side issues, including:
I don't think clients would implement both workflows if we change the spec to say MAY. Doing a full recursive copy with concurrency is complex enough without part of the copy sometimes moving after a wait lock, so this change would result in clients all switching to copying the manifest before the referrers. Even if clients defer tagging the manifest (pushing by digest, pushing all referrers, and then pushing the tag), I think scenarios 2, 3, and 4 from the list above are still impacted.
One of the questions that came up when we discussed this before is whether it's appropriate for OCI 1.1 to reject content that was accepted according to the OCI 1.0 specs. This change could create a scenario where existing content cannot be included in a transition to 1.1. What would the OCI guidance be for that?
Thinking about registries that reject an index if a child manifest is missing: in my experience, it's possible to delete one of those child manifests after the index is pushed. Applying a similar scenario to referrers, what happens if a manifest is pushed, referrers are pushed, the manifest is deleted, and later it is pushed again? Are the referrers immediately deleted, do they get reattached, or can the referrers be pulled by their individual digest but not appear in the referrers API?
To put some numbers to my concerns, I'll go back to a repository mirroring tool. If it knows the source always includes all metadata before pushing an image (ideally that would become a best practice), then it can do a simple HEAD request on each tag to compare the digest of the index. Without that guarantee, that check becomes recursive, checking every manifest for a referrers list.
To take what I'd hope is a not-too-distant example, consider an image with support for 10 platforms (golang has that today), where each platform has 5 referrers (2 SBOMs, a SLSA attestation, and 2 signatures). That's 1 (index) + 10 (images) + 50 (artifacts) = 61 manifests to check for referrers (an artifact could have its own referrer, e.g. a signed SBOM). I'd also need to do a full GET on each of those manifests from the source registry. So that's 61 manifest GETs plus 61 referrers-list queries, versus 1 manifest HEAD. Multiply that by the number of tags being copied, and the issue gets worse with more tags, more platforms, and more artifacts per platform.
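The request count above can be reproduced with a small calculation. This is a sketch under the comment's own assumptions (one index, 10 platforms, 5 referrers per platform, no nested referrers, one manifest GET and one referrers-list query per manifest); `requestsPerTag` is a hypothetical helper, and the tag count in `main` is an arbitrary example, not a figure from the thread.

```go
package main

import "fmt"

// requestsPerTag estimates the source-registry requests a mirror needs to
// verify one tag. Assumes one index, `platforms` image manifests, and
// `referrersPerPlatform` artifact manifests per platform, ignoring nested referrers.
func requestsPerTag(platforms, referrersPerPlatform int) (manifests, recursive, shortcut int) {
	manifests = 1 + platforms + platforms*referrersPerPlatform
	// Recursive check: one manifest GET plus one referrers-list query per manifest.
	recursive = 2 * manifests
	// Shortcut when referrers are known to be pushed before the tag: one HEAD on the tag.
	shortcut = 1
	return manifests, recursive, shortcut
}

func main() {
	m, r, s := requestsPerTag(10, 5)
	fmt.Println(m, r, s) // 61 122 1

	const tags = 20 // arbitrary example tag count
	fmt.Printf("%d tags: %d requests recursively vs %d with the shortcut\n", tags, r*tags, s*tags)
}
```

The gap grows linearly with tags and with platforms times referrers, which is the scaling concern raised in the comment.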
This discussion seems to stem from the fact that the artifact reference points the "wrong" way. Digests must exist before the manifest because the manifest points at the digests. Following the same logic, artifacts point at the manifest, so the manifest should exist before the artifact. There are a number of reasons behind the decision to have artifacts function the way they do. But by keeping the MUST here, it comes across as though we want artifacts to act like dependencies of the manifest, even though the dependency functionally runs the other way around. The increase in requests is a downside, but it is also a consequence of the initial design.
The direction depends on the perspective. From the server walking the descriptors, the manifest is a child of the artifact. From the client following the referrers API, the artifact is a child of the manifest.
Images and registries are designed around a Merkle tree. The reference artifact ends up being the top hash in the Merkle tree (even if it may not be the first hash clients request). Requiring a registry to accept the top hash before the content it points to breaks the ability of a registry to verify the Merkle tree before accepting content. In terms of the "wrong" or "right" way, the reference design has good reasons for this direction. The "MUST" language, though, breaks existing registry design and best practice.
Treating the subject as a descriptor that must be followed in the Merkle tree is a design decision of registries that I think is worth deeper consideration. Clients typically will not pull the artifact and then use that to find the image; typically it will be the reverse. Perhaps it would have been better to define the association using an HTTP header rather than a field in the manifest, to make the distinction between what's in and out of the Merkle tree clearer. But I believe we are past the point in the release cycle where a design change like that would be worth considering.
At least in the signature case, one would have to push the manifest (by digest) first, then push the signature, and then push the tag.
While I would prefer, from a client perspective, to keep this MUST so that clients have the choice of push order, there was never an intention to break 1.0 image registry storage systems with this MUST requirement. Loosening it now seems the only appropriate course of action.
Maybe this becomes a SHOULD. However, the following is a legitimate use case (cc: @sajayantony)
How does one achieve this? #459 (comment)
If nothing in the manifest indicates there even is a signature (since the signature points to the manifest, not the other way around), isn't this already a concern with the current design? There would be no validation on the registry side; it's only enforced on the client side. So the client should be able to handle it either way.
Yes, it may not be too bad in reality. I'll let @sajayantony and @sudo-bmitch weigh in on this one. Thanks for running the conformance tests, btw.
This was the bit about "transactionality" discussed in a prior OCI call; the thought was that the workflow you describe could be used (as mentioned in the issue body):
Past that, a proper "transactional upload" API seems like obvious future work, to better support existing (e.g. referrers and manifest lists) and future use cases involving any sort of reference.
It is possible for a client to upload the manifest before the signature, which is something admission controllers would block and alert on. But the concern being raised is that the registry would now be allowed to reject the signature-first order, so even a well-behaved client would not be able to push content in its preferred order.
This may help a single use case with signed images and admission controllers. But for the more general use case, this still has issues when images are copied by digest, including child images in a multi-platform list. My concern continues to be that a well-behaved client, with the cooperation of other well-behaved clients, would not be able to ensure content is complete without performing a fully recursive query on every manifest.
An upload could be interrupted at any point, triggering a future retry. When that happens with multi-platform images, clients know where they left off because of which manifests are already pushed, and can assume that child manifests and blobs have been pushed too. But with referrers, that assumption is no longer valid, and every child manifest must also be checked for missing referrers on every single copy.
As a result, running a mirror will become either a very expensive operation or an extremely error-prone one. Either every image is checked recursively on every update of the mirror, or tooling assumes that an image is complete if the manifest already exists, leaving the mirror in a broken state after any referrer copy fails. And anyone mirroring from Docker Hub with a recursive check will hit rate limits very quickly if they are not very careful to ensure the recursive queries are not done from the other registry. The rate limit is also one of the things that would trigger a broken copy if it occurs after the manifest is copied but before the artifacts are copied.
From this week's meeting discussion: this is a downgrade for users, who lose a capability they have with OCI v1.0 registries.
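To make the two mirroring strategies concrete, here is a small sketch over a hypothetical in-memory model (`store`, `mirror`, and the string digests are illustrative names, not a real client API; referrers are modeled as a map rather than derived from pushed artifacts). With `trustExisting`, the copy stops at any manifest already present in the destination; otherwise every manifest is re-fetched and queried for referrers. The single HEAD on the tag is not counted.

```go
package main

import "fmt"

// store is a hypothetical in-memory model of one repository. children maps a
// manifest digest to the manifests it references (e.g. index -> platform images);
// referrers maps a subject digest to the artifacts whose subject is that digest.
type store struct {
	children  map[string][]string
	referrers map[string][]string
}

func newStore() *store {
	return &store{children: map[string][]string{}, referrers: map[string][]string{}}
}

func contains(list []string, s string) bool {
	for _, v := range list {
		if v == s {
			return true
		}
	}
	return false
}

// mirror copies root and everything below it from src to dst and returns the
// number of source lookups. With trustExisting, a manifest already in dst is
// assumed complete and the walk stops there (the 1.0-style shortcut).
func mirror(src, dst *store, root string, trustExisting bool) int {
	lookups := 0
	var walk func(d string)
	walk = func(d string) {
		if _, ok := dst.children[d]; ok && trustExisting {
			return
		}
		lookups += 2 // one manifest GET and one referrers-list query
		for _, c := range src.children[d] {
			walk(c)
		}
		for _, ref := range src.referrers[d] {
			walk(ref)
			if !contains(dst.referrers[d], ref) {
				dst.referrers[d] = append(dst.referrers[d], ref)
			}
		}
		// The node itself is written last, after its children and referrers.
		dst.children[d] = src.children[d]
	}
	walk(root)
	return lookups
}

func main() {
	src := newStore()
	src.children["idx"] = []string{"amd64", "arm64"}
	src.referrers["amd64"] = []string{"sig-amd64"}

	// An earlier copy was interrupted: every manifest arrived, the signature did not.
	dst := newStore()
	for _, d := range []string{"idx", "amd64", "arm64"} {
		dst.children[d] = src.children[d]
	}

	n := mirror(src, dst, "idx", true)
	fmt.Println(n, dst.referrers["amd64"]) // 0 []  (missing signature goes unnoticed)

	n = mirror(src, dst, "idx", false)
	fmt.Println(n, dst.referrers["amd64"]) // 8 [sig-amd64]
}
```

The shortcut is cheap but silently leaves the mirror incomplete; the recursive walk repairs it at the cost of two lookups per manifest.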
Referrers were designed the way they were intentionally: we can add as many referrers as we want for the same digest and modify them, all without modifying the manifest. The manifest has no dependency on these things. Signatures are a valid use case, but which conditions to apply and when is a client concern. Even with a … It does sound like there could be a use case for transactional uploads, but this is more of a general concept that could be useful for many use cases.
I'm probably newer to the OCI spec than most here, but from what I've seen, the number of "well-behaved clients" is approximately 0, because everyone has a different definition of what "well-behaved" means. This isn't bad; clients interpret things differently depending on their use cases. If we want something to be handled consistently, it needs to be codified in the spec. Rate limits on Docker Hub are a separate concern, one that can be addressed based on changing customer needs and usage patterns, and shouldn't be a deciding factor when choosing between different paths.
I keep thinking about this problem, and I think we're honestly really close to uploads that look transactional today. I'll try to explain further, but hopefully this all makes sense. 🤞
For context, I'll start with the main use case I think is really important for the issue at hand: being able to make sure a signature is available before users try to pull an image (so that policy can be enforced accurately). Without this, we'll have frequent "brown outs" of image pulls as users race the push of the manifest versus the signature, and the frequency of users hitting those edge cases will increase with the number of users (cue my DOI maintainer hat: we've experienced exactly this with previous incarnations of our multi-architecture support, along with the angry users that generated).
Now my proposed (small, incremental) solution! Uploads of all objects can occur entirely by digest, including manifests. The only way users can discover a manifest to pull (using only official OCI distribution-spec APIs) is by:
So if I push an object by digest, the chances of someone trying to pull it before I'm "ready" are very, very low. Thus, it's only tagging that needs to be transactional, right? The only issue I see there is that "tagging" is not a direct action we can perform; it's a side effect we achieve by uploading a manifest by name instead of by digest (and manifests might not be small: possibly as large as 4194304 bytes, i.e. 4 MiB, which is a potentially heavy upload just to update a pointer).
In other (shorter) words, I'm proposing a new (or updated) API endpoint for lightweight tagging of an existing uploaded manifest without having to upload the entire manifest contents again. I believe this satisfies the need for a transactional API, hopefully in a way that's easy for existing registries to implement. In the signatures example, that means this could be our extended "transactional" flow:
FWIW, dropping the MUST here will (if I understand correctly) make it impossible for signatures to be pushed to a repository separate from the thing they reference. This was explicitly marked as a requirement in the user stories at the start of this effort: https://github.com/opencontainers/wg-reference-types/blob/main/docs/REQUIREMENTS.md#content-management
Dropping support for a user-facing requirement because of implementation challenges is understandable but disappointing this late in the game.
^ The implication is that one cannot "sign" the tag?
Correct; how would you sign the tag with the existing data model today?
@sudo-bmitch that is always going to be the case with referrers, as they were created so that references can be added at any time in the future, to support workflows such as signatures for later approvals rather than only at build time. This does make mirroring very difficult without an additional API that provides a feed of referrer changes, as there is no real definition of "complete" anyway.
It requires cooperation from the producer to implement, and the registry to support. But I don't believe it's going to be such a rare case. The model fits tightly with existing usage models of registries, where layers are pushed before the manifest and clients stop following the DAG when they reach content they've already seen. And the model fits closely with the current model of producing images, where a CI pipeline creates and pushes an image once, without frequent updates (a change would involve running a new pipeline and pushing a new image). In that model, the changes happening to referrers are unlikely to come from upstream; instead they come from data pipelines ingesting images from upstream registries, scanning them, and attaching SBOMs and internal approvals before pushing to an on-prem registry. And the pinned digest on that index is an important signal to those internal consumers that the image is complete, and also that it matches the upstream image.
Fundamentally, I believe the disagreement comes from the two sides wanting consistency guarantees but having different views of what that consistency looks like. Registry operators are defining the DAG with the signature as a root node, where consumers would pull the signature and walk that to the signed image. Consumers are drawing this as two separate DAGs, where the image DAG is pulled and the signature is queried from the referrers API to pull that separate DAG. I think each side is looking for a way to enable the other to get their result while retaining its own consistent model. So the proposal from registries is for clients to depend on load-bearing tags, and the proposal from clients is for registries to implement a transactional API. Each of these would be a compromise resulting in non-trivial issues.
I believe the issue is more that we're trying to enforce social guarantees on a system that was explicitly designed not to have guarantees. If we build systems that expect certain social contracts to be enforced based on the current use cases, we will run into issues when other use cases come up that require different social contracts. What if "ReadMes" or "examples" become artifacts, and they frequently change after the manifest is pushed? Maybe we need a different class of referrers that can only be pushed before their subjects.
Not enforced, permitted. We're giving clients the ability to build these workflows; whether and when they use them is up to them. We aren't forcing image producers to push the signature first; we're permitting them to develop that workflow if they want it. If other users want to modify their image later, they have that flexibility. And if the source does the latter and you want to copy those changes, then you'll need a recursive copy. But if the source does the former, then you have the choice to dramatically simplify the copy of an unchanged image.
The core issue is related to consistency: (1) has the full list been uploaded to the registry, and (2) are all the dependencies correct and available? I believe the deviations from the core Merkle DAG data model are the cause of these issues. When the referrers list is an uploaded index (such as in compatibility mode), there is no ambiguity over whether the referrers list is a complete set, since the availability of the referrers index is determined by the client, and it is also the root of a Merkle DAG. We shouldn't change or break the data model for referrers; it is unnecessary and causes more problems than it solves. I propose these changes for 1.1.
I believe this solves a number of issues with the referrers API including:
Looking over the suggestion, for me this doesn't offer any value over option 2, so my vote on the issue is unchanged. My concerns include:
Given that, I'm opposed to the proposal. I think it would have been worth considering during the working group, but this late in the release cycle, I feel it's too disruptive to the community that has already written so much code, both in the working group and now against the RC releases. A registry can decide not to implement the 1.1 spec and stick with 1.0, where the subject field is not defined and not part of the DAG, and where clients push the fallback tag as an index. I think that gives an identical result to the proposal (content addressability, no separate API, no filtering, all client-managed) while allowing clients to push content in either order.
Since we have had a lengthy discussion on this, had a vote that's been open for several weeks, and the vote is leaning against this request, my suggestion is to close this issue and move on. I say that with a lot of hesitancy, because I'd much rather find a solution where everyone meets in the middle. But in this case, it's been very contentious because there is no middle option that we've been able to find.
FWIW, putting some thoughts down ...
+1 here. I want to point out again that this wasn't a simple oversight or miss; it was a conscious decision to enable a common scenario that is in use today by real workloads: storing signatures, attestations, SBOMs, etc. in a repository separate from the image. This was captured as a core requirement in the earliest stages of this WG, and the MUST was placed there for a reason: to support that scenario.
Changing this from a MUST to a MAY would increase the number of registries that could support it, and still satisfy every other user story. That could arguably be better for interoperability than abandoning it all and relying on the fallback path.
True, but there may someday be a 1.2. If a registry can't support this because it breaks their data model, can they never include future features either? Or will we come up with a different workaround and relax the requirements?
This is interoperability in name only, or reductio ad absurdum: making everything optional would mean every server supports the spec, which would therefore mean everything is interoperable.
At least with respect to Docker Hub, could manifests with a subject field be routed to a different backend with relaxed constraints (only for manifests with subject fields)? Other manifests (including non-image ones without subject fields) would continue to land on the same backend as usual, so existing constraints are not violated; it's understood that this is production, so no one wants to mess with it. You would still have to solve the referrers lookup, of course.
I don't want to get too hung up on this, though it's in large part my fault that people are seizing on it. I included the mentions of Hub and ECR to make clear the practical impact of a change to the data model (which these registries, for better and for worse, have built into their design), since treating them as "unlikely to change in the short term" or "immutable for practical/argumentative purposes" helps ground the discussion in concrete reality.
That being said, I want to focus on the abstract data model and the "purity" (or not) of the data model, since that is what we really want to discuss; Hub and ECR will do what they will (or not) regardless of what decisions are made here, and I don't think it makes sense to suggest implementation changes here. The core issue is the data model, and whether a referrer is a weak or a strong link. Mandating a weak link for all implementations, when the rest of the spec allows for either a weak or a strong link, is where the current spec prescribes a particular data model that is breaking compared to the rest of the spec and its object relationships.
Data models we design follow use cases (vs) data models we design prescribe use cases.
I couldn't agree more! The data model is secondary to the use cases it enables.
This proposal hinges on multiple prerequisites, each of which would need to be true:
I don't believe we have a solution that finds common ground between the two views here, and we're unlikely to reach that point with more discussion. We put the issue up for a vote to see if there was consensus, and the majority is leaning towards moving forward without a change to the spec. Given the community's desire to get to a release, I'd suggest we either close this discussion or time-box it to prevent it from continuing indefinitely.
In a world where we are building a greenfield application, this is true. However, once we have a working model, any new use cases must fundamentally take the legacy models into consideration and account for the changes necessary to support the work. References are a great concept. Sparse manifests are an interesting concept. But in a world where we have an existing data model, there are many ramifications on the registry side and a lot of undefined behaviors to address. Even with a MUST, we cannot pretend it'll be some perfect world of interoperability; there are several gray areas in the spec that can and will lead to different implementations that clients need to account for.
This conversation came down to a vote within the OCI community here: #483. The result of this vote was "Opt 1: no change", as described here. To summarize, the OCI community does not wish to allow registries to reject non-existent subjects in manifests as part of OCI 1.1+, as proposed in this issue. Can we close this as resolved?
For historical purposes, the vote result of #459 at time of writing is:
@jdolitsky No, the underlying issue still exists and is unresolved. Brandon was willing to put in the extra effort on a change if enough folks said they wanted one. I'll put in that effort to resolve this issue; I am not expecting a simple revert of #341. The question of whether this is a breaking change will likely come down to the maintainers. I think it is pretty clear: a 1.0-compliant registry is allowed to verify a subject field, and now in 1.1 it is broken.
I disagree that this is a breaking change. The definition of breaking changes was also discussed extensively throughout the working group process; this feels like an attempt to throw away all of that work at the last minute. References to those discussions:
The What if I said my 1.0 registry was validating
I just opened a vote for maintainers only: #490. If we reach 5/8 votes in one direction, I'm assuming we will all move forward without further discussion.
This discussion is about changes to the distribution-spec, not the image-spec. Agreed, we have discussed backwards compatibility related to the image-spec at length, and I am not attempting to debate that. Once again, I ask that we discuss the technical proposals I mentioned and not bring other discussions back here, which brings the conversation full circle. I am trying to respect all the use cases and come up with a solution for them; please stop disregarding those efforts. To keep this on topic and possible to follow, I'm going to hide the off-topic messages.
Thanks for the links; interesting re-read. I acknowledge your frustration, and I don't think the discussion is an attempt to throw any of that work away. From your text in the link ^:
** Object: as defined in the distribution spec, "one conceptual piece of content stored as blobs with an accompanying manifest." The new MUST language (1.1 main branch) for the PUSH section states: "registry MUST accept an otherwise valid manifest with a …" Additionally, in the current branch: "A registry claiming conformance with one of these specification categories MUST implement all APIs in the claimed category." I agree with Brandon's response #340 (comment), copied here:
As Brandon states, both SHOULD and MUST allow a registry to support receiving artifacts from a client that refer to an image that is not local. It's more a question of client assumptions regarding universal support by all registries. Let's see what Derek comes up with.
There have now been 2 separate polls to resolve this issue: In both instances, the majority is in favor of not changing the current language in the specification. This issue has now been open for 3 months. At this time I would like to request that @opencontainers/distribution-spec-maintainers respect the poll results and move forward with a release.
Yet the language in the specification is still changing and still has open PRs, and both image and dist have known conflicts with regard to the MUST accept language being voted on. Please respect the issue, review, and voting process (FYI, votes such as 490 are not a simple majority; they require 2/3). Let's see what Derek proposes. As that work continues, we can "fix" the conflicts found, many of which would be fixed the same way with or without the MUST accept language, whatever that means.
Any updates here?
Based on the December 14th meeting, I believe we want to either close this or change the milestone so it's not blocking the 1.1 release. @dmcgowan are you in agreement with that, and do you have a preference for which?
Moved into the v1.2.0 milestone to clear out 1.1, as discussed on today's call.
During conformance testing it was found that registries which require strong references between manifests and blobs fail conformance due to MUST language in the spec requiring acceptance of a manifest referencing a non-existent subject manifest. While subject fields may be described as a weak reference, listing and querying them at large scale may require a strong reference (such as a foreign key in a database), or a registry may simply be inheriting the data model used in 1.0, which always had referenced objects (as viewed from the Merkle DAG) uploaded first.
The arguments for the MUST language were to (1) support registries which may have reference-only repositories, storing content elsewhere, and (2) ensure referrers exist at manifest pull time, since there is no atomic way to upload referrers with manifests.
For (1), the burden will be on the client to handle this case on upload, as a registry is not required to support such repositories.
For (2), clients can retry or check for freshness when validation is a requirement, or clients can ensure tags are only updated once all content is available. Similar issues have occurred in the past with multi-platform images. If images were uploaded before all platforms were available, then clients could see a race condition between the platform they need being built and the image they pull having that platform available. The same solution could apply here: use push by digest or a temporary tag when pushing manifests that should not be considered fully available, and "tag" them once complete (via upload of the manifest using a tag reference).
Changing the MUST language to MAY makes the most sense here. Additionally, we can add guidance in the spec on how to perform manifest uploads more transactionally. In the future we could consider a more explicit way to create and manage transactions.
Related to
#340
#341