Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: Attach Git information via image labels #1290

Closed
cdupuis opened this issue Aug 23, 2022 · 14 comments · Fixed by #1297
Closed

Proposal: Attach Git information via image labels #1290

cdupuis opened this issue Aug 23, 2022 · 14 comments · Fixed by #1297

Comments

@cdupuis
Copy link
Contributor

cdupuis commented Aug 23, 2022

As the industry attempts to move security futher to the left, it becomes increasingly important to link built and pushed container images to their origin; ideally via Git metadata. This enables numerous use cases for tools along the software supply chain. Here are just a few examples:

  • Inform contributors about updates to base images via workflows on GitHub
  • Trace images that get deployed to production to their Git source
  • Remediate CVEs by sending PRs to GitHub repositories updating apt-get install etc lines

We are aware that there is currently a lot of great work underway to securly and verifiably attach such provenance information to container images via signed attestations. Additionally similar provenance data is recorded when using remote Git contexts with buildx/buildkit. This work is still very early.

Ideally we would have a very pragmatic solution that could work today without requiring users of docker build to every build.

Therefore, we'd like to propose that buildx starts to record the following pieces of provenance when being run with a context that has a .git directory:

  • Git commit SHA of current checked out HEAD
  • Include flag to indicate state of local clone (dirty)
  • Dockerfile path

From a privacy perspective we believe that storing Git commit SHAs shouldn't represent a concern given what we see other tools doing (eg. go build storing very similar information without requiring opt-in) and images from private repositories can be pushed to private registries.

We propose to store this information in image labels following the naming conventions set out by https://specs.opencontainers.org/image-spec/annotations/ accepting the way label inheritance can complicate things. This convention is already widely adopted by vendors and projects:

Initially, we'd want to make storing the Git information by buildx opt-in via an environment variable switch.

There's a POC with the suggested changes at: cdupuis@cb8253f

Before raising a pull request with the proposed additions, we'd wanted to raise this issue for community feedback.

@jedevc
Copy link
Collaborator

jedevc commented Aug 23, 2022

A couple of queries/points/nitpicks on implementation details - I think the high-level sounds reasonable 😄

  1. Is there any reason you're proposing labels over annotations or attestations? We've recently added support to do annotations in buildkit at the exporter level (and will be adding attestations soon), which have a couple of benefits over labels for this use case:

    • Annotations can per-index, or per-manifest, while labels are only for a single image config, which has relevance for multi-arch builds - since otherwise the git info will be duplicated. If stored as a JSON in-toto blob, then the data would be shared as well between different architectures.
    • Labels inherit, which I think really makes this hard - I can imagine a scenario of using a new buildx to build a base image with git data, and then an old buildx using that image - it would then pull the git data output and propagate it through. Annotations and attestations don't inherit by design, so we sidestep that issue entirely. If popular official images were being built with this setting, but users continue to not upgrade immediately, then we'd see incorrect information populate through lots of the ecosystem.

    Both annotations and (an early preview) attestations should be present in the next release of buildkit. I'm not sure how release schedules will line up, but I imagine that this is something that would have to be released in a minor 0.X release of buildx, and hopefully that would include a new release of buildkit?

  2. To me, this feels like something we'd like to probably put into buildkit, instead of buildx - though I could be wrong. We've already got logic there for doing git things, and overall I think there's a preference for having more logic in buildkit instead of buildx, and it can sometimes be tricky to migrate functionality there over time.

  3. What's the use case for capturing the Dockerfile path? Could this possibly leak information about the directory structure of the builder system, and if so, should we relativise it to the main build context location? I think we should keep the logic for capturing this separate from git provenance if possible - I'm not sure of how it's directly connected to the git provenance (though I do see how they'd be useful to use together from atomist's POV).

@cdupuis
Copy link
Contributor Author

cdupuis commented Aug 23, 2022

A couple of queries/points/nitpicks on implementation details - I think the high-level sounds reasonable 😄

Thanks @jedevc for your feedback. Let me respond inline.

  1. Is there any reason you're proposing labels over annotations or attestations? We've recently added support to do annotations in buildkit at the exporter level (and will be adding attestations soon), which have a couple of benefits over labels for this use case.

I totally agree that annotations are preferable over labels for this kind of information (mostly because they aren't inherited). To me, moving to annotations opens up the following two questions really:

  • using annotations requires the usage of OCI media types for images. Is that something that is broadly acceptable at this point?
  • as you point out, annotation support isn't yet released as part of buildkit. We are trying to find a solution that we can use really quickly. But perhaps a buildkit release with annotation support isn't so far out?

I've updated the POC code to use annotations at: cdupuis@109c305

  1. To me, this feels like something we'd like to probably put into buildkit, instead of buildx - though I could be wrong. We've already got logic there for doing git things, and overall I think there's a preference for having more logic in buildkit instead of buildx, and it can sometimes be tricky to migrate functionality there over time.

I put this into buildx because it seemed like the only place where there is access to the .git folder if there is one. But perhaps that's wrong and buildkit also has access to the filesystem where the command is being run. I'll need to look into that a bit more.

  1. What's the use case for capturing the Dockerfile path? Could this possibly leak information about the directory structure of the builder system, and if so, should we relativise it to the main build context location? I think we should keep the logic for capturing this separate from git provenance if possible - I'm not sure of how it's directly connected to the git provenance (though I do see how they'd be useful to use together from atomist's POV).

Actually there are quite a lot of use cases that depend on knowing where the Dockerfile is within the Git repository. We use this to send update and fix PRs, re-pinning PRs or just to enrich layer instructions with start and end line numbers.

Of course we could separate that out if this seems misplaced here.

Good point about leaking file path information. I like the idea of making sure that we only store relative (to the root of the context) paths.

@tonistiigi
Copy link
Member

buildx or buildkit

While implementation in buildkit frontend is somewhat more consistent it is much more tricky. We would need to transfer the .git directory to the daemon side. Because it can be under .dockerignore, we can't use the build context for doing that, so would need to do something equivalent to --build-context git=.git. That would add quite a big performance penalty, and even that isn't enough to detect the dirty checkout directory. For that, we would need to transfer everything in the current working directory.

Alternatively, we could add a completely new session API to buildkit for this query, but that would mean we don't have a backward compatible solution. Also, because the directory is not frozen, there isn't really any guarantee that this is the same directory that is being sent in parallel with the build context.

Capturing Dockerfile path

Currently in buildinfo, sources are captured automatically(with opt-out) but attributes require -o buildinfo-attrs=true or --build-arg BUILDKIT_INLINE_BUILDINFO_ATTRS=1. If this is set then the Dockerfile path already gets saved into the buildinfo. The question is if we would want to change these defaults or somehow split the attrs into safe(eg. Dockerfile path) and unsafe ones. We didn't enable attrs automatically because there may be users who are incorrectly using build-args to set secret values.


Iiuc, compared to your POC cdupuis@cb8253f that uses env for opt-in you would like the real solution to be enabled by default?

Some considerations for that:

  • I don't think the privacy issue of adding this values to the image is particularly important one. There is a good precedence of Go doing it by default and seems users are OK with it.
  • There is a small performance penalty but not as big as transferring the .git directory.
  • How do we handle the projects that already set the label/annotation today? If they set one in the Dockerfile, they don't want us to reset it automatically.

@AkihiroSuda @cpuguy83 wdyt? Are you ok with client setting these labels(or annotations) automatically based on .git directory, with an opt-out mechanism?

We should still make it clear that users should prefer to build directly from Git URL. It has many benefits like BuildKit providing the actual validation of the Git data or tracking build cache by commit sha. There shouldn't be any case that works better with building from local dir than building from Git URL. But I agree that we have a lot of users, and it is hard to expect all of them to modify their builds.

@AkihiroSuda
Copy link
Collaborator

I don't think the privacy issue of adding this values to the image is particularly important one. There is a good precedence of Go doing it by default and seems users are OK with it.

s/OK/unaware/ 😅

Also, some users who do not use Go might not be OK with storing the value to the image by default.
So opt-in might be safer.

@kipz
Copy link

kipz commented Aug 24, 2022

I'm thinking that ultimately we need to make supply chain attestations (like VCS provenance) and validation of them the default behaviour, much like what browser vendors and letsencrypt have done with TLS, if we want to make significant progress here.

So for me, even if we start with an opt-in, there's still the question of when and how we make it an opt-out.

@jedevc
Copy link
Collaborator

jedevc commented Aug 24, 2022

While implementation in buildkit frontend is somewhat more consistent it is much more tricky.

Makes sense. I wonder if we might want to put the logic into the buildkit client package though, instead of directly in buildx, even if we can't get it into a frontend.

@cdupuis
Copy link
Contributor Author

cdupuis commented Aug 24, 2022

Makes sense. I wonder if we might want to put the logic into the buildkit client package though, instead of directly in buildx, even if we can't get it into a frontend.

@jedevc, I'll take a look at that. Thanks.

How do we handle the projects that already set the label/annotation today? If they set one in the Dockerfile, they don't want us to reset it automatically.

@tonistiigi, That's a good point. Any pointers how to solve this?

We should still make it clear that users should prefer to build directly from Git URL. It has many benefits like BuildKit providing the actual validation of the Git data or tracking build cache by commit sha. There shouldn't be any case that works better with building from local dir than building from Git URL. But I agree that we have a lot of users, and it is hard to expect all of them to modify their builds.

👍🏽

@cpuguy83
Copy link
Contributor

This seems ok as an opt-in.
It would probably be fine as an opt-out but I think that would be better server with OCI artifact types and the referrers api that's been proposed.

@errordeveloper
Copy link
Contributor

errordeveloper commented Aug 25, 2022

... build directly from Git URL. It has many benefits like BuildKit providing the actual validation of the Git data...

Validation can still be done, at least in theory, but it will be subject to repo accessibility. Also, once provenence data is turned into a signed attestation and not just labels, there is no reason to just accept it to be valid and leave validation as an optional (and possibly complex) step that one can undertake separately.

... or tracking build cache by commit sha.

Does it use tree/object hashes or actual commits? Is it documented? If it is based on commit hashes only, I am not sure it is always desirable.

There shouldn't be any case that works better with building from local dir than building from Git URL.

I think there are too many aspects and I cannot see a way of picking what is "better". We can look at repo access credentials to start with, but there is a dozen (or so) of other nuances of how people use git that will be hard to take care of in BuildKit.

@errordeveloper
Copy link
Contributor

This seems ok as an opt-in.
It would probably be fine as an opt-out but I think that would be better server with OCI artifact types and the referrers api that's been proposed.

Yeah, I think opt-in makes sense for the early label-based expreiment. Once it's implemented as a proper attestation, it should be an opt-out.

@tonistiigi
Copy link
Member

@AkihiroSuda

s/OK/unaware/ 😅

I think it was promoted heavily and feedback I saw was positive. Clearly Go maintainers could have left it opt-in as well but that would have meant far fewer users could have benefited from it.

Note that builds from git repository already store this info by default without any opt-in.

@jedevc

Makes sense. I wonder if we might want to put the logic into the buildkit client package though, instead of directly in buildx, even if we can't get it into a frontend.

Not sure what you have exactly in mind here, but buildkit client is very unopinionated. There is no such thing as "main source directory" or smth. so don't know how that would look like.

@cdupuis

That's a good point. Any pointers how to solve this?

Nothing very good. The label sent from the client takes precedence over the one defined by Dockerfile, what is not the behavior we want. With opt-in that wouldn't be a problem of course.

I guess if we go with annotations instead of labels then at least there is no issue with overwriting values but it might still be weird in some extreme cases where a label and annotation could point to a different value.

@cpuguy83

It would probably be fine as an opt-out but I think that would be better server with OCI artifact types and the referrers api that's been proposed.

We have merged the initial attestation PRs on BuildKit side that can be used for provenance, but I don't necessarily think they should be a blocker. These labels have been defined for a very long time, and even the annotation variants have been defined for years. All current tooling only understands labels/annotations.

@errordeveloper

Validation can still be done, at least in theory, but it will be subject to repo accessibility.

It can't be done with regular git tools that don't check file contents. Then .git directory content itself is not deterministic so there is no "valid format" for it. And if the checkout is dirty, it does not mean that the build should fail.

Also, once provenence data is turned into a signed attestation and not just labels, there is no reason to just accept it to be valid and leave validation as an optional (and possibly complex) step that one can undertake separately.

You can not validate the source later, it will not be accessible from the result after the build has completed. Signing does not address the same threat vector as the builder assuring that the content of the sources matches their description.

I am not sure it is always desirable.

It is desirable that you get build cache when reusing the same source.

cdupuis added a commit to cdupuis/buildx that referenced this issue Aug 29, 2022
as per docker#1290

Signed-off-by: Christian Dupuis <cd@atomist.com>
cdupuis added a commit to cdupuis/buildx that referenced this issue Aug 29, 2022
as per docker#1290

Signed-off-by: Christian Dupuis <cd@atomist.com>
cdupuis added a commit to cdupuis/buildx that referenced this issue Aug 29, 2022
as per docker#1290

Signed-off-by: Christian Dupuis <cd@atomist.com>
cdupuis added a commit to cdupuis/buildx that referenced this issue Aug 29, 2022
as per docker#1290

Signed-off-by: Christian Dupuis <cd@atomist.com>
cdupuis added a commit to cdupuis/buildx that referenced this issue Aug 31, 2022
as per docker#1290

Signed-off-by: Christian Dupuis <cd@atomist.com>
cdupuis added a commit to cdupuis/buildx that referenced this issue Aug 31, 2022
as per docker#1290

Signed-off-by: Christian Dupuis <cd@atomist.com>
cdupuis added a commit to cdupuis/buildx that referenced this issue Sep 1, 2022
as per docker#1290

Signed-off-by: Christian Dupuis <cd@atomist.com>
cdupuis added a commit to cdupuis/buildx that referenced this issue Sep 1, 2022
as per docker#1290

Signed-off-by: Christian Dupuis <cd@atomist.com>
cdupuis added a commit to cdupuis/buildx that referenced this issue Sep 1, 2022
as per docker#1290

Signed-off-by: Christian Dupuis <cd@atomist.com>
cdupuis added a commit to cdupuis/buildx that referenced this issue Sep 1, 2022
as per docker#1290

Signed-off-by: Christian Dupuis <cd@atomist.com>
cdupuis added a commit to cdupuis/buildx that referenced this issue Sep 1, 2022
as per docker#1290

Signed-off-by: Christian Dupuis <cd@atomist.com>
cdupuis added a commit to cdupuis/buildx that referenced this issue Sep 1, 2022
as per docker#1290

Signed-off-by: Christian Dupuis <cd@atomist.com>
cdupuis added a commit to cdupuis/buildx that referenced this issue Sep 1, 2022
as per docker#1290

Signed-off-by: Christian Dupuis <cd@atomist.com>
cdupuis added a commit to cdupuis/buildx that referenced this issue Sep 1, 2022
as per docker#1290

Signed-off-by: Christian Dupuis <cd@atomist.com>
cdupuis added a commit to cdupuis/buildx that referenced this issue Sep 2, 2022
as per docker#1290

Signed-off-by: Christian Dupuis <cd@atomist.com>
cdupuis added a commit to cdupuis/buildx that referenced this issue Sep 2, 2022
as per docker#1290

Signed-off-by: Christian Dupuis <cd@atomist.com>
cdupuis added a commit to cdupuis/buildx that referenced this issue Sep 2, 2022
as per docker#1290

Signed-off-by: Christian Dupuis <cd@atomist.com>
cdupuis added a commit to cdupuis/buildx that referenced this issue Sep 2, 2022
as per docker#1290

Signed-off-by: Christian Dupuis <cd@atomist.com>
cdupuis added a commit to cdupuis/buildx that referenced this issue Sep 6, 2022
as per docker#1290

Signed-off-by: Christian Dupuis <cd@atomist.com>
cdupuis added a commit to cdupuis/buildx that referenced this issue Sep 6, 2022
as per docker#1290

Signed-off-by: Christian Dupuis <cd@atomist.com>
@errordeveloper
Copy link
Contributor

@errordeveloper

Validation can still be done, at least in theory, but it will be subject to repo accessibility.

It can't be done with regular git tools that don't check file contents. Then .git directory content itself is not deterministic so there is no "valid format" for it. And if the checkout is dirty, it does not mean that the build should fail.

Also, once provenence data is turned into a signed attestation and not just labels, there is no reason to just accept it to be valid and leave validation as an optional (and possibly complex) step that one can undertake separately.

You can not validate the source later, it will not be accessible from the result after the build has completed. Signing does not address the same threat vector as the builder assuring that the content of the sources matches their description.

I am saying that, in principle, once there is a attestation, it can be validated separately. If that had to be done inside buildkitd, it should be somehow possible, albeit not without some additional setup steps, especially around repo access creds. And the validation tools may need to be designed from scratch for this purpose.

I am not sure it is always desirable.

It is desirable that you get build cache when reusing the same source.

I agree on that part, but I my concern was regarding what happend the other way around. I think that a commit that doesn't actually change anything relevant to a particular build shouldn't invalidate cache.

@tonistiigi
Copy link
Member

in principle, once there is a attestation, it can be validated separately.

I don't really understand this. Any attestation definitely can not be validated at all and just needs to be trusted. Eg. if you have a timestamp inside provenance attestation there is no way to not just trust that value. For the source, if there is a way to validate it at all then it needs to be included as full content or can be pulled by checksum.

but I my concern was regarding what happend the other way around. I think that a commit that doesn't actually change anything relevant to a particular build shouldn't invalidate cache.

Yes, this is how buildkit cache works. If the commit is different, then some steps need to run again, eg. we at least need to clone a repo again and calculate checksums over files that the build actually uses. If these steps determine that changes were not relevant then the cache will pick up even if initial commits were different. If commit can be guaranteed to be the same then no clone or file checksum steps (or in some cases you would need to run compiler again to prove that its still generates the same binary) are needed.

@errordeveloper
Copy link
Contributor

errordeveloper commented Sep 9, 2022

in principle, once there is a attestation, it can be validated separately.

I don't really understand this. Any attestation definitely can not be validated at all and just needs to be trusted.

I guess it was a bit inaccurate of me to talk validation, we are really talking about verification.

What I am implying is that one should be able to make interpretation of an attestation and use the information to re-construct at least the key elements of the build process. I am not suggesting that deep verification tools exist today, I would be surprised if theyh did, I'm only refering to the idea of verification in principle.

From https://slsa.dev/provenance/v0.2:

"To trace software back to the source and define the moving parts in a complex supply chain, provenance needs to be there from the very beginning. It’s the verifiable information about software artifacts describing where, when and how something was produced."

Eg. if you have a timestamp inside provenance attestation there is no way to not just trust that value. For the source, if there is a way to validate it at all then it needs to be included as full content or can be pulled by checksum.

There is some use to timestamps actually, if you have access to system logs, you can check if the build has occurred at the said time. The source/materials are specified by URI and digests, which definely makes it verifyible, albeit subject to durability and access.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants