Calculate digest from tarball #895

Closed
aelij opened this issue Dec 31, 2020 · 18 comments · Fixed by #896

Comments

@aelij

aelij commented Dec 31, 2020

Our build system produces tarballs and K8s deployment configurations, later to be deployed to multiple environments. We would like to get the image's digest at build time so we could reference it in the configuration, rather than relying on a tag, which is not stable.

Could Crane calculate the digest from a tarball? From what I gather, crane digest only works with a remote image.

@afbjorklund
Contributor

afbjorklund commented Jan 1, 2021

But the digest is from the remote image, so it cannot be calculated from the local image (without being saved with it)

We added some code to save the digest (unlike docker), but that doesn't work if the image has been created locally...

See #703

You will find that "RepoDigests" is empty.

It's a shortcoming of the image model...

@aelij
Author

aelij commented Jan 2, 2021

Sorry, I must be missing something :) Doesn't the container registry just run a hash algorithm over the bytes it receives? After all, the digest is identical if I push to multiple CRs. Why can't it be calculated the same way on the client side? Assuming one knows the hash algorithm used by the specific CR one is targeting.

@afbjorklund
Contributor

I meant that you need to push the image to the registry, in order for docker to calculate the digest.

Also totally misremembered what we had patched, we still don't save the digest in the local tarball.

@afbjorklund
Contributor

afbjorklund commented Jan 2, 2021

Was referring to moby/moby#32016

The registry hashes compressed layers.

@aelij
Author

aelij commented Jan 2, 2021

Can't we grab the code that calculates the digest from the container registry implementation and add it to crane? And compress the layers if necessary.

From that post, looks like Bazel does this, so it should be possible :)

@afbjorklund
Contributor

afbjorklund commented Jan 2, 2021

The way I have understood it, is that the Id is based on the image itself and the Digest is based on random files on the server.

Similar to when we download regular files, you have your git commit and git archive and you have your .tar.gz and checksum.

And even though they reference the same files, it's not possible to "guess" the checksum without knowing the server etc*.

* As described in "pristine-tar", a small delta remains (with compression artifacts like timestamps and other noise)

@afbjorklund
Contributor

afbjorklund commented Jan 2, 2021

I'm starting to think that the Digest is useless for identifying images, and better to use unique tags instead...

We have some users that want to compare "foo:latest" with "foo:latest", and for those we will use Id (locally)

i.e. for minikube we might have one image in the cache on the host, and one image stored in the cluster

So it would be nice to be able to know if we need to upload/uncompress a new image, or if the old one is OK

kubernetes/minikube#10075
(uses go-containerregistry)

@afbjorklund
Contributor

afbjorklund commented Jan 2, 2021

It's a common/constant source of confusion, maybe because they use the same algorithm (sha256) or something.

Like in #627: why it is faster to look up a value locally than to download the manifest from the registry.

@jonjohnsonjr
Collaborator

jonjohnsonjr commented Jan 2, 2021

We would like to get the image's digest at build time so we could reference it in the configuration, rather than relying on a tag, which is not stable.
Could Crane calculate the digest from a tarball? From what I gather, crane digest only works with a remote image.

Yes, crane could calculate the digest of a tarball that the specific version of crane would produce when pushing it. For this to be useful to you, you would need to make sure that the thing calculating the digest and the thing pushing the image are identical. If you were to calculate the digest with crane, and push with docker, we'd have no guarantees of them being the same. This is an unfortunate property of how most tools produce images in the tarball format, but it is not impossible to work around. Depending on what is producing the tarballs, you could even make this problem just go away entirely.

@aelij are these tarballs the output of docker save? Or something else?

I would actually be fine with just adding something like:

crane digest --format=tarball=foo.tar registry.example.com/my/image:foo

Not sure about the exact flags, but you get the point. It would be pretty trivial.

You will find that "RepoDigests" is empty.

Is this something that docker embeds in tarballs? I think it would be reasonable for us to add that, but I hadn't seen it before.

The way I have understood it, is that the Id is based on the image itself and the Digest is based on random files on the server.

This has not been the case for quite some time; however, there is one pathological case where docker will behave in a way that might make this appear to be true. I've been meaning to write this up as a little micro blog post because I thought it was interesting, but since I haven't done that yet, here is as reasonable a place as any, so you get a rough draft:


docker pull && docker push changed my image digest, what gives?

One of the touted benefits of using docker is the immutability of images. Because container images are content-addressable, you get a lot of nice properties for free: security, caching, and distribution are all "easy" problems when dealing with Merkle DAGs for the same reasons that git works really well.

Unfortunately, for historical reasons (something I find myself saying way too often in conversations about container tooling), docker doesn't take full advantage of these properties. I recently helped debug an issue where a customer reported that GCR was changing the digests of their images. If true, this would be a very serious issue, as it would break any workflows relying on the content-addressability of the images, so naturally I started digging in.

The usual suspects (optional)

Often, this happens when dealing with multi-platform images, but that wasn't the case here.

The next most common culprit is mismatched clients. This isn't actually relevant to the issue, but if you're interested, a detour into a history lesson:

During a brief transitional period, docker started producing and consuming what are called "v2, schema 1 images". These are not to be confused with "v1 images", which were essentially linked lists with random identifiers. Nor should these be confused with "OCI v1 images", which are, for all intents and purposes, equivalent to the modern docker v2 schema 2 images.

This modern format has guidance for registries on how to backwards-compatibly handle an old client that doesn't know how to pull a new schema 2 image. TL;DR, if the client doesn't supply the right "Accept" headers, you synthesize an equivalent schema 1 image on the fly and return it to them. This has made a lot of people very angry and has been widely regarded as a bad move. Because these images are content-addressable, rewriting them on the fly changes the digest of the thing being returned, so it becomes hard to know if the digest of this artifact is even real, much less if it's the "canonical digest" for an artifact.

What makes things worse is that docker itself had fallback behavior when pushing to a registry that doesn't support schema 2 images. If pushing schema 2 failed, docker would fall back to pushing schema 1. (This has since been fixed, but it took a while to land in a release.) Given that these are distributed systems, we see all kinds of random network failures, so pushing a schema 2 image might fail even if the registry supports schema 2. This has led to some interesting problems, like the official docker images containing manifest lists that reference schema 1 images and cannot be pulled by containerd.

Unfortunately for me, the client versions were the same everywhere, so it had nothing to do with fallback behavior...

Required reading

Instead, this issue was caused by an optimization in docker's pusher. To explain it, I'll need to define a few terms:

  • DiffID: This is the sha256 digest of the uncompressed tarball of a layer's file changeset. It is the "ID" of a "diff".
  • Image ID: This is a unique identifier for an image, usually in the context of a local docker daemon. This is derived from the sha256 digest of an image's config file. The config file references layers by their DiffIDs.
  • Layer Digest: The sha256 digest of the gzipped tarball, as it would be stored in the registry.
  • Image Digest: The sha256 digest of the image manifest, as it would be stored in the registry. The manifest references the config file by image ID and the layers by layer digest.

I had a hard time understanding that without pictures, so I've embedded some visualizations from the tarball and remote READMEs, both of which provide additional details, if you're interested.

Here's how an image is represented as a tarball:

[image: diagram of an image represented as a tarball]

Here's how an image is represented in a registry:

[image: diagram of an image represented in a registry]

Note that the manifest.json in a tarball is not the same as the manifest in a registry. Docker doesn't store the registry's manifest because everything it cares about is in the config file. Also, see how each layer has two identifiers, the diffid and the layer digest. Similarly, each image has two representations -- the uncompressed and compressed forms, so it also has two identifiers: the image ID and the image digest.
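
To make the four identifiers concrete, here is a minimal Go sketch using only the standard library. The layer bytes, the toy config, and the toy manifest are all assumptions for illustration (real config files and manifests carry many more fields), and the helper names are hypothetical; the point is only which bytes each sha256 is taken over.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// identifiers groups the four values defined in the list above.
type identifiers struct {
	DiffID, LayerDigest, ImageID, ImageDigest string
}

// digest returns the "sha256:<hex>" digest of b.
func digest(b []byte) string {
	sum := sha256.Sum256(b)
	return "sha256:" + hex.EncodeToString(sum[:])
}

// gzipBytes compresses b with the default gzip level.
func gzipBytes(b []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(b); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// describe derives all four identifiers for a single-layer image.
// The config and manifest here are toy JSON documents, not real ones.
func describe(layerTar []byte) (identifiers, error) {
	blob, err := gzipBytes(layerTar)
	if err != nil {
		return identifiers{}, err
	}
	ids := identifiers{
		DiffID:      digest(layerTar), // uncompressed layer
		LayerDigest: digest(blob),     // gzipped layer, as stored in a registry
	}
	// The config references layers by DiffID.
	config := fmt.Sprintf(`{"rootfs":{"type":"layers","diff_ids":[%q]}}`, ids.DiffID)
	ids.ImageID = digest([]byte(config))
	// The manifest references the config by image ID and layers by digest.
	manifest := fmt.Sprintf(`{"config":{"digest":%q},"layers":[{"digest":%q}]}`,
		ids.ImageID, ids.LayerDigest)
	ids.ImageDigest = digest([]byte(manifest))
	return ids, nil
}

func main() {
	ids, err := describe([]byte("pretend this is an uncompressed layer.tar"))
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", ids)
}
```

Note that the image ID never depends on the compressed bytes, while the image digest does, through the layer digest.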

From docker's perspective, as long as an image has the same config file, it's going to be the same image on disk once everything has been ungzipped, so it uses the Image ID as the "primary key" for local images. Similarly, because it's going to ungzip the layers, it uses the DiffID as the "primary key" for layers. Docker doesn't store (as of this writing) the registry's image manifest locally, and will just reproduce it whenever it needs to push an image. Because the logic for creating an image from a config file + layers is always the same, the manifest it produces is always the same, so the manifest digest is usually reproducible.

I say usually because it's not always, which is why I'm writing this.

The issue

As I said before, docker doesn't store the gzipped layers or manifest of an image it pulls from a registry, so if you're going to push an image, docker needs to gzip those layers so it can regenerate the manifest. Of course, gzipping layers is rather slow, so that's pretty wasteful. To work around this, docker stores a mapping from a layer's diffid to its digest (and where it's seen that digest). Before pushing any layers, docker will first send a HEAD request for that blob (by digest) to see if the registry already has it. If it does, docker can skip gzipping the layer entirely because it already knows what the digest will be and that it doesn't need to be re-uploaded.

This is great, but also the source of the issue. It's possible for the same diffid to map to different digests because of differences in gzip. If one client uses (the equivalent of) gzip --fast and the other uses gzip --best before uploading the same layer, they will produce different gzip archives for the same diffid. So, the sequence of events you might see:

  1. Client A uploads fast-gzipped-image:v1 to Registry A.
  2. Client B uploads best-gzipped-image:v2 to Registry B.
  3. Client A pulls best-gzipped-image:v2 from Registry B.
  4. Client A pushes best-gzipped-image:v2 to Registry A.

Client A's docker daemon sees that it has uploaded the diffids in this image to Registry A already and does a HEAD request on each layer to make sure they are still in Registry A. They succeed, so it reuses those digests to skip the expensive gzipping. If you compare the digest of best-gzipped-image:v2 in Registry A to the same image in Registry B, you will see that they are no longer the same. From your perspective, one of these registries has changed the digest, but it was actually your client doing this (and not telling you).

Oops! If you use docker anywhere in a pipeline for moving images around, you can run into this.

This is more or less why crane exists: we ensure that the digest is unchanged when copying images.


Similar to when we download regular files, you have your git commit and git archive and you have your .tar.gz and checksum.
And even though they reference the same files, it's not possible to "guess" the checksum without knowing the server etc*.
* As described in "pristine-tar", a small delta remains (with compression artifacts like timestamps and other noise)

While "docker build" and similar tools fall into this case, it is not generally true. It is possible to produce byte for byte identical images, as long as you know what you're doing and build things hermetically. All of the tools that I use have this property, but the vast majority of the internet builds images in a non-reproducible way, so it might be surprising to know that you can. As long as you keep the same gzip implementation and compression levels and strip timestamps everywhere, this is totally achievable. We do it in bazel, jib, and ko to great (cache hit rate) effect. Most dockerfile-based implementations use rough heuristics to try to cache non-reproducible results, which is okay for speed but a poor replacement for the alternative.

I'm starting to think that the Digest is useless for identifying images, and better to use unique tags instead...

I want to make sure that you don't walk away from this issue maintaining this idea. It's caused the docker ecosystem to get stuck in the dark ages, but it doesn't have to be this way.

We have some users that want to compare "foo:latest" with "foo:latest", and for those we will use Id (locally)

You essentially have to because docker often doesn't know the image digest :/

i.e. for minikube we might have one image in the cache on the host, and one image stored in the cluster

So it would be nice to be able to know if we need to upload/uncompress a new image, or if the old one is OK

This is certainly tough. Cache invalidation is hard :)

All that being said... it's going to be nearly impossible to decide on a stable identifier with docker in the mix, but with just tarballs and registries, we can definitely do better. An OCI image layout stores images essentially in the same format as the registry, which helps a lot with computing digests (you don't have to!). We've discussed making it easy to have a docker load-able image layout here: #651
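
To show why an image layout sidesteps the computation, here is a small Go sketch that reads the image digest straight out of a layout's `index.json`. The struct and function names are hypothetical, the digest in the example is a placeholder, and the sketch assumes the layout holds exactly one image; it only relies on `index.json` listing descriptors with `mediaType`, `digest`, and `size` fields.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// descriptor holds the descriptor fields this sketch cares about.
type descriptor struct {
	MediaType string `json:"mediaType"`
	Digest    string `json:"digest"`
	Size      int64  `json:"size"`
}

// index mirrors the top level of an image layout's index.json.
type index struct {
	SchemaVersion int          `json:"schemaVersion"`
	Manifests     []descriptor `json:"manifests"`
}

// soleImageDigest returns the digest recorded for the only image in the
// layout. No hashing or gzipping is needed: the digest is already stored.
func soleImageDigest(indexJSON []byte) (string, error) {
	var idx index
	if err := json.Unmarshal(indexJSON, &idx); err != nil {
		return "", err
	}
	if len(idx.Manifests) != 1 {
		return "", fmt.Errorf("expected exactly one manifest, found %d", len(idx.Manifests))
	}
	return idx.Manifests[0].Digest, nil
}

func main() {
	// Placeholder index.json; the digest value is not a real image.
	raw := []byte(`{
	  "schemaVersion": 2,
	  "manifests": [{
	    "mediaType": "application/vnd.oci.image.manifest.v1+json",
	    "digest": "sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
	    "size": 527
	  }]
	}`)
	d, err := soleImageDigest(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(d)
}
```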

I think that might solve some of these issues, but I'm not 100% sure how minikube expects tarballs to work or what it needs from them. (This is partially what I'm thinking about in #890 (comment))

@afbjorklund
Contributor

You will find that "RepoDigests" is empty.

Is this something that docker embeds in tarballs? I think it would be reasonable for us to add that, but I hadn't seen it before.

I was talking about the docker inspect output.

It calls them Id, RepoTags and RepoDigests.

$ docker inspect busybox | head
[
    {
        "Id": "sha256:a77dce18d0ecb0c1f368e336528ab8054567a9055269c07c0169cba15aec0291",
        "RepoTags": [
            "busybox:latest"
        ],
        "RepoDigests": [
            "busybox@sha256:49dae530fd5fee674a6b0d3da89a380fc93746095e7eca0f1b70188a95fd5d71"
        ],
        "Parent": "",

It's not saved to the tarballs, though. (Neither is)

[
  {
    "Config": "a77dce18d0ecb0c1f368e336528ab8054567a9055269c07c0169cba15aec0291.json",
    "RepoTags": [
      "busybox:latest"
    ],
    "Layers": [
      "b8383576921fcf341dc0221e7879d8a807e00b52b3bd22fefb532819109be313/layer.tar"
    ]
  }
]

@afbjorklund
Contributor

I think that might solve some of these issues, but I'm not 100% sure how minikube expects tarballs to work or what it needs from them.

We just want something that can be fed into (an imaginary) crictl load, so that we (or k8s) don't have to do a crictl pull later.

Since CRI doesn't have the API to load from cached files (sadly), we invented our own abstraction for all three supported runtimes:

https://github.com/kubernetes/minikube/blob/v1.16.0/pkg/minikube/cruntime/cruntime.go#L96

// LoadImage loads an image into this runtime
func (r *Docker) LoadImage(path string) error {
        klog.Infof("Loading image: %s", path)
        c := exec.Command("docker", "load", "-i", path)
        if _, err := r.Runner.RunCmd(c); err != nil {
                return errors.Wrap(err, "loadimage docker.")
        }
        return nil
}
// LoadImage loads an image into this runtime
func (r *CRIO) LoadImage(path string) error {
        klog.Infof("Loading image: %s", path)
        c := exec.Command("sudo", "podman", "load", "-i", path)
        if _, err := r.Runner.RunCmd(c); err != nil {
                return errors.Wrap(err, "crio load image")
        }
        return nil
}
// LoadImage loads an image into this runtime
func (r *Containerd) LoadImage(path string) error {
        klog.Infof("Loading image: %s", path)
        c := exec.Command("sudo", "ctr", "-n=k8s.io", "images", "import", path)
        if _, err := r.Runner.RunCmd(c); err != nil {
                return errors.Wrapf(err, "ctr images import")
        }
        return nil
}

We normally use scp (or similar ssh variation of cat) to copy the files.

The host is only expected to have minikube, and no other tools needed.

@afbjorklund
Contributor

@jonjohnsonjr :

I'm starting to think that the Digest is useless for identifying images, and better to use unique tags instead...

I want to make sure that you don't walk away from this issue maintaining this idea. It's caused the docker ecosystem to get stuck in the dark ages, but it doesn't have to be this way.

I'm starting to give up on Docker (and even more so on Podman) as well, but that's a different discussion...

Thanks a lot for the detailed explanation, and I hope it also helped the original poster about what crane can do.

It reminds me of the discussions that we had with the "reproducible builds" community, about timestamps etc.

I actually thought that we were all using pigz (with random number of processors), and that it was a lost cause.


Wonder if this policy has anything to do with the ancient images I've seen ? :-)

REPOSITORY                          TAG       IMAGE ID       CREATED        SIZE
gcr.io/gcp-runtimes/ubuntu_16_0_4   latest    d74596fc4bc7   51 years ago   149MB

i.e. the timestamp has been removed "Created": "1970-01-01T00:00:00Z",

@jonjohnsonjr
Collaborator

jonjohnsonjr commented Jan 2, 2021

Wonder if this policy has anything to do with the ancient images I've seen ? :-)

Indeed! Bazel does the same thing by overriding the creation timestamp:

$ crane config gcr.io/gcp-runtimes/ubuntu_16_0_4 | jq .
{
  "architecture": "amd64",
  "author": "Bazel",
  "created": "1970-01-01T00:00:00Z",
  "history": [
    {
      "author": "Bazel",
      "created": "1970-01-01T00:00:00Z",
      "created_by": "bazel build ..."
    },
    {
      "author": "Bazel",
      "created": "1970-01-01T00:00:00Z",
      "created_by": "bazel build ..."
    },
    {
      "created": "1970-01-01T00:00:00Z",
      "created_by": "/tmp/pkginstall/installer.sh"
    }
  ],
  "os": "linux",
  "rootfs": {
    "type": "layers",
    "diff_ids": [
      "sha256:04ae4f88fec2ac6ce09d1a77c269bee60ff6b87c99dbf915f2592e90c62eda0d",
      "sha256:1aa8b42580db4282464b3a3acd376cf038cb5c0c59c7855eb76098da8129a4e9",
      "sha256:84ff92691f909a05b224e1c56abb4864f01b4f8e3c854e4bb4c7baf1d3f6d652"
    ]
  },
  "config": {
    "Cmd": [
      "/bin/sh",
      "-c"
    ],
    "Env": [
      "DEBIAN_FRONTEND=noninteractive",
      "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
    ],
    "Image": "372d70139abd161278657f692231a3ac2e50e3ab4d601d01bc7b9a4bc6b81ef1"
  }
}

@jonjohnsonjr
Collaborator

We just want something that can be fed into (an imaginary) crictl load, so that we (or k8s) don't have to do a crictl pull later.

Support for docker to load OCI image layouts got stalled in moby/moby#33355 and I haven't bothered diving through all the related issues to see what the current status is, but presumably it will eventually work as docker relies more on containerd? Most newer things are starting to support image layouts, so that might be the way forward, eventually.

@aelij
Author

aelij commented Jan 3, 2021

@jonjohnsonjr Thanks so much for the detailed explanation! And the PR :)

I had not imagined this being so complex.

these tarballs the output of docker save

Yes. Would that work? We've migrated to using crane for pushing images.

crane digest --format=tarball=foo.tar registry.example.com/my/image:foo

Is the tag/registry needed here? Because from what I can tell, the digest does not depend on them.

@jonjohnsonjr
Collaborator

Is the tag/registry needed here? Because from what I can tell, the digest does not depend on them.

Not necessarily, as long as there's only one image in the tarball. The way I structured that PR was nice because I didn't have to change anything about the crane digest signature other than an optional flag. Making your case work is just slightly more complicated, but I wonder if it makes sense to just have a separate subcommand for interacting with tarballs instead of trying to overload the top level stuff...

@github-actions

github-actions bot commented Apr 6, 2021

This issue is stale because it has been open for 90 days with no
activity. It will automatically close after 30 more days of
inactivity. Reopen the issue with /reopen. Mark the issue as
fresh by adding the comment /remove-lifecycle stale.

@aelij
Author

aelij commented Apr 6, 2021

/remove-lifecycle stale

x7upLime added a commit to x7upLime/minikube that referenced this issue Jan 21, 2023
Being pkg/drivers/kic/types.go the source of truth for the version of
the container we're using to instantiate our kübernetes cluster in,
it should be appropriate to hardcode here the
imageId(a.k.a. contentDigest) so that it could be later used
as a discriminant to invalidate minikube's cache

contentDigest is the most reliable way to address image content:
if the image is tampered with after push to a registry,
the contentDigest we'd see after pull,
would be different than the one hardcoded here.
It is also part of the image itself, i.e. part of the tar archive;
thus giving us a way to always know if the cache is up to date,
even offline.

distributionDigest is the most reliable way to determine which
image we're looking to pull from a registry; a tag can be detached
from an image and recycled, referencing another one, with different
content.
It is not part of the image itself; it is computed on the image in
compressed state.. and since different engines/mechanisms could use
different types of compression, this digest is totally unreliable
as a way to address content.

[*] refs:
https://windsock.io/explaining-docker-image-ids/
google/go-containerregistry#895 (comment)
https://stackoverflow.com/questions/45533005/why-digests-are-different-depend-on-registry
https://blog.aquasec.com/docker-image-tags -- follow links
x7upLime added a commit to x7upLime/minikube that referenced this issue Jan 24, 2023
Being pkg/drivers/kic/types.go the source of truth for the version of
the container we're using to instantiate our kübernetes cluster in,
the pr should start here..

Initially I thought about hardcoding the contentDigest(a.k.a. imageId)
here as well, to then use it to check against the images inside the
kicDriver.. It later took another turn(we're retrieving it from tar).

Plus a collaborator showed me that it was a bad idea.. maintaining it
here would mean bumping it as part of the image build process.

The idea is based on the following concepts:

.contentDigest is the most reliable way to address image content:
if the image is tampered with after push to a registry,
the contentDigest we'd see after pull,
would be different than the one hardcoded here.
It is also part of the image itself, i.e. part of the tar archive;
thus giving us a way to always know if the cache is up to date,
even offline.

.distributionDigest is the most reliable way to determine which
image we're looking to pull from a registry; a tag can be detached
from an image and recycled, referencing another one, with different
content.
It is not part of the image itself; it is computed on the image in
compressed state.. and since different engines/mechanisms could use
different types of compression, this digest is totally unreliable
as a way to address content.

[*] refs:
https://windsock.io/explaining-docker-image-ids/
google/go-containerregistry#895 (comment)
https://stackoverflow.com/questions/45533005/why-digests-are-different-depend-on-registry
https://blog.aquasec.com/docker-image-tags -- follow links
x7upLime added a commit to x7upLime/minikube that referenced this issue Jan 26, 2023