Add repositories files to image tarballs #526

Closed
smukherj1 opened this issue Sep 13, 2019 · 8 comments · Fixed by #536
Labels
good first issue Good for newcomers help wanted Extra attention is needed

Comments

@smukherj1
Contributor

Create a repositories file for image tarballs generated by the v1.tarball package. Image tarballs generated by docker save include this file, and our internal repo requires it because the file is part of the v1 & v2 schema.

For the image l.gcr.io/google/bazel@sha256:97bfeed0303cae14af7e8f66aad6c13f00b2b33081c59d0f4258717b8b94efec, the repositories file looks like:

{"l.gcr.io/google/bazel":{"latest":"626b494fdfba1950ebdf1ad5cc2e799879ec78ab8bb1bae11de6c9491fbab6cf"}}

This appears to be a map from the image name to the tag to the digest of the topmost layer. This is currently blocking bazelbuild/rules_docker#580
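A minimal sketch of that shape in Python, assuming the file is nothing more than the nested JSON object shown in the example. The helper names `build_repositories` and `render_repositories` are hypothetical and not part of any library:

```python
import json


def build_repositories(entries):
    """Build the nested name -> tag -> id mapping for a repositories file.

    `entries` is an iterable of (repository, tag, top_layer_id) tuples.
    """
    repositories = {}
    for repository, tag, top_layer_id in entries:
        repositories.setdefault(repository, {})[tag] = top_layer_id
    return repositories


def render_repositories(entries):
    # Compact separators mirror the sample above; whether consumers care
    # about whitespace is an assumption that has not been checked here.
    return json.dumps(build_repositories(entries), separators=(",", ":"))


print(render_repositories([
    ("l.gcr.io/google/bazel", "latest",
     "626b494fdfba1950ebdf1ad5cc2e799879ec78ab8bb1bae11de6c9491fbab6cf"),
]))
```

Several tags of the same repository end up under one top-level key, which matches the name -> tag -> id reading of the sample.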

@jonjohnsonjr
Collaborator

jonjohnsonjr commented Sep 13, 2019

This might actually help a lot with #517

Nevermind, this just has the image digest, not the layers.

@jonjohnsonjr
Collaborator

Some more context: containers/skopeo#425

@jonjohnsonjr jonjohnsonjr added good first issue Good for newcomers help wanted Extra attention is needed labels Sep 13, 2019
@smukherj1
Contributor Author

> This might actually help a lot with #517
>
> Nevermind, this just has the image digest, not the layers.

Minor correction: I believe it's the digest of the topmost layer. So it is essentially a layer digest, but I don't think it will help much with the optimization PR. Maybe the pusher can avoid calculating the digest of the topmost layer if the repositories file is present, but that's it.

@jonjohnsonjr
Collaborator

> Minor correction. I believe it's the digest of the top most layer.

That seems odd, but v1 images were basically linked lists, where each layer could be a complete image and referenced its parent image, so I can believe it. I'd be surprised if that's the case, but I'd like to see how this works when running docker save on various types of images:

  • manifest list
  • schema 2
  • schema 1

@smukherj1
Contributor Author

The Python implementation that extracts this value is here. If you follow it, you'll see it's used to extract the value returned by the top function on docker_image here, whose documentation suggests it's the layer id, which I'm assuming is the digest.

The above implementation is specifically for v1 images, so it's possible the value is different for v2 images & manifest lists.

@jonjohnsonjr
Collaborator

Just took a look at this and it's not clear to me how to proceed. That digest value isn't present in the schema 2 manifest or config ☹️

In order to replicate docker's (or containerregistry's) behavior, we'd need to convert schema 2 to schema 1, then schema 1 to v1, then take the top layer's value. Converting from schema 2 to schema 1 is nontrivial, but I've done it before. I don't have any idea how to convert from schema 1 to v1 (that was deprecated way before I started caring about container trivia).

How is this value actually used? Why do we need it? We could use the digest of the top layer from the schema 2 manifest, but I suspect that wouldn't work out.

@smukherj1
Contributor Author

> Just took a look at this and it's not clear to me how to proceed. That digest value isn't present in the schema 2 manifest or config ☹️
>
> In order to replicate docker's (or containerregistry's) behavior, we'd need to convert schema 2 to schema 1, then schema 1 to v1, then take the top layer's value. Converting from schema 2 to schema 1 is nontrivial, but I've done it before. I don't have any idea how to convert from schema 1 to v1 (that was deprecated way before I started caring about container trivia).
>
> How is this value actually used? Why do we need it? We could use the digest of the top layer from the schema 2 manifest, but I suspect that wouldn't work out.

Yeah, you're right. It's not as simple as just putting the digest of the topmost layer in the repositories file. I took a look at the Python containerregistry code and I believe the value is the v1 layer digest of the topmost layer. The _GenerateV1LayerId function here generates this v1 layer digest. It seems to be a chained digest, where the v1 digest of the current layer is sha256sum(curLayerV2_2Digest + " " + prevLayerV1Digest). For the topmost layer it seems to be sha256sum(curLayerV2_2Digest + " " + prevLayerV1Digest + " " + rawConfig).

So I'm guessing it should be possible to generate this digest without going through the v2_2 -> v2 -> v1 conversion.
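A rough Python sketch of the chaining as described in this comment, not a port of _GenerateV1LayerId. It assumes that the bottom layer uses an empty string as its parent id, that the pieces are joined with single spaces, and that the config is passed as its raw JSON text. The function name `v1_layer_ids` is hypothetical:

```python
import hashlib


def v1_layer_ids(layer_digests, raw_config):
    """Return chained v1-style ids, one per layer, bottom layer first.

    layer_digests: v2.2 layer digests ordered from bottom to top.
    raw_config:    the raw config JSON as a string (mixed into the top id only).
    """
    ids = []
    parent = ""  # assumption: the bottom layer has no parent id to mix in
    for index, digest in enumerate(layer_digests):
        material = digest + " " + parent
        if index == len(layer_digests) - 1:
            # Only the topmost layer also hashes the raw config.
            material += " " + raw_config
        parent = hashlib.sha256(material.encode("utf-8")).hexdigest()
        ids.append(parent)
    return ids
```

Under these assumptions the last element of the returned list is what would go into the repositories file. The result would still need to be compared against what docker save or containerregistry emits for the same image.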

As for how it's currently used, I found a bunch of cases in our internal codebase that load the docker image built by rules_docker as a v1 tarball. I have replaced a few but there might be a bunch more. The one that's going to be tricky to fix is the python containerregistry tests that test the v1 -> v2 compatibility layer. So I was hoping to just generate this file if it's simple enough instead of locating all tests & negotiating fixes with their owners.

@jonjohnsonjr
Collaborator

So do we need to be able to read and generate v1 tarballs? There are two tarball formats here, and we generate/read the more "modern" one in ggcr. I'm not sure how much we have to do to unblock bazelbuild/rules_docker#580, and I'm not sure what the easy path forward is (some changes here, some changes in the python implementation, change callers, etc.)

Summarizing the differences:

crane

Let's take a look at what crane produces:

$ crane save busybox crane.tar && mkdir crane && tar xf crane.tar -C crane

$ tree crane
crane/
├── 7c9d20b9b6cda1c58bc4f9d6c401386786f584437abbe87e58910f8a9a15386b.tar.gz
├── manifest.json
└── sha256:19485c79a9bbdca205fce4f791efeaa2a103e23431434696cc54fdd939e9198d

0 directories, 3 files

There's a manifest.json file that points to the config, the reference we pulled the image from, and the layers. Note that these values are names of files within the tarball, not necessarily digests.

$ cat crane/manifest.json | jq .
[
  {
    "Config": "sha256:19485c79a9bbdca205fce4f791efeaa2a103e23431434696cc54fdd939e9198d",
    "RepoTags": [
      "index.docker.io/library/busybox:latest"
    ],
    "Layers": [
      "7c9d20b9b6cda1c58bc4f9d6c401386786f584437abbe87e58910f8a9a15386b.tar.gz"
    ]
  }
]
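For reference, a small Python sketch of reading that file straight out of the tarball without unpacking it to disk. It assumes the member is stored under the exact name manifest.json, as in the listing above. `read_manifest` is a hypothetical helper:

```python
import io
import json
import tarfile


def read_manifest(tarball: bytes) -> list:
    """Parse manifest.json from an image tarball held in memory."""
    with tarfile.open(fileobj=io.BytesIO(tarball), mode="r:*") as tar:
        # extractfile raises KeyError if the member is missing and
        # returns None if it is not a regular file.
        member = tar.extractfile("manifest.json")
        if member is None:
            raise ValueError("manifest.json is not a regular file")
        return json.load(member)
```

Each entry of the returned list has the "Config", "RepoTags" and "Layers" keys shown above, so the config and layer blobs can be looked up by those names in the same archive.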

The config file is the normal config, from the registry:

$ cat crane/sha256\:19485c79a9bbdca205fce4f791efeaa2a103e23431434696cc54fdd939e9198d
{"architecture":"amd64","config":{"Hostname":"","Domainname":"","User":"","AttachStdin":false,"AttachStdout":false,"AttachStderr":false,"Tty":false,"OpenStdin":false,"StdinOnce":false,"Env":["PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"],"Cmd":["sh"],"ArgsEscaped":true,"Image":"sha256:758a17a836a4c09586a291c928d1f0561320e252d07c4749e14338daefe84b18","Volumes":null,"WorkingDir":"","Entrypoint":null,"OnBuild":null,"Labels":null},"container":"e30cd53834b3dfdb989c63cc73f4f31f404c7a6a0c0e9d6b9e3e8451edd72596","container_config":{"Hostname":"e30cd53834b3","Domainname":"","User":"","AttachStdin":false,"AttachStdout":false,"AttachStderr":false,"Tty":false,"OpenStdin":false,"StdinOnce":false,"Env":["PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"],"Cmd":["/bin/sh","-c","#(nop) ","CMD [\"sh\"]"],"ArgsEscaped":true,"Image":"sha256:758a17a836a4c09586a291c928d1f0561320e252d07c4749e14338daefe84b18","Volumes":null,"WorkingDir":"","Entrypoint":null,"OnBuild":null,"Labels":{}},"created":"2019-09-04T19:20:16.230463098Z","docker_version":"18.06.1-ce","history":[{"created":"2019-09-04T19:20:16.080265634Z","created_by":"/bin/sh -c #(nop) ADD file:9151f4d22f19f41b7a289e87aa9cfba3956ffd27746cb3b171b9bd2cb7e6c313 in / "},{"created":"2019-09-04T19:20:16.230463098Z","created_by":"/bin/sh -c #(nop)  CMD [\"sh\"]","empty_layer":true}],"os":"linux","rootfs":{"type":"layers","diff_ids":["sha256:6c0ea40aef9d2795f922f4e8642f0cd9ffb9404e6f3214693a1fd45489f38b44"]}}

We save the layer in its gzipped form:

$ cat crane/7c9d20b9b6cda1c58bc4f9d6c401386786f584437abbe87e58910f8a9a15386b.tar.gz | gunzip - | sha256sum
6c0ea40aef9d2795f922f4e8642f0cd9ffb9404e6f3214693a1fd45489f38b44  -
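The same check can be sketched in Python: the sha256 of the gunzipped layer should equal the matching entry in the config's rootfs.diff_ids. The helper names here are hypothetical:

```python
import gzip
import hashlib
import io


def diff_id(compressed_layer: bytes) -> str:
    """Return 'sha256:<hex>' computed over the uncompressed layer bytes."""
    digest = hashlib.sha256()
    with gzip.GzipFile(fileobj=io.BytesIO(compressed_layer)) as stream:
        # Hash in 1 MiB chunks so large layers are not held in memory twice.
        for chunk in iter(lambda: stream.read(1 << 20), b""):
            digest.update(chunk)
    return "sha256:" + digest.hexdigest()


def layer_matches_config(compressed_layer: bytes, config: dict, index: int) -> bool:
    """Compare a gzipped layer against rootfs.diff_ids[index] of the config."""
    return diff_id(compressed_layer) == config["rootfs"]["diff_ids"][index]
```

For the busybox tarball above, this would compare the single .tar.gz layer against the sha256:6c0ea4... entry in the config.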

docker

There's a lot more stuff here:

$ docker save busybox > docker.tar && mkdir docker && tar xf docker.tar -C docker

$ tree docker
docker
├── 19485c79a9bbdca205fce4f791efeaa2a103e23431434696cc54fdd939e9198d.json
├── 65836406f9479e26bb2dc27439df3efdae3c298edd1ea781dcb3ac7a7baae542
│   ├── json
│   ├── layer.tar
│   └── VERSION
├── manifest.json
└── repositories

1 directory, 6 files

There is a similar manifest.json file:

$ cat docker/manifest.json | jq .
[
  {
    "Config": "19485c79a9bbdca205fce4f791efeaa2a103e23431434696cc54fdd939e9198d.json",
    "RepoTags": [
      "busybox:latest"
    ],
    "Layers": [
      "65836406f9479e26bb2dc27439df3efdae3c298edd1ea781dcb3ac7a7baae542/layer.tar"
    ]
  }
]

The config file has the same contents, just a different name.

The "Layers" points to a layer.tar in a 65836406f9479e26bb2dc27439df3efdae3c298edd1ea781dcb3ac7a7baae542 directory.

If we look at repositories, we see that "busybox:latest" points to that directory:

$ cat docker/repositories  | jq .
{
  "busybox": {
    "latest": "65836406f9479e26bb2dc27439df3efdae3c298edd1ea781dcb3ac7a7baae542"
  }
}
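A hedged Python sketch of that relationship. It assumes docker-style layer paths of the form `<id>/layer.tar`, that the last entry of "Layers" is the topmost layer, and that each RepoTags entry splits into name and tag at its last colon. `mismatched_tags` is a hypothetical helper:

```python
def mismatched_tags(manifest, repositories):
    """Return (name, tag) pairs whose repositories id is not the top layer's directory."""
    mismatches = []
    for image in manifest:
        # Directory component of the last (topmost) layer path.
        top_dir = image["Layers"][-1].split("/", 1)[0]
        for repo_tag in image.get("RepoTags") or []:
            name, tag = repo_tag.rsplit(":", 1)
            if repositories.get(name, {}).get(tag) != top_dir:
                mismatches.append((name, tag))
    return mismatches
```

Run against the manifest.json and repositories contents shown here, this returns an empty list, because both refer to the 658364... directory.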

The contents of layer.tar are actually the same as the uncompressed 7c9d20b9b6cda1c58bc4f9d6c401386786f584437abbe87e58910f8a9a15386b.tar.gz layer from the crane tarball:

$ cat docker/65836406f9479e26bb2dc27439df3efdae3c298edd1ea781dcb3ac7a7baae542/layer.tar | sha256sum 
6c0ea40aef9d2795f922f4e8642f0cd9ffb9404e6f3214693a1fd45489f38b44  -

That json file is basically the config but with the id embedded:

$ cat docker/65836406f9479e26bb2dc27439df3efdae3c298edd1ea781dcb3ac7a7baae542/json | jq .
{
  "id": "65836406f9479e26bb2dc27439df3efdae3c298edd1ea781dcb3ac7a7baae542",
  "created": "2019-09-04T19:20:16.230463098Z",
  "container": "e30cd53834b3dfdb989c63cc73f4f31f404c7a6a0c0e9d6b9e3e8451edd72596",
  "container_config": {
    "Hostname": "e30cd53834b3",
    "Domainname": "",
    "User": "",
    "AttachStdin": false,
    "AttachStdout": false,
    "AttachStderr": false,
    "Tty": false,
    "OpenStdin": false,
    "StdinOnce": false,
    "Env": [
      "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
    ],
    "Cmd": [
      "/bin/sh",
      "-c",
      "#(nop) ",
      "CMD [\"sh\"]"
    ],
    "ArgsEscaped": true,
    "Image": "sha256:758a17a836a4c09586a291c928d1f0561320e252d07c4749e14338daefe84b18",
    "Volumes": null,
    "WorkingDir": "",
    "Entrypoint": null,
    "OnBuild": null,
    "Labels": {}
  },
  "docker_version": "18.06.1-ce",
  "config": {
    "Hostname": "",
    "Domainname": "",
    "User": "",
    "AttachStdin": false,
    "AttachStdout": false,
    "AttachStderr": false,
    "Tty": false,
    "OpenStdin": false,
    "StdinOnce": false,
    "Env": [
      "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
    ],
    "Cmd": [
      "sh"
    ],
    "ArgsEscaped": true,
    "Image": "sha256:758a17a836a4c09586a291c928d1f0561320e252d07c4749e14338daefe84b18",
    "Volumes": null,
    "WorkingDir": "",
    "Entrypoint": null,
    "OnBuild": null,
    "Labels": null
  },
  "architecture": "amd64",
  "os": "linux"
}
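Comparing this output with the config above, the per-layer json looks like the config minus "history" and "rootfs", plus an "id". A Python sketch of that observation follows. It is based only on this one single-layer busybox sample, and the key list and the name `legacy_layer_json` are assumptions rather than anything taken from docker's code:

```python
def legacy_layer_json(config: dict, layer_id: str) -> dict:
    """Derive a v1-style per-layer json from an image config (sketch)."""
    # Keys observed in the busybox sample above; other images may carry
    # additional fields that this sketch does not know about.
    kept = (
        "created",
        "container",
        "container_config",
        "docker_version",
        "config",
        "architecture",
        "os",
    )
    legacy = {"id": layer_id}
    legacy.update({key: config[key] for key in kept if key in config})
    return legacy
```

Key order differs from docker's output, so a byte-for-byte comparison with the real file would not be meaningful; comparing the parsed objects would be.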
