[reprounzip-docker] copy extracted DATA directory instead of data.tgz #274

kaczmarj · 2017-09-21T15:48:15Z

Hello,

Regarding reprounzip docker, you can achieve smaller image sizes by copying over the untarred DATA directory instead of copying and then extracting data.tgz. The image shrinks by exactly the size of data.tgz. Is there a reason to have data.tgz inside the image?

Also, all of the COPY instructions can be merged into one, because they copy into the same directory (/). The Dockerfile could look like this:

FROM debian:stretch
COPY busybox DATA rpz-files.list rpzsudo /
RUN chmod +x /busybox /rpzsudo

If this looks OK, I would be more than happy to submit a PR.

Thanks,
Jakub


remram44 · 2017-09-21T16:33:20Z

Hi and thanks for looking into this!

Basically the semantics I want:

Copy the files from the TAR into the image (with UNIX permissions/ownership)
Replace existing files with the ones from the image
For directories that exist in both, use the UNIX permissions from the TAR
If overwriting with a different type (eg extracting a file over a directory, or a directory over a file) use the version from the image

I have been struggling a long time with the extraction of data in the image. I have changed the tar flags multiple times, and there are still issues, like #145.

One set of flags I was using before correctly merged the files from the image with the files from the TAR, but failed if overwriting a directory with a file (as can happen when unpacking the Fedora tree structure over the Debian one). The current flags (11929b3) don't fail in that case, but tend to remove existing files from directories when extracting files into them.

I actually considered writing the Docker image manually so that I have better control over this. Instead of writing out a Dockerfile and running it, just assembling an image tar or container tar, and loading it with docker load or docker import.

The issue I see with your own approach is that file and directory ownership would not be carried over. Permissions can also be lost if you are doing this on Windows, since you are round-tripping through the Windows file system that doesn't support them.

Indeed it is very unfortunate that the data.tgz still takes unnecessary space as an image layer. I will probably go with writing out a container tar for docker import when I have time, but I am open to all suggestions.
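
A minimal sketch of the container-tar idea, assuming the data archive keeps everything under a top-level DATA/ directory (as the DATA directory and data.tgz discussed above suggest). The function name write_rootfs_tar is hypothetical, and this is not how reprounzip is implemented. Ownership and permissions are carried inside each tar entry, so they never pass through the host filesystem:

```python
import tarfile


def write_rootfs_tar(src, dst, prefix="DATA"):
    """Re-root a data archive so its paths are relative to the image root.

    `src` and `dst` are binary file objects. Each entry's mode, uid, gid and
    mtime are copied as-is from the source archive. `prefix` is an assumption
    about how the source archive is laid out.
    """
    strip = prefix + "/"
    # GNU format is used so that stale pax "path" headers from the source
    # archive cannot override the rewritten names (pax xattrs are dropped).
    with tarfile.open(fileobj=src, mode="r:*") as tin, \
            tarfile.open(fileobj=dst, mode="w", format=tarfile.GNU_FORMAT) as tout:
        for member in tin:
            if not member.name.startswith(strip):
                continue  # the prefix directory itself, or unrelated entries
            member.name = member.name[len(strip):]
            # Hard links refer to their target by archive path; re-root it too.
            if member.islnk() and member.linkname.startswith(strip):
                member.linkname = member.linkname[len(strip):]
            data = tin.extractfile(member) if member.isreg() else None
            tout.addfile(member, data)
```

The resulting tar could then be handed to docker import. The sketch only covers the re-rooting step; merging with the files of a base image according to the semantics listed above is left out.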

kaczmarj · 2017-09-21T19:15:27Z

Thanks for your detailed reply. The COPY instruction should preserve metadata (including permissions?) but I did not consider the effects of round-tripping through the host. With that and your desired semantics in mind, I will test a few possibilities.

On a related note, I am working on an alternative method to minimize Docker containers with neurodocker reprozip-trace. At the moment, that command (1) creates a miniconda environment that has reprozip, (2) runs reprozip trace on an arbitrary number of commands, (3) runs reprozip pack, and (4) copies the pack file onto the host. The user would use reprounzip docker to create a new Docker image from that pack file.

The alternative I am working on is to remove all of the files within the running container that were not caught by the trace, and then docker export that container to squash it into a newly minimized image. In my opinion, this would give the smallest image possible. I might do what reprounzip docker does and install /busybox. I am still in the process of testing this, but how does that sound to you?
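
A rough sketch of the pruning step, assuming the trace yields a set of absolute in-container paths and that the pruning runs on a directory tree holding the container's filesystem (on a live container, /proc, /sys and /dev would need to be skipped). The name prune_untraced is hypothetical and is not part of neurodocker or ReproZip:

```python
import os


def prune_untraced(root, traced):
    """Delete everything under `root` that is not listed in `traced`.

    `traced` holds absolute paths as seen inside the container, for example
    "/usr/bin/python". Directories are removed only when they end up empty
    and were not traced themselves. Returns the removed in-container paths.
    """
    keep = {os.path.normpath(p) for p in traced}
    removed = []

    def inside(full):
        # Translate a host path under `root` to the in-container path.
        return "/" + os.path.relpath(full, root).replace(os.sep, "/")

    # Walk bottom-up so a directory is visited after its contents.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if inside(full) not in keep:
                os.remove(full)
                removed.append(inside(full))
        for name in dirnames:
            full = os.path.join(dirpath, name)
            if inside(full) in keep:
                continue
            if os.path.islink(full):
                # Symlinks to directories are listed here but never entered.
                os.remove(full)
                removed.append(inside(full))
            elif not os.listdir(full):
                os.rmdir(full)
                removed.append(inside(full))
    return removed
```

This only works if the traced set also names every symlink that a traced path goes through (such as a /lib that points elsewhere); otherwise those links would be deleted and the remaining files would become unreachable.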

remram44 · 2017-09-21T19:24:24Z

I think the best solution for both of us is to write out a Docker container TAR directly for docker import, you by going through the config file and copying from the old image to the new TAR, and me by extracting from the RPZ.

kaczmarj · 2017-09-21T19:25:38Z

That sounds good. I will keep you updated on my progress.

remram44 · 2017-09-21T19:29:28Z

Also note that being able to trace Docker images is something we are interested in! Tracing without installing ReproZip inside the container should be possible (but not super straightforward).

I do not have time to work on it now unfortunately, as other matters are more pressing (but less fun). However should you find a reliable method to trace Docker images it is definitely something we'll want to support and distribute as part of ReproZip.

kaczmarj · 2017-09-22T13:35:21Z

Can you explain a little bit how one could trace inside a container without installing ReproZip? I might be able to work on this.

remram44 · 2017-09-22T14:43:52Z

The processes in the container's PID namespace also exist as processes on the host, so you can attach to them from the outside. However you would need the application to wait for ReproZip to attach before it starts.

I imagine something like this:

1. Create the container.
2. Inject a little dummy executable that STOPs (or does PTRACE_TRACEME, like normal ReproZip does) before calling the actual entrypoint.
3. Run the container (starting with the dummy).
4. Attach ReproZip to process 1 of the container (the dummy) and send it a CONTINUE signal.
5. Trace the application as normal, being aware that the filesystem is different (paths used by the container refer to the filesystem inside the container, not the host where ReproZip is running).

Packing would be a different process, reading files from the original image instead of the filesystem.

There might even be a way to have the ReproZip tracer itself be in a (separate) container, putting it in the same PID namespace as the container we want to trace using docker run --pid=container:othercontainername
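
The dummy executable described above could be sketched roughly like this. It is an illustration only: the name stop_then_exec is hypothetical, and it uses the STOP variant rather than PTRACE_TRACEME. One assumption needs checking before relying on it: Linux handles signals for the first process of a PID namespace specially, so a SIGSTOP that the container's process 1 sends to itself may not take effect, in which case the stub would have to run under a small init or use PTRACE_TRACEME instead.

```python
import os
import signal


def stop_then_exec(argv):
    """Stop the current process, then replace it with `argv` once resumed.

    SIGSTOP cannot be caught or ignored by an ordinary process, so the
    process stays stopped until something else (here: the tracer, once it
    has attached) sends SIGCONT. `argv` is the real entrypoint and its
    arguments.
    """
    os.kill(os.getpid(), signal.SIGSTOP)
    # Execution resumes here after SIGCONT; hand over to the real entrypoint.
    os.execvp(argv[0], argv)
```

Because the stub replaces itself with exec, the real entrypoint keeps the same PID, so a tracer that attached while the stub was stopped would keep following the application.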

remram44 · 2017-09-22T14:48:53Z

Also note that if using things like docker-machine or "docker native" the "host" is the VM, which would make this a bit annoying (unless ReproZip is also running in a container).

I have most of the code for this and might take a shot at it when I find the time, of course you are welcome to try and make sense of my code 😅

kaczmarj · 2017-09-23T19:12:40Z

Thanks, that makes sense to me. Is the code you wrote in a branch in this project?

The Docker documentation for --pid option is really helpful.

kaczmarj · 2017-09-25T13:00:38Z

Have you considered using the ADD instruction for data.tgz? That instruction will extract the .tgz automatically. I tested this and the image size delta (between using COPY data.tgz and ADD data.tgz) is the size of data.tgz. But I haven't tested this further, so I do not know what happens to file permissions and whether existing files/directories are overwritten.

remram44 · 2017-09-25T14:13:30Z

I think ADD didn't behave properly when the permissions for directories differ between the tar and the container, and when the same path is a file and a directory. Also we can't rely on Docker's command behavior staying consistent across versions...

kaczmarj · 2017-10-02T20:22:46Z

Hi @remram44 - can you share the code you have to run a trace without having reprozip installed? I'd like to have a go at it.

The specification for CircleCI 2.0 is stored in `.circleci/config.yml` instead of in `circle.yml`.

Todo:
- Run tests. For now, images are built and pushed, but no tests are run.
- Minimize containers with neurodocker reprozip. This functionality will be updated soon. See discussion in VIDA-NYU/reprozip#274.

remram44 · 2017-10-03T15:13:48Z

Hi @kaczmarj, unfortunately I have no code for this yet. This would be a change in the beginning of the tracer (instead of current fork_and_trace()) + some wrapper for the Docker container (to wait for ReproZip to attach).

kaczmarj · 2017-10-03T15:15:14Z

I misunderstood you when you said to try to make sense of your code :)

kaczmarj · 2018-08-10T12:50:00Z

Using multi-stage builds would fix the original issue (that the .tar.gz file is still counted towards image size after decompression). A pseudo-Dockerfile would look like this:

FROM debian:stretch
COPY data.tar.gz /
RUN tar xzf /data.tar.gz -C / --strip-components 1
RUN rm /data.tar.gz
FROM debian:stretch
COPY --from=0 / /

The new pieces of the Dockerfile are the last three lines. The second FROM starts a fresh build stage, and COPY --from=0 copies the root filesystem of the first stage into it. This copies all of the necessary files, but data.tar.gz no longer counts towards the final image size.

If this looks OK to you, @remram44, I would be happy to submit a PR.

remram44 · 2018-08-12T23:30:05Z

I will look into it.

Do you think it would be possible to use a "reprounzip" docker image as the other container, so that anyone can build a Docker image from an RPZ like this? (apologies for the syntax, I have never used multi-stage builds, hopefully you get the idea)

FROM reprounzip
COPY experiment.rpz /data
RUN reprounzip-docker-setup /data/experiment.rpz /output

FROM scratch
COPY --from=0 /output /

kaczmarj · 2018-08-13T21:01:48Z

Yes, I think that would work. Bootstrapping from scratch is also a good idea, but the contents of /output will probably have to be present in the root directory. Maybe busybox can be copied as well, and the contents of /output can be moved up to / with busybox mv.

remram44 added the labels C-unpackers/docker (Component: The Docker unpacker) and T-enhancement (Type: An enhancement to existing code, or a new feature) on Sep 21, 2017

remram44 added this to the 1.1 milestone Sep 21, 2017

kaczmarj changed the title ~~[reprounzip-docker] copy extraced DATA directory instead of data.tgz~~ [reprounzip-docker] copy extracted DATA directory instead of data.tgz Sep 21, 2017

kaczmarj mentioned this issue Oct 2, 2017

ENH: generate Dockerfiles with neurodocker + migrate to CircleCI 2.0 nipy/nipype#2202

Merged

7 tasks

remram44 mentioned this issue Jul 1, 2018

Try and make a Docker image TAR from reprounzip without talking to Docker #307

Open

remram44 mentioned this issue Apr 15, 2021

Limit data movement during build VIDA-NYU/reproserver#42

Closed
