[reprounzip-docker] copy extracted DATA directory instead of data.tgz #274

Open
kaczmarj opened this issue Sep 21, 2017 · 17 comments
Labels
C-unpackers/docker Component: The Docker unpacker T-enhancement Type: En enhancement to existing code, or a new feature

@kaczmarj

Hello,

Regarding reprounzip docker, you can achieve smaller image sizes by copying over the untarred DATA directory instead of copying and then extracting data.tgz. The difference in image size equals the size of data.tgz. Is there a reason to keep data.tgz inside the image?

Also, all of the COPY instructions can be merged into one, because they copy into the same directory (/). The Dockerfile could look like this:

FROM debian:stretch
COPY busybox DATA rpz-files.list rpzsudo /
RUN chmod +x /busybox /rpzsudo

If this looks OK, I would be more than happy to submit a PR.

Thanks,
Jakub

@remram44 remram44 added C-unpackers/docker Component: The Docker unpacker T-enhancement Type: En enhancement to existing code, or a new feature labels Sep 21, 2017
@remram44
Member

Hi and thanks for looking into this!

Basically the semantics I want:

  • Copy the files from the TAR into the image (with UNIX permissions/ownership)
  • Replace existing files with the ones from the image
  • For directories that exist in both, use the UNIX permissions from the TAR
  • If overwriting with a different type (e.g., extracting a file over a directory, or a directory over a file), use the version from the image

I have been struggling a long time with the extraction of data in the image. I have changed the tar flags multiple times, and there are still issues, like #145.

One set of flags I was using before correctly merged the files from the image with the files from the TAR, but failed when overwriting a directory with a file (as can happen when unpacking the Fedora tree structure over the Debian one). The current flags (11929b3) don't fail in that case, but tend to remove existing files from directories when extracting files into them.

I actually considered building the Docker image manually so that I have better control over this: instead of writing out a Dockerfile and running it, I would assemble an image tar or container tar and load it with docker load or docker import.

The issue I see with your own approach is that file and directory ownership would not be carried over. Permissions can also be lost if you are doing this on Windows, since you are round-tripping through the Windows file system that doesn't support them.

Indeed it is very unfortunate that the data.tgz still takes unnecessary space as an image layer. I will probably go with writing out a container tar for docker import when I have time, but I am open to all suggestions.
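
As a rough illustration of the container-tar idea discussed above (a sketch, not ReproZip's implementation): the archive can be re-packed in memory so that the leading DATA/ component is dropped while ownership and permission bits stay in the tar headers and never pass through the host file system. The function name strip_prefix_tar and the default prefix are assumptions made for this example.

```python
import io
import tarfile


def strip_prefix_tar(src_bytes: bytes, prefix: str = "DATA/") -> bytes:
    """Re-pack a tar archive, dropping `prefix` from member names (hypothetical helper)."""
    out = io.BytesIO()
    with tarfile.open(fileobj=io.BytesIO(src_bytes), mode="r:*") as src, \
            tarfile.open(fileobj=out, mode="w") as dst:
        for member in src:
            if not member.name.startswith(prefix):
                continue  # skips the top-level directory entry and anything outside it
            member.name = member.name[len(prefix):]
            # Hard links store their target as an archive path, so strip that too.
            if member.islnk() and member.linkname.startswith(prefix):
                member.linkname = member.linkname[len(prefix):]
            # Stale pax records would otherwise override the new names on write.
            member.pax_headers.pop("path", None)
            member.pax_headers.pop("linkpath", None)
            # uid, gid and mode travel with the TarInfo header, untouched.
            data = src.extractfile(member) if member.isreg() else None
            dst.addfile(member, data)
    return out.getvalue()
```

The resulting bytes could then be piped to docker import - on standard input. Note that an imported image contains only what is in the tar, so merging with the files of a base image would still have to be handled separately.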

@remram44 remram44 added this to the 1.1 milestone Sep 21, 2017
@kaczmarj kaczmarj changed the title [reprounzip-docker] copy extraced DATA directory instead of data.tgz [reprounzip-docker] copy extracted DATA directory instead of data.tgz Sep 21, 2017
@kaczmarj
Author

Thanks for your detailed reply. The COPY instruction should preserve metadata (including permissions?) but I did not consider the effects of round-tripping through the host. With that and your desired semantics in mind, I will test a few possibilities.

On a related note, I am working on an alternative method to minimize Docker containers with neurodocker reprozip-trace. At the moment, that command (1) creates a miniconda environment that has reprozip, (2) runs reprozip trace on an arbitrary number of commands, (3) runs reprozip pack, and (4) copies the pack file onto the host. The user would use reprounzip docker to create a new Docker image from that pack file.

The alternative I am working on is to remove all of the files within the running container that were not caught by the trace, and then docker export that container to squash it into a newly minimized image. In my opinion, this would give the smallest image possible. I might do what reprounzip docker does and install /busybox. I am still in the process of testing this, but how does that sound to you?
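
The pruning step described above could be prototyped along these lines. This is only a sketch: it assumes the trace has already been reduced to a collection of absolute in-container file paths (the traced argument, a hypothetical input), it assumes a POSIX host, and it only reports candidates rather than deleting anything.

```python
import os
from typing import Iterable, List


def untraced_files(root: str, traced: Iterable[str]) -> List[str]:
    """List files under `root` whose in-container path is not in `traced`."""
    keep = set(traced)
    extra = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Express the path as the container would see it, e.g. "/usr/bin/env".
            inside = "/" + os.path.relpath(full, root)
            if inside not in keep:
                extra.append(inside)
    return sorted(extra)
```

Empty directories and symbolic links to directories are not considered here; a real implementation would need to decide what to do with them before running docker export.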

@remram44
Member

I think the best solution for both of us is to write out a Docker container TAR directly for docker import, you by going through the config file and copying from the old image to the new TAR, and me by extracting from the RPZ.

@kaczmarj
Author

That sounds good. I will keep you updated on my progress.

@remram44
Member

Also note that being able to trace Docker images is something we are interested in! Tracing without installing ReproZip inside the container should be possible (but not super straightforward).

I do not have time to work on it now unfortunately, as other matters are more pressing (but less fun). However should you find a reliable method to trace Docker images it is definitely something we'll want to support and distribute as part of ReproZip.

@kaczmarj
Author

Can you explain a little bit how one could trace inside a container without installing ReproZip? I might be able to work on this.

@remram44
Member

The processes in the container's PID namespace also exist as processes on the host, so you can attach to them from the outside. However, you would need the application to wait for ReproZip to attach before it starts.

I imagine something like this:

  • Create container
  • Inject a little dummy executable that STOPs (or does PTRACE_TRACEME, like normal ReproZip does) before calling the actual entrypoint
  • Run container (starting with dummy)
  • Attach ReproZip to process 1 of container (dummy), send it CONTINUE signal
  • Trace application as normal, being aware that the filesystem is different (paths used by the container refer to inside the container, not the host where ReproZip is running)
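
Two of the steps in the list above, finding the container's process 1 from the host and translating container paths, can be sketched as follows. This is an illustration under assumptions, not ReproZip code: the helper names container_init_pid and host_path are hypothetical, the docker CLI is assumed to be on the PATH, and the host is assumed to be the Linux machine that actually runs the container.

```python
import subprocess
from pathlib import Path


def container_init_pid(container: str) -> int:
    """Ask Docker for the host PID of the container's process 1."""
    out = subprocess.check_output(
        ["docker", "inspect", "--format", "{{.State.Pid}}", container],
        text=True,
    )
    return int(out.strip())


def host_path(host_pid: int, container_path: str) -> Path:
    """Map a path as seen inside the container to one reachable from the host."""
    # /proc/<pid>/root exposes the filesystem as that process sees it.
    return Path("/proc") / str(host_pid) / "root" / container_path.lstrip("/")
```

For example, host_path(container_init_pid("mycontainer"), "/etc/passwd") would point at the container's /etc/passwd (the container name is made up). Reading through /proc/<pid>/root normally requires sufficient privileges, and symbolic links with absolute targets are not handled by this sketch.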

Packing would be a different process, reading files from the original image instead of the filesystem.

There might even be a way to have the ReproZip tracer itself be in a (separate) container, putting it in the same PID namespace as the container we want to trace using docker run --pid=container:othercontainername

@remram44
Member

Also note that if using things like docker-machine or "docker native" the "host" is the VM, which would make this a bit annoying (unless ReproZip is also running in a container).

I have most of the code for this and might take a shot at it when I find the time. Of course, you are welcome to try and make sense of my code 😅

@kaczmarj
Author

Thanks, that makes sense to me. Is the code you wrote in a branch in this project?

The Docker documentation for the --pid option is really helpful.

@kaczmarj
Author

Have you considered using the ADD instruction for data.tgz? That instruction will extract the .tgz automatically. I tested this and the image size delta (between using COPY data.tgz and ADD data.tgz) is the size of data.tgz. But I haven't tested this further, so I do not know what happens to file permissions and whether existing files/directories are overwritten.
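
The size difference reported above follows from how image layers add up: each instruction that changes the filesystem adds a layer, and deleting a file in a later layer does not shrink an earlier one. A toy calculation, with made-up sizes chosen only for illustration:

```python
def image_size(layer_sizes_mb):
    """Total image size is the sum of its layers; later deletions free nothing."""
    return sum(layer_sizes_mb)


ARCHIVE_MB = 300      # hypothetical size of data.tgz
EXTRACTED_MB = 1000   # hypothetical size of the extracted files

# COPY data.tgz followed by RUN tar: the archive layer stays in the image.
copy_then_extract = image_size([ARCHIVE_MB, EXTRACTED_MB])
# ADD data.tgz: only the extracted files end up in a layer.
add_only = image_size([EXTRACTED_MB])

# The difference between the two approaches equals ARCHIVE_MB.
print(copy_then_extract - add_only)
```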

@remram44
Member

I think ADD didn't behave properly when the permissions for directories differ between the tar and the container, and when the same path is a file and a directory. Also we can't rely on Docker's command behavior staying consistent across versions...
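
The two problem cases mentioned above (directory permissions that differ, and a path that is a file on one side and a directory on the other) can at least be detected ahead of extraction. The following is a minimal sketch under assumptions: find_conflicts is a hypothetical helper, root is a directory holding a copy of the image's filesystem, and only the directory versus non-directory distinction is checked.

```python
import os
import stat
import tarfile
from typing import Iterator, Tuple


def find_conflicts(tar_path: str, root: str) -> Iterator[Tuple[str, str]]:
    """Yield (member name, reason) for tar entries that clash with files under root."""
    with tarfile.open(tar_path) as tar:
        for member in tar:
            target = os.path.join(root, member.name.lstrip("/"))
            try:
                st = os.lstat(target)
            except (FileNotFoundError, NotADirectoryError):
                continue  # nothing at that exact path, so nothing to compare
            existing_is_dir = stat.S_ISDIR(st.st_mode)
            if member.isdir() != existing_is_dir:
                yield member.name, "type"
            elif member.isdir() and stat.S_IMODE(st.st_mode) != member.mode & 0o7777:
                yield member.name, "mode"
```

How each reported conflict should be resolved is a separate policy decision; the sketch only lists them.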

@kaczmarj
Author

kaczmarj commented Oct 2, 2017

Hi @remram44 - can you share the code you have to run a trace without having reprozip installed? I'd like to have a go at it.

kaczmarj pushed a commit to kaczmarj/nipype that referenced this issue Oct 2, 2017
The specification for CircleCI 2.0 is stored in `.circleci/config.yml` instead of in `circle.yml`.

Todo:
- Run tests. For now, images are built and pushed, but no tests are run.
- Minimize containers with neurodocker reprozip. This functionality will be updated soon. See discussion in VIDA-NYU/reprozip#274.
@remram44
Member

remram44 commented Oct 3, 2017

Hi @kaczmarj, unfortunately I have no code for this yet. This would be a change at the beginning of the tracer (instead of the current fork_and_trace()) + some wrapper for the Docker container (to wait for ReproZip to attach).

@kaczmarj
Author

kaczmarj commented Oct 3, 2017

I misunderstood you when you said to try to make sense of your code :)

@kaczmarj
Author

kaczmarj commented Aug 10, 2018

Using multi-stage builds would fix the original issue (that the .tar.gz file is still counted towards image size after decompression). A pseudo-Dockerfile would look like this:

FROM debian:stretch
COPY data.tar.gz /
# Extract into /, dropping the leading path component of each entry
RUN tar xzf /data.tar.gz -C / --strip-components 1
RUN rm /data.tar.gz

FROM debian:stretch
COPY --from=0 / /

The new pieces of the Dockerfile are the last three instructions (RUN rm, the second FROM, and COPY --from=0). The second FROM starts a fresh build stage from the base image, and COPY --from=0 copies the root filesystem over from the first stage. This brings along all of the necessary files, but data.tar.gz no longer counts towards the size of the final image.

If this looks OK to you, @remram44, I would be happy to submit a PR.

@remram44
Member

I will look into it.

Do you think it would be possible to use a "reprounzip" docker image as the other container, so that anyone can build a Docker image from an RPZ like this? (apologies for the syntax, I have never used multi-stage builds, hopefully you get the idea)

FROM reprounzip
COPY experiment.rpz /data/
RUN reprounzip-docker-setup /data/experiment.rpz /output

FROM scratch
COPY --from=0 /output /

@kaczmarj
Author

Yes, I think that would work. Bootstrapping from scratch is also a good idea, but the contents of /output will probably have to be present in the root directory. Maybe busybox can be copied as well, and the contents of /output can be moved up to / with busybox mv.
