
The need for speed #707

Closed
betatim opened this issue Jun 19, 2019 · 22 comments

Comments

@betatim
Member

betatim commented Jun 19, 2019

This issue is about collecting ideas that could make the images produced by repo2docker smaller, faster to push/pull or faster to build.

I envision this thread to be a meta thread with lots of ideas that then either turn into PRs or get marked as "already done" or "not possible". This way we can use it as an entry point to finding other related issues or PRs.

Why make images smaller and faster to build? From our own experience and the first results from the binder user survey, it is clear that faster builds and faster launches are something people really care about.


Smaller images

A few ideas via https://jcrist.github.io/conda-docker-tips.html

  • don't use MKL: we already use conda-forge as default -> openblas is our default, no more gains possible
  • run conda clean -afy: already implemented
  • --freeze-installed: not currently used, unsure if it would help, worth trying
  • remove additional unnecessary files: we should do this (see the sketch after this list)
  • use a smaller base image: not applicable as the ubuntu base we use should be present on all nodes in a BinderHub cluster -> this is "free" as it doesn't need pulling or pushing. Double check this is actually true
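
A minimal sketch of those cleanup steps combined into a single RUN, so the deleted files are never committed in a layer of their own. The ${CONDA_DIR} prefix and the example package are placeholders; this is illustrative rather than what repo2docker currently emits:

# Install, clean conda caches, and strip files that are not needed at
# runtime, all in one layer so the removed files never reach the image.
RUN conda install --yes --freeze-installed numpy \
 && conda clean -afy \
 && find ${CONDA_DIR} -follow -type f -name '*.a' -delete \
 && find ${CONDA_DIR} -follow -type f -name '*.pyc' -delete \
 && find ${CONDA_DIR} -follow -type f -name '*.js.map' -delete

Doing the removal in a later, separate RUN would not shrink the image, because the files would already exist in an earlier layer.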

Reordering build steps for faster builds

Right now we have to rebuild the whole image (from the point onwards where we copy in the contents of the repo) even if the user only fixes a typo in the README. The reasoning behind this is that a requirements.txt could contain something like -e ., which leads to the setup.py in the repo being executed. This in turn means the setup process could execute anything and depend on everything in the repo. There is no way of knowing that the one character change in the README won't change the build result.

However, I think this is a fringe case; the common case is that people only install packages from PyPI and don't depend on the rest of the repository. How can we make this common case faster and still get the rare case right?

The obvious way to speed up builds and rebuilds is to copy only the requirements.txt into the image, run the install step, and then copy over the rest of the repository. This way a change in the README won't break the Docker layer cache, which means rebuilds are fast.

One thing we could try is to copy the requirements.txt early, run pip install -r requirements.txt wrapped in an "if this fails, just continue" block, then copy the full repo and rerun pip install -r requirements.txt, which will either be a no-op (if the early run succeeded) or will clean up the breakage from the first run.
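
A minimal sketch of that two-pass idea as Dockerfile steps, reusing the ${REPO_DIR} placeholder from the generated Dockerfile; it illustrates the approach rather than the exact output repo2docker produces:

# First pass: only requirements.txt is in this layer, so it stays cached
# when unrelated files in the repo change. Ignore failures (e.g. "-e ."
# needing the full repo) and fix them up after the full copy.
COPY requirements.txt ${REPO_DIR}/requirements.txt
RUN pip install -r ${REPO_DIR}/requirements.txt || true

COPY . ${REPO_DIR}

# Second pass: a cheap no-op if the first pass succeeded, otherwise it
# installs whatever needed the rest of the repository.
RUN pip install -r ${REPO_DIR}/requirements.txt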

We invented install.R so we could declare it a mistake to rely on anything in the repository. This means we can copy it over early. This would save a large amount of user pain because R builds are some of the slowest builds we have. (see #716)

For environment.yml I am not sure if you can install things from a local directory or not. In either case we could treat it like the requirements.txt case (try, ignore errors, retry).


Overhead from docker layers

One thing I was wondering is whether an image of the same size and content that has 100 layers (say one per file added) has more pull overhead than one that consists of only a single layer. From watching docker pull it seems there are various steps that happen after a layer has been pulled (checksum, unpacking) that could be saved by reducing the number of layers.
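
One way to measure this would be to build two images with identical content but different layer counts and compare pull times. A toy sketch of the two variants (the package is just an example):

# Many-layer variant: each step becomes a layer that is pulled,
# checksummed and unpacked separately.
RUN apt-get update
RUN apt-get install -y --no-install-recommends less
RUN rm -rf /var/lib/apt/lists/*

# Single-layer variant: the same work committed as one layer.
RUN apt-get update \
 && apt-get install -y --no-install-recommends less \
 && rm -rf /var/lib/apt/lists/*

docker history <image> lists the resulting layers and their sizes, which makes the two variants easy to compare.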

@betatim
Member Author

betatim commented Jun 19, 2019

We could look at our launch stats to find the most frequently launched repositories and then integrate a "trigger build on binder" step into their CI setup, so a fresh image is built when they merge something to master.

@betatim
Member Author

betatim commented Jun 20, 2019

Some more ideas on where to potentially shave off some megabytes: https://simonwillison.net/2018/Nov/19/smaller-python-docker-images/

At some point we should check if we have gotten to the point where we are now chasing single digit MB improvements and stop.

@saulshanabrook

One thing we could try is to copy the requirements.txt early, run pip install -r requirements.txt wrapped in an "if this fails, just continue" block, then copy the full repo and rerun pip install -r requirements.txt, which will either be a no-op (if the early run succeeded) or will clean up the breakage from the first run.

This makes sense to me.

@betatim
Member Author

betatim commented Jun 20, 2019

This makes sense to me.

Want to give that a go in an experimental PR to give us something to try out? :)

@manics
Member

manics commented Jun 21, 2019

If your image is on Docker Hub, Microbadger gives a nice visualisation of the size of each layer, e.g. https://microbadger.com/images/jupyter/tensorflow-notebook

@betatim
Member Author

betatim commented Jun 21, 2019

Thanks for the microbadger link! I've been using https://github.com/wagoodman/dive to look at local images and poke around the filesystem to find files that could be removed.

@psychemedia

This seems like voodoo magic to me - https://github.com/docker-slim/docker-slim - but I pass it on anyway, just in case...

Just in case anyone stumbles across this thread looking for other tools, https://www.fromlatest.io/#/ will have a go at profiling a Dockerfile and point out ways of optimising it if it thinks it can...

@TomAugspurger

What's the policy on which packages are included in the base environment.yml? For example, nbconvert (via conda, not pip IIRC) pulls in pandoc and pandoc-citeproc, which are ~85 MB extracted (compared to libpython, which is 16 MB).

On nbconvert specifically, is pandoc required? Do people using pip to manage their requirements complain that it's not present by default? conda-forge/nbconvert-feedstock#24 is the nbconvert-feedstock issue. Perhaps we could split that into an nbconvert-core feedstock, which doesn't have pandoc, and have nbconvert depend on it.

@betatim
Member Author

betatim commented Jun 25, 2019

Investigated whether we can use the trick from #716 (pre-assemble) with environment.yml and the answer is no. An environment.yml like:

name: stats
dependencies:
  - numpy
  - pandas
  - pip:
    - -r requirements.txt

is legal. This means there could be users who rely on the contents of the repo being available.

A potential way out is to try to install the environment.yml before copying over the repo (speculating that most users don't rely on this feature), rolling back on failure and retrying after the copy. Something like:

COPY environment.yml environment.yml
RUN conda env create -f environment.yml || true
COPY . /home/jovyan
RUN conda env create -f environment.yml

This assumes the second conda run is "free", or a no-op, if the first one succeeded (which might not be true).

@betatim
Member Author

betatim commented Jun 25, 2019

We have nbconvert because it is a dependency of the notebook. I think we have discussed (in a previous issue) if we could make the notebook not show some of the "Save as" options that become unavailable if you install nbconvert-light (aka nbconvert-only-to-html). I think I'd be -1 on having the UI elements available but the functionality being broken because we install a partial nbconvert.

The nbconvert files probably exist in the layer of the docker image which is shared amongst all images built by the same version of repo2docker. So at least in the context of mybinder.org they are "free" (or at least somewhat less expensive). Users requesting different versions of Python makes this image layer sharing less clear cut :-/

@saulshanabrook

Have you tried using the CNCF Buildpack project? It seems aligned in that it lets you build different parts of your app separately and then combine the layers without having to rebuild everything.

@betatim
Member Author

betatim commented Jun 25, 2019

Have you tried using the CNCF Buildpack project?

We've discussed it but I can't find the issue again. "buildpack" turns out to be a terrible search word in r2d land :-/

@jchesterpivotal

jchesterpivotal commented Jun 25, 2019

@betatim from what I can tell, Cloud Native Buildpacks were touched on here: #487 (comment)

(Edit: more accurately, the pack CLI was touched on, but note that it represents only part of the buildpacks.io effort)

From an outsider's squinting perspective, I feel like CNBs are trying to solve the same class of problems as r2d: efficient, automated, composable builds.

@saulshanabrook

I am chatting with buildpack folks on https://slack.buildpacks.io/ if anyone is interested (cc @yuvipanda)

@yuvipanda
Collaborator

repo2docker used to be based off s2i, which is (IIRC) very similar in architecture to Cloud Native BuildPacks. We moved away from it for reasons listed in http://words.yuvi.in/post/why-not-s2i/.

Maybe things are different now? I'd say that should be in a different issue than here, though. This issue should be focused on performance improvements we can make with what we have right now.

@jchesterpivotal

jchesterpivotal commented Jun 26, 2019

By way of warning, what follows is hilariously biased: I've worked on two generations of buildpack technology at several points over the past 5 years. Pride makes me defensive.

As it was related to me by a Red Hatter I asked, s2i was created largely because the previous generations of buildpack lifecycles from Heroku (v2a) and Cloud Foundry (v2b) were optimised to a rootfs+tarball target (Heroku's term is "slug", Cloud Foundry's is "droplet"). That was considered unsuitable for OpenShift v3, which was an image-centric architecture.

Whereas Heroku and Cloud Foundry would meet you at the code and hide the underlying container infrastructure, OpenShift would meet you at the image, so the latter (this is a personal opinion) had a business need for something like buildpacks to reduce the convenience gap.

But s2i never really found a home outside of OpenShift, while buildpacks have flourished in two massive, independent but genetically-related ecosystems.

Critically, the emergence of the v2 registry API enables features (particularly layer rebasing) that were previously impossible. In addition, Google's Container Tools team develops and maintains the go-containerregistry library, which allows us to perform construction and rebasing operations with or without the docker daemon. The design of CNBs takes full advantage of both of these advances.

By way of speed improvements: We have observed some Java rebuilds drop from minutes to milliseconds. We expect large-cluster rollouts to drop from dozens of hours to potentially minutes.

Edit: I should add, your reasons for moving off s2i would apply to v2a and v2b buildpack lifecycles as well. One of the motivating problems faced by both Pivotal and Heroku has been exactly this sort of combinatorial explosion; CNBs are designed to make it possible to more easily compose buildpacks developed independently of one another.

@yuvipanda
Collaborator

Thank you for chiming in, @jchesterpivotal! I very much appreciate the context, and the important new information about v3 of buildpacks :)

CNBs are designed to make it possible to more easily compose buildpacks developed independently of one another.

This gives me hope :) I've opened #720 to continue discussion about buildpacks v3, including a test case that'll help us see if we can base repo2docker off it.

@betatim
Member Author

betatim commented Jul 11, 2019

Another idea via @gaborbernat at EuroPython: when we do decide that a requirements.txt doesn't need the repo contents, we could reorder the packages in it before running the install, to increase (a bit) the chances that two repositories with the same dependencies share a cache layer.

Same goes for environment.yml.

I am not sure whether the standard for either file assigns any meaning to the order in which packages are listed. That is something to check.
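
If order turns out to be insignificant, the normalisation on the repo2docker side could look roughly like the sketch below. The function name and behaviour are made up for illustration; repo2docker does not currently do this:

def normalized_requirements(path="requirements.txt"):
    """Drop comments and blank lines and sort the remaining specifiers,
    so two repos listing the same dependencies in a different order end
    up with byte-identical files and therefore identical layer checksums."""
    with open(path) as f:
        lines = [line.strip() for line in f]
    specs = [line for line in lines if line and not line.startswith("#")]
    return sorted(specs, key=str.lower)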

@psychemedia

@betatim Things like pypa/pip#5761 suggest that order is arbitrary, although you can use multiple requirements.txt files, executed in order.

To maximise chances of hitting caches, I guess something like this might work?

  • define some "common" requirements.txt sets from cached requirements components (scipy_reqs.txt, geo_reqs.txt, etc);
  • define a set from the packages in a user's requirements file (user_reqs);
  • do a scipy_reqs.issubset(user_reqs) or user_reqs.issuperset(scipy_reqs) to find whether you have a cached set of packages available and install those if so;
  • identify remaining non-cached requirements: user_reqs.difference(scipy_reqs, geo_reqs).

Of course, if folk pin dependencies, this could make finding hits / matches much harder...

@davidrpugh
Contributor

davidrpugh commented Jul 21, 2019

I came across these lines in repo2docker/repo2docker/buildpacks/base.py

# Copy and chown stuff. This doubles the size of the repo, because
# you can't actually copy as USER, only as root! Thanks, Docker!
USER root
COPY src/ ${REPO_DIR}
RUN chown -R ${NB_USER}:${NB_USER} ${REPO_DIR}

I do not understand exactly why this would double the size of the repo, but you can now copy as the user. The following would do that:

COPY --chown=${NB_UID}:${NB_GID} src/ ${REPO_DIR}

I have used the copy-as-user approach and it seems to work as advertised. I can open a PR if this seems promising...

@betatim
Member Author

betatim commented Jul 21, 2019

@davidrpugh #164 is the issue about using (or not) --chown. The recent posts have gone a bit off topic but it is still the best place to go to find context.

@betatim
Member Author

betatim commented Jul 24, 2019

In #743 and #718 we added support for installing packages before we copy over the contents of the repository. This means that if you are re-building a repo where only the README.md has changed, the package install step should be cached (and fast)! Currently we give install.R, environment.yml and requirements.txt this special treatment. The best way to see this in action is to test it locally. (On mybinder.org we need to do some work to increase the chances that a rebuild is assigned to the same node as the original build.)

I think that, except for the COPY --chown, most ideas in this thread have either been tried and merged or rejected, so we can close this thread up. Thanks for all the ideas!
