
Simplify dock_from_desc? #34

Closed

riccardoporreca opened this issue May 16, 2019 · 5 comments

@riccardoporreca
Is there a reason why we are installing (some of) the package dependencies with an explicit remotes::install_cran()

x$RUN(paste0("R -e 'remotes::install_cran(\"", imp[i], "\")'"))

instead of getting this through the existing remotes::install_local()?
x$RUN("R -e 'remotes::install_local(\"/app.tar.gz\")'")

Is it to get as much work as possible done before the COPY, to leverage layer caching?

Beyond this, is there a big benefit to having individual RUN instructions installing each package separately, instead of a vectorized installation in a single layer? As soon as the package dependencies change, we can be arbitrarily lucky or unlucky with such intermediate layers.

Also, be it individual or vectorized, I am wondering if there is a specific reason to favor (CRAN) package installation via remotes::install_cran() over install.packages() or the neat install2.r.

Sorry for the many questions 😄 , I find golem very interesting and I am trying to bring up comments based on my experience / best practices and to consolidate these at the same time. Happy to discuss and contribute!

@ColinFay (Member)

👋,

Thanks a lot for this comment! I'm very glad to hear you like {golem}, and thanks for providing feedback.

TBH this does reflect an internal discussion we had about the correct way to build a Dockerfile for a Shiny app. I think the answer very much depends on where you are in the DevOps cycle of the app. The way it is built today is more dev-oriented: it is built with iteration and image rebuilding in mind, so it provides a way to make rebuilding your container for a new version quicker, making it easier to iterate in a dev and sandboxing context.

After that little piece of context, let me split my answer into several points.

Why do we install all dependencies beforehand?

You've guessed right: basically, the way the Dockerfile is built today reflects a structure you would want if you need to build and rebuild your container several times, changing the Dockerfile between each iteration.

We have a series of RUN install_cran before RUN install_local because this config:

FROM rocker/r-ver
COPY app_*.tar.gz /app.tar.gz
RUN R -e "remotes::install_local('/app.tar.gz')"

means that you will have to reinstall the whole set of dependencies every time you come up with a new version of the application. Which can happen a lot during dev :)

On the other hand

FROM rocker/r-ver
RUN R -e "remotes::install_cran('dep1')"
RUN R -e "remotes::install_cran('dep2')"
[...]
COPY app_*.tar.gz /app.tar.gz
RUN R -e "remotes::install_local('/app.tar.gz')"

will only rerun from the COPY onward, so you won't have to reinstall all the dependencies.

The downside: Docker can limit an image to 127 layers (this seems to be kernel dependent). And to be honest, I'm not sure there will be that many {golem} apps with more than 100 listed dependencies :)

I agree though that it's a dev format. So maybe we should have two flavours of Dockerfile (one dev, one ops?), with the production format being just the COPY and the install_local().

Why so many RUN statements instead of vectorizing things?

For the same reason as before: adding a dependency to your package doesn't rebuild the whole image. If we're still with the previous Dockerfile:

FROM rocker/r-ver
RUN R -e "remotes::install_cran('dep1')"
RUN R -e "remotes::install_cran('dep2')"
RUN R -e "remotes::install_cran('dep3')"
[...]
COPY app_*.tar.gz /app.tar.gz
RUN R -e "remotes::install_local('/app.tar.gz')"

then changing dep3 will only rebuild from the dep3 layer onward.

Well... provided you didn't run usethis::use_tidy_description(), which would reorder the dependency list alphabetically. But you get the idea 😄

Can you elaborate a little bit more about what you have in mind with:

As soon as the package dependencies change we can be arbitrarily lucky or unlucky with such intermediate layers.

There's a chance I'm missing something there :)

Why use remotes::install_cran()?

remotes::install_cran() doesn't reinstall a package that is already installed on the machine, while install.packages() does.

e.g.

FROM rocker/tidyverse
RUN R -e "remotes::install_cran('dplyr')" 

won't reinstall {dplyr}, while install.packages() would have.

Why use remotes::install_local()?

remotes::install_local() also installs the Remotes dependencies from DESCRIPTION, so you can easily link to a GitHub package, or any other source supported by {remotes} (Bioconductor, GitLab, Bitbucket, ...): https://remotes.r-lib.org/articles/dependencies.html
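For example, a DESCRIPTION could point one of its imports to a GitHub repository through a Remotes field (the names below are made up for illustration):

Imports:
    shiny,
    mypkg
Remotes:
    mygithubuser/mypkg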

Why do we use {remotes} at all?

I definitely agree that {littler} is cleaner in scripts. The reason for {remotes} and not install2.r is the same as the one for remotes::install_*() over install.packages(): install2.r reinstalls the packages that are already there on the image, so basically:

FROM rocker/tidyverse
RUN install2.r -e dplyr

will reinstall {dplyr} even though it is not required.

Sidenote: It's possible that future versions will rely on {pak}.

Simplify or complexify?

Maybe the function should not be simplified but, on the contrary, become a little bit more "complex", with the possibility to create one Dockerfile for dev and another for ops. What's your opinion on that?


Again, thanks a lot for your comment. It's kind of hard to find a solution that fits every context, and we tried to provide one, so I'm very eager to hear feedback about the way we do these things.

I'll be very happy to hear your thoughts about all this.

@riccardoporreca (Author) commented May 17, 2019

Thanks a lot @ColinFay for the great and complete reply, this is exactly the kind of discussion I was aiming to trigger: getting the insights and reasoning behind the approach used in {golem}.

I believe {golem} is a great way to attract people's attention to the topic of packaging Shiny apps and their containerized deployment, and it can be an excellent place to start discussing such topics.

Let me share some more thoughts following up on your points.

Individual RUN for package installation before install_local()

Indeed, this approach greatly helps in a dev-oriented setup, which is anyway where one typically starts, and as such it fits {golem} perfectly as a great starting point for new users.
Supporting an ops-oriented flavor is probably a good idea.

In general, the effectiveness of a single-package-per-layer approach depends on where the change in dependencies occurs. If a dependency is added / removed "at the bottom" of the DESCRIPTION file, this is what I referred to as "lucky". On the other hand, addition / removal "at the top" is the "unlucky" case. And indeed, I was thinking about tidy alphabetical sorting 😄, where it can really be a matter of luck (especially with shiny / tidyverse being towards the end of the alphabet).

I think these considerations can lead to recommending an educated ordering of dependencies that the package developer should take into account, following e.g. two principles:

  1. Major (also in terms of size and dependency tree), stable dependencies should be listed at the top.
  2. New dependencies can go at the bottom, unless they are part of a big refactoring or have a big importance going forward.

The same principles apply to a manual, educated construction of the Dockerfile, where the dependencies in 1. would get installed in layers before COPY, and those in 2. could even be left to install_local(). A natural choice for group 1. is shiny and dplyr/tidyverse, which probably take most of the installation time and space in a typical app.
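To illustrate, a manually ordered Dockerfile along these lines could look like this (the grouping is just an example, not something {golem} generates):

FROM rocker/r-ver
# group 1: big, stable dependencies, each cached in its own layer
RUN R -e "remotes::install_cran('shiny')"
RUN R -e "remotes::install_cran('dplyr')"
# group 2: recent / churning dependencies are left to install_local()
COPY app_*.tar.gz /app.tar.gz
RUN R -e "remotes::install_local('/app.tar.gz')"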

These considerations are of course very application-specific, so we cannot really expect {golem} to do all this for us.

remotes::install_cran()

Indeed, we don't want to waste the effort of re-installing packages.

About {littler}, note that you can use the --skipinstalled option:

FROM rocker/tidyverse
RUN install2.r --error --skipinstalled dplyr

Thanks for the pointer to {pak}, I totally missed this package!

Simplify or complexify?

I landed on this issue with simplification in mind, mainly driven by ops-related considerations. That said, the dev-related considerations coming out of this discussion do call for the complexity of the current implementation, and in my view dev matters more than ops in the context of {golem} at this stage. Supporting the two worlds (or hemispheres) with a more ops-y Dockerfile can definitely be a good direction!


Thanks again!

@cderv commented May 18, 2019

Great discussion! This may not be the perfect place, but since you started the discussion, I'll chime in to share some thoughts.

About deployment with Docker, specifically for a dev workflow, I prefer the approach of using a package-management tool (previously {packrat}, now {renv}) to separate package management from Docker and keep it closer to the R project. This also allows deployment without Docker and eases the setup of a new dev environment (for a new developer in the team, or a change of computer). I find it a more general approach.

Without Docker, it would mean just restoring the project dependencies using R code.
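With {renv}, for instance, that restore step is a single call (assuming the project ships an renv.lock lockfile):

# run inside the project: reads renv.lock and installs the pinned versions
renv::restore()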

With Docker, it means that a volume needs to be mounted for the project library, where the packages would be installed once; a new deployment would then not reinstall packages that are already installed. Very quick deployments. If the same deployment environment is used for many containers (it could be in a testing setup), then the cache mechanism of such tools would accelerate the process for all deployments.
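A minimal sketch of such a setup (the image name and library path are hypothetical):

# mount a persistent host directory as the project package library,
# so packages installed on the first run survive container rebuilds
docker run -d -p 3838:3838 \
  -v /srv/myapp-lib:/app/renv/library \
  mygolemapp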

The main advantage I see to this is that fixed package versions become possible. We use this at work: it allows working in teams with the developers, the IT teams, and the dataops who help set up the deployment, bridging R and the DevOps stuff.

The approach without a project package-management tool can be interesting too (always testing with the latest package versions, simplicity) and can be coupled with a project repository (a custom CRAN-like repository) that would help fix project versions of packages. This is something we use at work too (with a Nexus repository), though not for R packages.

@ColinFay if you are curious about what I have tried so far for Docker deployment of Shiny apps using what I described, I can work on an example of how it would work with {golem}.

Cheers.

@ColinFay (Member) commented May 20, 2019

@cderv thanks for the input! That's very enlightening.

I'd be very happy to read what you have in mind for a {golem}-specific example, so that we can potentially add that to the package :)

I was also thinking about how we could share the {pak} cache with the Docker image, so that we can make the installation quicker. Though that wouldn't be project-specific, and it would imply that all the packages on the user's computer have been installed with {pak}.

I also wonder if this could be an option of the Dockerfile creation, something like with_packrat, set to TRUE or FALSE. Setting it to TRUE would trigger a {packrat} setup.

@riccardoporreca ok got it :) indeed the alphabetical order can make one lucky (or not), and I'm not sure how to tackle that issue for now 🤔

What I regularly do is what you're suggesting: building a "stable" Docker image with the list of dependencies that I know won't change much, and a second one that FROMs the first, so rebuilding the whole image for this one is really quick.
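A sketch of that pattern, with a hypothetical base image name:

# Dockerfile.base, built once and tagged my-app-base
FROM rocker/r-ver
RUN R -e "remotes::install_cran('shiny')"

# Dockerfile, rebuilt on every new app version
FROM my-app-base
COPY app_*.tar.gz /app.tar.gz
RUN R -e "remotes::install_local('/app.tar.gz')"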

On one hand, if we switch to a simple install_local() we reduce the number of layers, but that means we need to rebuild everything during dev; that fits ops, though.

Thanks for pointing out the littler command. I didn't know you could use the --skipinstalled flag!

For now, though, I think the next effort will be to integrate {pak}: {pak} does parallel download and installation of dependencies, which could make the build faster.
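In a Dockerfile that could look something like this (just a sketch, since {pak} is not integrated yet):

# install {pak} once, then let it resolve and install dependencies in parallel
RUN R -e "install.packages('pak')"
RUN R -e "pak::pkg_install('dplyr')"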

I think we should indeed split the dockerfile creation into several options (at least one for dev, one for ops).

The choice of the image infrastructure being specific to each project, I'm wondering what the right overall approach to this is. I can already see four combinations: for dev / for ops, with / without environment-management tools.

One other thing worth considering is the date: currently, the rocker Dockerfiles are pinned to a specific date, linked to an MRAN snapshot repository. So, for example, if you have an older version of R but need a recent package, you have to manually change the date of the Docker image.
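For illustration, the date pinning boils down to something like this (the date and the Rprofile.site path are examples; the exact location may differ between images):

FROM rocker/r-ver:3.6.0
# point R at an MRAN snapshot of a fixed date instead of the image default
RUN echo 'options(repos = c(CRAN = "https://mran.microsoft.com/snapshot/2019-05-20"))' \
    >> /usr/local/lib/R/etc/Rprofile.site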

I wonder if this should be the default behaviour in the current context (deploying the app) and if we should let the user change that date with a function parameter 🤔

@ColinFay (Member)

Hey,

All docker-related functions have moved to {dockerfiler}.

We'll keep track of the issues here.
