[Discussion] reproducibility #170

Open
minrk opened this issue Dec 18, 2017 · 5 comments
Labels
enhancement, needs: discussion, reproducibility (enabling scientific reproducibility with r2d)

Comments

@minrk
Member

minrk commented Dec 18, 2017

Discussion issue for general topics of reproducibility and what's in and out of scope for repo2docker (and Binder).

We currently have a tension between our scientific goal of reproducibility and the maintenance goal of keeping everything up to date. We have the same issue as everyone who pursues reproducibility: specifying the environment as strictly as necessary (so it's correct), but no stricter (so it stays useful). The conservative approach is to over-specify the environment (e.g. pip freeze / conda env export), which we should make sure to support well and document for the more reproducibility-minded users.

A user who wants to ensure a truly reproducible build must:

  • use a pip freeze or conda env export-produced environment specification
  • pin the Python version (needed for pip; conda env export already records it)
  • pin the distro/base image
  • probably pin repo2docker itself (easy for manual use cases, not available on Binder)
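
To make the first bullet concrete, here is a minimal sketch of a check that flags requirements which are not pinned with `==`, as pip freeze output would be. The function name `unpinned_requirements` is made up for illustration, and the regex is a rough check rather than a full requirement parser (lines with environment markers or URLs are simply reported as unpinned).

```python
import re

# Rough pattern for "name==version", optionally with extras like name[extra]==1.0.
# This is an illustrative check, not a complete requirement-specifier parser.
PINNED = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*(\[[^\]]+\])?==[^=\s]+$")


def unpinned_requirements(text):
    """Return the requirement lines in `text` that are not pinned with '=='."""
    loose = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line:
            continue  # skip blank and comment-only lines
        if not PINNED.match(line):
            loose.append(line)
    return loose
```

An empty result means every entry carries an exact version; anything returned is a candidate for drifting between builds.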

Right now, the only truly reproducible builds available on Binder are custom Dockerfiles, which is something I want fewer people to use, not more. But we currently have no answer for reproducibility with any other builders, as there is no way for users to be sufficiently strict about the environment.

@choldgraf
Member

I think this is a super important topic, especially when it comes to the publishing world. This is related to #93, though that's a more specific topic.

@yuvipanda
Collaborator

I really like the idea of pinning repo2docker versions, which seems like the easiest (and maybe only?) solution to this problem. If we can guarantee that a properly prepared repo will always produce the same Dockerfile (rather than image, since we cannot guarantee that) for any given version of repo2docker, I think that's good enough, no?
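
One way to test the "same Dockerfile" guarantee would be to fingerprint the generated Dockerfile text and compare fingerprints across runs. The sketch below assumes the Dockerfile is already available as a string; how it is obtained from repo2docker is out of scope here, and both function names are hypothetical.

```python
import hashlib


def dockerfile_fingerprint(dockerfile_text):
    """Return a SHA-256 hex digest of the Dockerfile, ignoring trailing whitespace."""
    # Normalize trailing whitespace so cosmetic differences do not change the digest.
    lines = dockerfile_text.strip().splitlines()
    normalized = "\n".join(line.rstrip() for line in lines)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def same_dockerfile(first, second):
    """True if two generated Dockerfiles are identical after normalization."""
    return dockerfile_fingerprint(first) == dockerfile_fingerprint(second)
```

Comparing the text rather than the built image matches the distinction drawn above: the Dockerfile can be made deterministic even when the resulting image cannot.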

We might have to write version shims to maintain binderhub <-> repo2docker compatibility, but that doesn't seem too difficult. We could also switch from passing in command-line arguments to something more complex and versionable if we want.

@yuvipanda
Collaborator

Thinking more on this, there are three things we should try to allow users to pin:

  1. Versions of languages (Python, R, Julia, etc)
  2. Versions of libraries for the language installed by the language specific package manager (conda, pip, whatever R uses, etc)
  3. Versions of packages installed by the system package manager (apt)

We could / should use runtime.txt for (1), recommend pinning for (2), and make apt.yaml for (3). That's a good start I think, and gives us lots of low-hanging fruit to work with...
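
For (1), a rough sketch of reading a language pin out of runtime.txt. It assumes the file holds a single `<language>-<version>` entry such as `python-3.6`; the function name `parse_runtime` is illustrative and is not taken from repo2docker's code.

```python
def parse_runtime(text):
    """Split a runtime.txt entry like 'python-3.6' into (language, version).

    Assumes a single '<language>-<version>' entry; raises ValueError otherwise.
    """
    spec = text.strip()
    # Split on the first '-' only, so the version part may itself contain dashes.
    language, sep, version = spec.partition("-")
    if not sep or not language or not version:
        raise ValueError(f"expected '<language>-<version>', got {spec!r}")
    return language.lower(), version
```

Failing loudly on a malformed entry seems preferable to silently falling back to a default version, since a silent fallback is exactly the kind of drift pinning is meant to prevent.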

@betatim
Member

betatim commented Jan 12, 2018

More thoughts on reproducibility: should we freeze conda build numbers as well, or not?
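
For context, conda env export writes dependencies as `name=version=build`, and `conda env export --no-builds` drops the build part. A small sketch of the two levels of strictness, using hypothetical helper names:

```python
def parse_conda_spec(spec):
    """Split 'name=version[=build]' into (name, version, build or None)."""
    parts = spec.strip().split("=")
    # Exactly two or three non-empty fields are expected.
    if len(parts) not in (2, 3) or not all(parts):
        raise ValueError(f"unexpected conda spec: {spec!r}")
    name, version = parts[0], parts[1]
    build = parts[2] if len(parts) == 3 else None
    return name, version, build


def drop_build(spec):
    """Return the spec pinned to name and version only, without the build string."""
    name, version, _build = parse_conda_spec(spec)
    return f"{name}={version}"
```

Keeping the build string is the stricter choice, but build strings are typically platform-specific, so a spec frozen with builds on one platform may not resolve on another.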

@manics
Member

manics commented May 17, 2021

Definitely an important discussion, but probably something we'll need to engage with the community on at https://discourse.jupyter.org/, especially if at some point we need to make major upgrades to R2D (e.g. the base image?).


7 participants