Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] Remove more files during the image build to slimdown the image #709

Merged
merged 2 commits into from
Jun 22, 2019

Conversation

betatim
Copy link
Member

@betatim betatim commented Jun 20, 2019

This is a test related to #707: removing static libraries and such.

@betatim betatim force-pushed the slimdown-image branch 2 times, most recently from 2dbc0aa to cdcbc93 Compare June 20, 2019 18:41
@betatim
Copy link
Member Author

betatim commented Jun 20, 2019

Just built https://github.com/betatim/requirements with master and this branch:

master_r2d   latest                 ea3cfcabc70d        30 seconds ago      1.72GB
slim_r2d     latest                 bc5f9a5997f4        12 minutes ago      1.45GB

Which is surprising. 300MB gone like that? Makes me suspect we deleted something that was needed.

@betatim betatim changed the title [WIP] Remove more files during the image build to slimdown the image [MRG] Remove more files during the image build to slimdown the image Jun 21, 2019
@betatim
Copy link
Member Author

betatim commented Jun 21, 2019

I think we should merge this as it already shrinks the image size and has no downsides as far as I can tell. The decrease in coverage seems to be a flake? codecov thinks it is in the R buildpack which is untouched :-/

Copy link
Member

@consideRatio consideRatio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing my best to review, without actual understanding of anything here but reading up as I go ;D

Things I've learned

.a files

.a files are static libraries typically generated by the archive tool.
--- https://stackoverflow.com/questions/5965171/what-is-a-file-with-extension-a

.pyc files

They contain Python bytecode. Precisely when they're created and updated is IDE/interpreter-dependent (please correct me if I'm wrong here).

Since Python has to be converted to bytecode before it can be interpreted, keeping .pyc files around can help speed up execution, since you don't have to recompile. If there's a .pyc file around, there should be a .py file around with the same name.

Here are some more details:
http://www.network-theory.co.uk/...

You can also force their creation to speed up execution (see the link for details).

Fun fact: You should be able to ship only the .pyc files in your module if you want lightweight/basic obfuscation in your shipped application.
--- https://www.quora.com/What-are-Python-pyc-files-and-how-are-they-used

.js.map

Javascript source maps, for debugging purposes to errors from minified javascript files to something that can be better understood.
--- https://www.html5rocks.com/en/tutorials/developertools/sourcemaps/

Actions to consider


This PR seems reasonable to me, but I know far too little about the details to really tell what the outcome will be in practice.

While I would not be able to do much about the info, I assume it could be relevant to realize how much MB savings comes from clearing .a, .pyc and .js.map respectively. I don't have a guesstimate.

@betatim
Copy link
Member Author

betatim commented Jun 21, 2019

Switched to using -L. Nice find!

I think the summaries you found are spot on. The ideas for which files to delete come via the blog post linked in #707. I think it is safe to delete them and/or some shouldn't be there in the first place (like the .js.map files). I pinged the nteract team about getting a smaller "production" build of nteract_on_jupyter which is one of the largest contributors in terms of .js.map files. They will look into it.

@consideRatio
Copy link
Member

consideRatio commented Jun 21, 2019

Wow excellent resource thanks for linking!

What about the mentioned environment variable PYTHONDONTWRITEBYTECODE=true associated with removing the .pyc files? Do we have such variable set? (Reference for PYTHONDONTWRITEBYTECODE)

If that is managed already, this LGTM!

@betatim
Copy link
Member Author

betatim commented Jun 22, 2019

I think that we should allow Python to write .pyc files when the user is using the pod as that doesn't change the image's size. Before adding it as default to the image, with the intention of preventing installers from creating .pyc files I think we'd have to do more digging to understand if there are ways for install commands to ignore this variable. At least I have a vague memory of having seen log output when installing packages that explicitly refers to "byte compiling packages" and I wonder if that would be prevented by this env variable (which in turn would allow us to remove the find command to delete them.

TL;DR: We'd have to do more digging so let's punt it.

@consideRatio
Copy link
Member

I'm okay with all options and i'm currently clueless if it would be good to set the env var or not as it may or may not slow down the initial run a lot or not for example.

LGTM!

@consideRatio
Copy link
Member

@betatim id like someones help on reviewing/merging Pipfile buildpack, but im afriad everybody is out of time to spend and it has gotten stale...

I put in quite an effort to make it quicker and easier to overview recently, with full understanding of limited time and that it is hard to split up focus on many things by adding another thing to do on the list, id be very appreciative if you could find the time to look it through.

@betatim
Copy link
Member Author

betatim commented Jun 22, 2019

Merging as you approved it. Thanks for looking at this!

@betatim betatim merged commit 199d14a into jupyterhub:master Jun 22, 2019
@betatim betatim deleted the slimdown-image branch June 22, 2019 13:18
@SylvainCorlay
Copy link
Collaborator

This breaks some binders that we are working on, because of the deletion of the static libraries (.a).

Indeed, we perform some compilation in the postbuild scripts, and these static libraries may be needed at this time. Beyond that, even if we actually do not use the static libraries, if the cmake targets file was generated assuming that they would be present, we can't make use of these cmake targets file anymore once the static library has been deleted.

A solution would be to perform that deletion after the postbuild script has run.

cc @VasavanThiru.

@betatim
Copy link
Member Author

betatim commented Jun 24, 2019

Can you explain a bit more (or link to an example) of when you create the .a files and when you use them?

@VasavanThiru
Copy link

Thanks, @betatim If you want a clear example where the binder seems to delete the static libraries, you can follow this link for more information about our issue.

Xeus-Calc Binder

@SylvainCorlay
Copy link
Collaborator

So indeed, it is generally reasonable to remove the static libraries (.a) since there are typically not used at runtime unlike share objects (.so).

However, there are several usecases of binder that this breaks:

  1. cling world, where the C++ interpreter may make use of static libraries at runtime. But let's put the cling-crazyness aside for now, and this is not the issue here.
  2. binders in which the postBuild scripts performs some compilation, which is a nice way to produce a binder for in-development stage of a project that would not be on conda-forge yet.

Expanding on (2):

This bit us for a project by two @QuantStack interns (@VasavanThiru and @ThibaultLacharme), that is a xeus-based "calculator" kernel for Jupyter called xeus-calc. By default, the xeus package includes both a shared object and a static library for xeus, and the xeusTargets.cmake manifest declares both.

Unfortunately, even if they want to use only the shared object, they include xeusTargets in their cmake and cmake crashes complaining that the files refered-to by the targets are missing.

A solution to that issue would be to only perform the deletion of static libraries after the postBuild stage has completed.

When installing the static library is produced, we also declare it in the xeusTargets.cmake manifest that is installed with the library, just like the shared object. Any package including

@minrk
Copy link
Member

minrk commented Jun 24, 2019

I think we should definitely not remove static libs. Removing static libraries is not a safe thing to do, since there are plenty of packages that may do run-time compilation (cython, fenics, cling, etc.) which this can break. If a package includes static libs, they should definitely be left alone.

I'm also a bit skeptical that removal of .pyc files is also recommended, since this adds a runtime cost to rebuild these files on load. But this is optimizing image size vs runtime cost, which is less clear what the best thing to do is in any given situation. In particular, this trades once-per-node startup time (image pulling .pyc files) with a cost paid every startup (recompiling used .pyc files, which is certainly fewer).

In general, I think we should try to avoid making a repo2docker install different from a normal installation, at least except where they are known cache/optimization files (.pyc qualifies, .a does not) since it deviates from the "automate best practices" goal.

Would it be okay to revert this PR and try again, perhaps?

@betatim
Copy link
Member Author

betatim commented Jun 24, 2019

I think reverting this PR is a good idea.

I'd probably restart with just deleting the JS map files.

For pyc I'd be Ok with the runtime overhead though if we are revisiting it I'd investigate how much space this actually saves.

@minrk
Copy link
Member

minrk commented Jun 24, 2019

Thanks! It would be cool, too, to breakdown the filetypes by the size impact as well, to see which ones are helping the most.

For .a files, it's possible that the 'right' thing to do is to push conda packages to split static libs into a separate package (this is the direction many are going), so that they will just start showing up less in dependencies when they aren't really needed, rather than trying to optimize here.

@SylvainCorlay
Copy link
Collaborator

For .a files, it's possible that the 'right' thing to do is to push conda packages to split static libs into a separate package (this is the direction many are going), so that they will just start showing up less in dependencies when they aren't really needed, rather than trying to optimize here.

+1. If conda-forge was to adopt more 'debian style' naming conventions, we could have split packages for doc, dev and the like...

@minrk
Copy link
Member

minrk commented Jul 16, 2019

conda-forge is in the process of splitting at least some static libs: conda-forge/hdf5-feedstock#112

markmo pushed a commit to markmo/repo2docker that referenced this pull request Jan 22, 2021
…upyterhub#709)

[MRG] Remove more files during the image build to slimdown the image
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants