-
Notifications
You must be signed in to change notification settings - Fork 362
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG] Remove more files during the image build to slimdown the image #709
Conversation
2dbc0aa
to
cdcbc93
Compare
Just built https://github.com/betatim/requirements with
Which is surprising. 300MB gone like that? Makes me suspect we deleted something that was needed. |
I think we should merge this as it already shrinks the image size and has no downsides as far as I can tell. The decrease in coverage seems to be a flake? codecov thinks it is in the R buildpack which is untouched :-/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doing my best to review, without actual understanding of anything here but reading up as I go ;D
Things I've learned
.a files
.a files are static libraries typically generated by the archive tool.
--- https://stackoverflow.com/questions/5965171/what-is-a-file-with-extension-a
.pyc files
They contain Python bytecode. Precisely when they're created and updated is IDE/interpreter-dependent (please correct me if I'm wrong here).
Since Python has to be converted to bytecode before it can be interpreted, keeping .pyc files around can help speed up execution, since you don't have to recompile. If there's a .pyc file around, there should be a .py file around with the same name.
Here are some more details:
http://www.network-theory.co.uk/...You can also force their creation to speed up execution (see the link for details).
Fun fact: You should be able to ship only the .pyc files in your module if you want lightweight/basic obfuscation in your shipped application.
--- https://www.quora.com/What-are-Python-pyc-files-and-how-are-they-used
.js.map
Javascript source maps, for debugging purposes to errors from minified javascript files to something that can be better understood.
--- https://www.html5rocks.com/en/tutorials/developertools/sourcemaps/
Actions to consider
- Apparently the -follow option that I learned is aimed to follow symbolic links is deprecated, and -L is suggested now: http://www.gnu.org/software/findutils/manual/html_mono/find.html#Symbolic-Links. Replace -follow with -L ?
This PR seems reasonable to me, but I know far too little about the details to really tell what the outcome will be in practice.
While I would not be able to do much about the info, I assume it could be relevant to realize how much MB savings comes from clearing .a, .pyc and .js.map respectively. I don't have a guesstimate.
Switched to using I think the summaries you found are spot on. The ideas for which files to delete come via the blog post linked in #707. I think it is safe to delete them and/or some shouldn't be there in the first place (like the |
Wow excellent resource thanks for linking! What about the mentioned environment variable If that is managed already, this LGTM! |
I think that we should allow Python to write TL;DR: We'd have to do more digging so let's punt it. |
I'm okay with all options and i'm currently clueless if it would be good to set the env var or not as it may or may not slow down the initial run a lot or not for example. LGTM! |
@betatim id like someones help on reviewing/merging Pipfile buildpack, but im afriad everybody is out of time to spend and it has gotten stale... I put in quite an effort to make it quicker and easier to overview recently, with full understanding of limited time and that it is hard to split up focus on many things by adding another thing to do on the list, id be very appreciative if you could find the time to look it through. |
Merging as you approved it. Thanks for looking at this! |
This breaks some binders that we are working on, because of the deletion of the static libraries ( Indeed, we perform some compilation in the postbuild scripts, and these static libraries may be needed at this time. Beyond that, even if we actually do not use the static libraries, if the cmake targets file was generated assuming that they would be present, we can't make use of these cmake targets file anymore once the static library has been deleted. A solution would be to perform that deletion after the postbuild script has run. cc @VasavanThiru. |
Can you explain a bit more (or link to an example) of when you create the |
Thanks, @betatim If you want a clear example where the binder seems to delete the static libraries, you can follow this link for more information about our issue. |
So indeed, it is generally reasonable to remove the static libraries (.a) since there are typically not used at runtime unlike share objects (.so). However, there are several usecases of binder that this breaks:
Expanding on (2): This bit us for a project by two @QuantStack interns (@VasavanThiru and @ThibaultLacharme), that is a xeus-based "calculator" kernel for Jupyter called xeus-calc. By default, the xeus package includes both a shared object and a static library for xeus, and the Unfortunately, even if they want to use only the shared object, they include xeusTargets in their cmake and cmake crashes complaining that the files refered-to by the targets are missing. A solution to that issue would be to only perform the deletion of static libraries after the postBuild stage has completed. When installing the static library is produced, we also declare it in the |
I think we should definitely not remove static libs. Removing static libraries is not a safe thing to do, since there are plenty of packages that may do run-time compilation (cython, fenics, cling, etc.) which this can break. If a package includes static libs, they should definitely be left alone. I'm also a bit skeptical that removal of .pyc files is also recommended, since this adds a runtime cost to rebuild these files on load. But this is optimizing image size vs runtime cost, which is less clear what the best thing to do is in any given situation. In particular, this trades once-per-node startup time (image pulling .pyc files) with a cost paid every startup (recompiling used .pyc files, which is certainly fewer). In general, I think we should try to avoid making a repo2docker install different from a normal installation, at least except where they are known cache/optimization files (.pyc qualifies, .a does not) since it deviates from the "automate best practices" goal. Would it be okay to revert this PR and try again, perhaps? |
I think reverting this PR is a good idea. I'd probably restart with just deleting the JS map files. For |
Thanks! It would be cool, too, to breakdown the filetypes by the size impact as well, to see which ones are helping the most. For .a files, it's possible that the 'right' thing to do is to push conda packages to split static libs into a separate package (this is the direction many are going), so that they will just start showing up less in dependencies when they aren't really needed, rather than trying to optimize here. |
+1. If conda-forge was to adopt more 'debian style' naming conventions, we could have split packages for |
conda-forge is in the process of splitting at least some static libs: conda-forge/hdf5-feedstock#112 |
…upyterhub#709) [MRG] Remove more files during the image build to slimdown the image
This is a test related to #707: removing static libraries and such.