Add nvidia based notebooks to stack #1196
Conversation
Not sure the failing test is correct, as the run_hook was copied from the regular scipy image, which also doesn't have an empty line at the end of the file.
@davidspek Do you know how big the *-nvidia images are? In the past we had issues with large images that took a long time to build.
@Bidek56 I already created my own repo using the same GitHub actions for testing. Seeing as the only difference between the minimal and scipy notebook is the base image and for tensorflow the base image and base-notebook is 2.82 GB |
@davidspek Do you know how big the *-nvidia images are? Thx
@davidspek Thank you for sharing your work! We've been requesting that all new image definitions take the community images path, primarily because this repository is unwieldy and we're already finding it difficult to maintain the images that already exist here. It sounds like you already have a nice project setup for building the images. Would you be open to submitting a pull request linking to it from the documentation instead of this PR? We've also discussed including closed source software in the images maintained here in the past and decided against it (#706, #516). While I've heard the CUDA EULA has become more lenient, I believe it's still closed-source.
@Bidek56 Sorry for the confusing naming scheme above, as I had not changed the names of the images to *-nvidia, but those sizes are for the nvidia based images. @parente Thank you for informing me about the community image path. While I understand the reasoning behind it, I personally think an official source for GPU based images would be handy. There already is a community image path for GPU docker images, but these are somewhat outdated, and I think users are more likely to submit pull requests to the official jupyter repository. Regarding the maintenance, would it be manageable to have the
Regarding the closed source software discussion: while CUDA itself is closed source, there should be no issue with having these dockerfiles and the resulting images on dockerhub. As discussed here:
It seems the only thing that would need to be done is including https://docs.nvidia.com/cuda/eula/index.html and https://gitlab.com/nvidia/container-images/cuda/blob/master/LICENSE with the dockerfile (and maybe also in the image itself), which is an easy thing to add. Also, given that I based the
Given all this, would there be a possibility to add official GPU images using the nvidia base image? What would the requirements be to be able to add these images?
@davidspek One idea (legal stuff aside) is to add the CUDA libraries at the end with a single code base. I am not sure where you are getting the image sizes from, but when I build them, the GPU images are 10x the size:
@Bidek56 I agree that it would be better to handle the downstream images differently, as only the base image needs to be changed (except for the tensorflow Dockerfile, since the GPU version of TF needs to be installed). This is also what I suggested doing above, and I wanted some input on how to best achieve this before spending time figuring it out. Regarding adding the CUDA libraries at the end: I haven't had success with just installing the apt packages; there seem to be some other prerequisites in the CUDA base image that also need to be installed. I could look into that, but it brings with it the maintenance of whatever goes on in the CUDA base image rather than using the image directly. I was grabbing the sizes from dockerhub as that was what I had at hand at the time. Given the large amount of software that is installed, I am not surprised that the image size is 6.39 GB, but that seems fairly unavoidable. Another option would be to add just a
@davidspek I think you are spending a lot of time on these images, but a few GPU images already exist: joehoeller, rlan
@Bidek56 I know images exist, but as you can see, the repo from joehoeller pulls from the repo from rlan. The base images from rlan are also not kept very up to date. Also, I don't think there will be as much of a community effort to keep those repos up to date as there is for the official jupyter-stack repo. Furthermore, with the security of docker images being in the news often lately, it is important to use images from a trusted source. With the two repos you describe, there are multiple steps through other maintainers before you get to the actual base image that is used. I have just added a
I think extending the existing images like christianscharr or radix-ai should be sufficient.
Do you mean using those as a base image? While the Dockerfiles might always pull the most recent version of jupyter/scipy, the images on dockerhub (if they are on there, as I don't see a link) will not be up to date unless they manually build and push a new image for each update to the official jupyter stack. Also, that doesn't address the community's effort to update packages or docker image security (you're still pulling from an unknown entity; who says the Dockerfile on GitHub matches the actual image?).
I have no idea, but I suspect a majority of jupyter users at this point would use GPU access if they had it. "Scientists / data scientists that don't have flows / libraries which can benefit from GPU acceleration" is a shrinking set. That said, it /is/ a horrifying package size just to add "gpu drivers", which makes all aspects of the build system more painful. We're very much experiencing that with our own jupyter-notebook+nvidia builds.
@snickell This might be my lack of experience talking, but what exactly makes the build system more painful just because of image size? I'm happy to try and find a solution if I know what the problems are.
@davidspek The problems are:
I would suggest taking the community images path after you figure out how to add the CUDA libraries to each of the Dockerfiles.
@Bidek56
I see that you changed the PR to just one extra image?
@Bidek56 Indeed, it is just one extra image now. It can't use the tensorflow-notebook as a base, as tensorflow-gpu needs to be installed; that is, unless further increasing the image size by installing tensorflow-gpu on top of tensorflow-cpu, or uninstalling tensorflow-cpu and then installing tensorflow-gpu during the build, is acceptable. If the issue is a time limit, I can see that what costs the most time by far is resolving conda packages. A potential fix is to have conda install mamba in the base or minimal notebook (or whichever notebook installs conda) and have mamba resolve the later packages. This might speed up package resolution and, I believe, will still allow users to use conda to install packages in a notebook.
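A minimal sketch of that idea (the conda-forge channel and the package pin are assumptions, not the stack's actual recipe):

# Install mamba once via conda, then let mamba do the heavy dependency
# resolution; conda itself stays available for users in the notebook.
RUN conda install --quiet --yes --channel conda-forge mamba && \
    mamba install --yes 'tensorflow-gpu' && \
    conda clean --all --yes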
@Bidek56 The build has passed both on my own repo and for the check here. The build time seems to be about 10 minutes for the image as it is now. Does that seem acceptable?
GH Actions says: 1h 0m 10s vs. ~40m for the other images. @romainx what do you think?
I can still try to optimize the image further, both in build time and size, and I will do some final testing with the image to make sure the GPU is properly initialized. But first I'd like to discuss in what way and with what requirements it would be possible to add this. Honestly, I think most people using tensorflow are probably using it with a GPU, and it would be nice for people to be able to use the tagging infrastructure and trusted source of the official repo for their base images.
Great job!
I will not comment on the question of merging or not merging this into the official docker images, because I'm not a maintainer and I do not know all the policies here.
But I made some comments, which I hope will make your image a bit better.
And I know nothing about GPUs, sorry for that :)
tensorflow-notebook-gpu/Dockerfile (outdated diff)
curl -fsSL https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub | apt-key add - && \
echo "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 /" > /etc/apt/sources.list.d/cuda.list && \
echo "deb https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 /" > /etc/apt/sources.list.d/nvidia-ml.list && \
We have been using 20.04 for quite a while. And, as I see here, an ubuntu2004 repo is available.
The same goes for the other repos as well.
The reason the 18.04 repo is used is that not all the packages seem to be present in the 20.04 repo. I am using the tensorflow Dockerfiles (which use 18.04) as an example for this Dockerfile, and if I remember correctly, the following packages cannot be found in that repository:
libcudnn8=${CUDNN}+cuda${CUDA}
libnvinfer${LIBNVINFER_MAJOR_VERSION}=${LIBNVINFER}+cuda${CUDA}
libnvinfer-plugin${LIBNVINFER_MAJOR_VERSION}=${LIBNVINFER}+cuda${CUDA}
At least https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/libcudnn8_8.0.5.39-1+cuda11.1_amd64.deb exists.
Could you please check if your info is up-to-date? If something is missing, then I guess it's ok to use the old repos, but it's worth adding a comment listing the packages missing from the newer Ubuntu version's repo.
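One quick way to check, as a sketch (assumes an Ubuntu 20.04 container with the NVIDIA apt key already installed, and the repo line from the diff above):

# Point apt at the ubuntu2004 repo and list the candidate versions it
# offers for the packages in question (package names are illustrative).
echo "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 /" > /etc/apt/sources.list.d/cuda.list
apt-get update
apt-cache madison libcudnn8 libnvinfer7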
curl -fsSL https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub | apt-key add - && \
echo "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 /" > /etc/apt/sources.list.d/cuda.list && \
echo "deb https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 /" > /etc/apt/sources.list.d/nvidia-ml.list && \
apt-get purge --autoremove -y curl \
This is nice!
We should probably delete so-called "build requirements" in other packages as well.
Could you create an issue please?
#1199
Actually, I just noticed this line is stupid here as on line 61 curl gets installed again haha.
But you can delete it there as well 👍
cuda-cudart-11-1=11.1.74-1 \
cuda-compat-11-1 \
&& ln -s cuda-11.1 /usr/local/cuda && \
Let's create an environment variable for the versions here or use the existing ones.
Would it then not also be a good idea to use environment variables for the tensorflow version?
I think not, because it is not duplicated many times. And creating variables for each conda/pip package would make some Dockerfiles harder to read.
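A sketch of the suggested change, with illustrative (unverified) version values:

# Pin the CUDA versions once, so that a CUDA bump only touches the ENV block.
ENV CUDA_PKG_SUFFIX=11-1 \
    CUDA_VERSION=11.1 \
    CUDA_CUDART_VERSION=11.1.74-1

RUN apt-get update && apt-get install -y --no-install-recommends \
    cuda-cudart-${CUDA_PKG_SUFFIX}=${CUDA_CUDART_VERSION} \
    cuda-compat-${CUDA_PKG_SUFFIX} && \
    ln -s cuda-${CUDA_VERSION} /usr/local/cuda && \
    apt-get clean && rm -rf /var/lib/apt/lists/*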
\`\`\`
$(docker run --rm ${IMAGE_NAME} apt list --installed)
\`\`\`
EOF |
We should probably add some GPU-specific info here.
I think several images, which do have some specific stuff, have slightly different run_hook files.
I was thinking about this as well. It would probably also be good to have an image tag with the CUDA version.
Agree.
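As a sketch, a GPU-specific addition to the run_hook could follow the same heredoc pattern as the apt listing above (the MANIFEST_FILE variable name and the grep filter are assumptions):

# Append CUDA-related package versions to the build manifest.
cat << EOF >> "${MANIFEST_FILE}"
## CUDA Packages

\`\`\`
$(docker run --rm ${IMAGE_NAME} apt list --installed 2>/dev/null | grep -E 'cuda|cudnn|nvinfer')
\`\`\`
EOF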
@@ -0,0 +1,30 @@
# Copyright (c) Jupyter Development Team. |
Is it possible to check some GPU-specific stuff here?
As far as I understand, this file is the same as tensorflow-notebook/test/test_tensorflow.py, so it might be worth checking something else.
While I agree it would be good to test GPU stuff here, I'm not sure it will be possible to test functional stuff. For example, what I noticed when I created a non-functional GPU docker image is that nvidia-smi ran without a problem and I could see the GPU info; however, for the CUDA version it would say N/A, and as such the image was not usable. Doing a test like that here, or something like torch.cuda.is_available() in pytorch, would be great, but it would require a GPU on the machine running the tests.
Do you have an idea of what type of tests could be done without requiring a GPU?
That's tricky, yes.
To be honest, I'm not familiar with GPUs at all, so no ideas so far.
For now, I say we just run nvidia-smi, as you did, and check that the exit code is zero.
That's not perfect, but better than nothing.
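A minimal sketch of that check (the image tag is an assumption):

# docker run propagates nvidia-smi's exit status, so a non-zero code
# fails this line (and the CI job when run under set -e).
docker run --rm jupyter/tensorflow-notebook-gpu:latest nvidia-smi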
- [Jupyter Docker Stacks on ReadTheDocs](http://jupyter-docker-stacks.readthedocs.io/en/latest/index.html)
- [Selecting an Image :: Core Stacks :: jupyter/tensorflow-notebook](http://jupyter-docker-stacks.readthedocs.io/en/latest/using/selecting.html#jupyter-tensorflow-notebook)
- [Image Specifics :: Tensorflow](http://jupyter-docker-stacks.readthedocs.io/en/latest/using/specifics.html#tensorflow)
We should definitely document which GPUs are supported.
It should be either a list of GPUs or some kind of requirements.
I think the main requirement is that CUDA and the NVIDIA drivers must be installed on the host. As for GPUs, I think anything released in the last 10 years will work; if I look at the cuda-gpus page, the lowest entry is the GeForce GT 430. Arguably, anybody looking to use tensorflow with GPUs will have a more modern GPU and will (hopefully) have spent the time to make a good decision about which GPU to get for their ML work.
Having the first link with a comment about the host driver is fine for me.
Regarding the image size, I have used Dive (an amazing tool I just discovered) to try and figure out how to reduce it. The addition of CUDA, cuDNN, libnvinfer (TensorRT), etc. is 5.2 GB. When exploring the actual files added in the layer, 5.1 GB comes from the packages mentioned earlier. Looking even deeper, that size is made up mostly of a few packages which seem to be essential to CUDA, cuDNN and TensorRT. Given that I used the official tensorflow image as a template to create this image, I am not sure what parts of the installation can be omitted; I also wonder whether removing something such as TensorRT (which is not strictly necessary for using GPUs, but I believe is needed for using the tensor cores) is worth it, as it does make the image less useful to users.
Dive is an amazing tool indeed. I also used it to optimize docker images a while ago.
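For reference, a typical invocation against a built image (the tag is an assumption):

# Open an interactive, layer-by-layer view of the image filesystem to
# spot large files that could be trimmed.
dive jupyter/tensorflow-notebook-gpu:latest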
Hello, I'm jumping into the conversation just to summarize my understanding of the situation regarding this PR.
I don't have a clear opinion on this subject. However, I'm guessing demand for this kind of out-of-the-box setup is growing, so it might make sense to take a step in that direction if everything is clear and aligned with Jupyter's policy. Sorry it's not a clear black-or-white answer, but I don't know the subject well enough.
Hello,
Thanks for this contribution. A great review has already been done by @mathbunnyru; I've nothing to add except some minor points. For how to go further with this PR, see my comment.
Best.
@@ -0,0 +1,2 @@
# Documentation
README.md |
Add hooks and test dirs like in the PR #1201
RUN apt-get update && apt-get install -y --no-install-recommends \
gnupg2 curl ca-certificates && \
curl -fsSL https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub | apt-key add - && \
You could use the wget that is already available instead of installing curl; it should be easier.
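A sketch of that change against the snippet above (wget's -q keeps it quiet and -O- writes the key to stdout):

# Fetch the NVIDIA repo key with the preinstalled wget instead of installing curl.
RUN wget -qO- https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub | apt-key add -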
@romainx Thank you for your comments. At the moment I am busy with a Kubernetes cluster migration, but I hope to have some time to look into your points soon. Regarding the use of
@davidspek you're right, but we could decorate it with a special marker, as in docker-stacks/test/test_outdated.py (lines 13 to 14 at a0a544e). This marker is skipped during normal test execution.
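A sketch of the pattern (assumptions: the marker name info matching test_outdated.py, the image tag, and docker being available on the test host):

import subprocess

import pytest

@pytest.mark.info  # deselected in normal runs, e.g. pytest -m "not info"
def test_nvidia_smi_exit_code():
    # GPU smoke test: only meaningful on a host with an NVIDIA GPU and driver.
    result = subprocess.run(
        ["docker", "run", "--rm", "jupyter/tensorflow-notebook-gpu:latest", "nvidia-smi"],
        capture_output=True,
    )
    assert result.returncode == 0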
@romainx Thanks for your summary and @mathbunnyru for contributing a review. I'd like to provide a bit more background on one of the points above and add another. @romainx said:
The last time we, meaning various Project Jupyter maintainers, discussed this subject, we decided against distributing non-open-source software as a matter of principle, not because of the NVIDIA EULA per se: #516 (comment) If there's interest and an argument for revisiting that decision, we should find an appropriate venue to make the decision-making process more transparent than the last time, perhaps relying on jupyter/governance#87. I mentioned:
I've personally viewed the images here as being in "maintenance mode" for some time now: sustainable as-is but not something worth extending. My comment (jupyterhub/mybinder.org-deploy#1474 (comment)) on an issue in the Binder project explains why I view the project this way. My recommendation for making this PR a separate community stack stems from this perspective and experience. (For what it's worth, @minrk's take is similar: jupyterhub/mybinder.org-deploy#1474 (comment)) All of this is my opinion. If people feel strongly that this project has a future beyond the status quo, want to revisit the de facto approach of recommending community stacks, and are ready to take on the mantle of maintenance and extension here, I don't want to be a blocker. In full disclosure, I have not used the stacks images for at least 4 years now, and it's probably past time for me to move on.
Thank you @parente for your -- as always -- wise, meaningful and very carefully written comment. According to this comment (and I share the same opinion on this point), the closed-source problem is a show stopper, and the conclusion is to take the community images path, despite all this valuable work -- thanks again @davidspek. On the last part, which more generally concerns the objectives and the future of this project, I would like to give my vision, which is a little different.
I agree they are sustainable as-is but not something worth extending. Still, we did a lot of work to make the images well written, linted, tested and documented. The proof is that the images are up to date. I think the main blocker now, even if the build is quite robust, is that it is too monolithic. If we add additional stacks, the build will grow, become error prone, and publishing all the stacks will become a pain. The build takes between 40 minutes and 1 hour to complete, and if a single thing fails (a test, a package installation) everything fails. I've created an issue for that: #1203. My conviction is that a lot of people are happy with these stacks.
Based on the notebook shared in #693, I've made a quick and dirty comparison showing the average number of pulls per day for each image; it seems obvious that they are used. So I will be happy to continue my effort to maintain and improve the stack in a reasonable and measured way, as we have always tried to do. @parente I hope you will be a part of it, because you have done so many things to achieve this success (and despite some negative points you highlight, it is a success) and because it is really a pleasure to collaborate with you and learn from you 😍.
@romainx I don't want to eject and leave you as the sole maintainer here either. I will 100% support you with reviews and merges when @-tagged for as long as I can. I just don't think I'm a good source of inspiration on "what comes next?" as a non-user. On that note, we should definitely keep an eye on expanding the pool of maintainers, especially if you intend to grow the number of stacks and architectures through build and automation improvements.
@parente cool, this is very good news! I am relieved 😌.
I'm 100% for expanding the pool of maintainers, but my intent is not to grow the number of stacks by changing the orientation of the community stack path. I agree that we cannot maintain notebooks here for a bunch of different technologies; I think we need to stick with this core stack. But even in this scope, some extensions could make sense: supporting other CPU architectures or providing light Sparkmagic images to connect to existing Spark clusters.
@romainx I have no experience with being a maintainer on any project, but what type of work/commitment does it take and what are the responsibilities? I am mostly busy with Kubeflow, and the Notebooks working group is just getting started, so this PR is potentially useful there instead. However, depending on how things go, it might be an idea to further discuss an "official" tensorflow GPU notebook repo if you are open to it.
@davidspek Like I said, I still think that a tensorflow GPU notebook is a good idea; it seems to be confirmed by the image pull figures above.
@davidspek Hi there, first of all I want to thank you a lot for your work on this. I have built the image from the dockerfile in your last commit.
Here is the full build output. However, when running the container like this:
And trying to run code from a Keras MNIST example, I get the following errors (in the container's stdout):
Also, running
Indeed, I checked that the directory. I tried adding. It seems to be looking for files for CUDA 10 (?). Running other gpu-enabled images works as expected:
Do you know what could be the cause? P.S. Sorry if this isn't the proper channel to ask this; I don't know where else to do so.
@Atralb I believe I might know what the problem is. However, as this PR is not merged, I am not yet sure what will happen with my repo. This also depends on what direction we go with the Kubeflow notebook images, which I hope could also serve as a general GPU image for others. To fix the problem with your dockerfile, try adding this code:
This PR adds NVIDIA based images to the stack that allow TensorFlow to use GPUs. It uses Ubuntu 18.04 due to an issue with installing libcudnn8, libnvinfer and libnvinfer-plugin on Focal. I based the apt packages on the official TF dockerfile. The code below might still need to be added to the tensorflow dockerfile, as it is missing compared to the official TensorFlow dockerfile, but I haven't run into issues without it while using PyTorch in the resulting TensorFlow notebook.
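A plausible candidate, assuming it refers to the libcuda stub block from the official TensorFlow GPU Dockerfile (reproduced as an assumption, not verified against this PR):

# Link the libcuda stub where TensorFlow searches for it and refresh
# the dynamic linker cache so the stub is found at build/import time.
RUN ln -s /usr/local/cuda/lib64/stubs/libcuda.so /usr/local/cuda/lib64/stubs/libcuda.so.1 && \
    echo "/usr/local/cuda/lib64/stubs" > /etc/ld.so.conf.d/z-cuda-stubs.conf && \
    ldconfig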