
Add nvidia based notebooks to stack #1196

Closed · wants to merge 7 commits

Conversation

@davidspek (Contributor)

This PR adds NVIDIA-based images to the stack that allow TensorFlow to use GPUs. It uses Ubuntu 18.04 due to an issue with installing libcudnn8, libnvinfer and libnvinfer-plugin on Focal. I based the apt packages on the official TensorFlow Dockerfile. The code below might still need to be added to the tensorflow Dockerfile, as it is missing compared to the official TensorFlow Dockerfile, but I haven't run into issues without it while using PyTorch in the resulting TensorFlow notebook.

# Install TensorRT if not building for PowerPC
RUN [[ "${ARCH}" = "ppc64le" ]] || { apt-get update && \
        apt-get install -y --no-install-recommends libnvinfer${LIBNVINFER_MAJOR_VERSION}=${LIBNVINFER}+cuda${CUDA} \
        libnvinfer-plugin${LIBNVINFER_MAJOR_VERSION}=${LIBNVINFER}+cuda${CUDA} \
        && apt-get clean \
        && rm -rf /var/lib/apt/lists/*; }

# For CUDA profiling, TensorFlow requires CUPTI.
ENV LD_LIBRARY_PATH /usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:$LD_LIBRARY_PATH

# Link the libcuda stub to the location where tensorflow is searching for it and reconfigure
# dynamic linker run-time bindings
RUN ln -s /usr/local/cuda/lib64/stubs/libcuda.so /usr/local/cuda/lib64/stubs/libcuda.so.1 \
    && echo "/usr/local/cuda/lib64/stubs" > /etc/ld.so.conf.d/z-cuda-stubs.conf \
    && ldconfig

@davidspek (Contributor Author)

Not sure the failing test is correct, as the run_hook was copied from the regular scipy image, which also doesn't have an empty line at the end of the file.

@Bidek56 (Contributor) commented Dec 7, 2020

@davidspek Do you know how big the *-nvidia images are? In the past we had issues with large images that took a long time to build.
Adding 4 new notebooks to the stacks is a big commitment for the maintainers.

@davidspek (Contributor Author)

@Bidek56 I already created my own repo using the same GitHub Actions for testing. Seeing as the only difference between the minimal and scipy notebooks is the base image, and for tensorflow the base image plus -gpu appended to the tensorflow package, it might be possible to incorporate those variables in the image-building script. Then it would only require maintaining base-notebook and the few extra dependency version numbers compared to the regular base-notebook. Personally, I think the GPU version of tensorflow is pretty essential, and people would like an "official" jupyter image for it that can then be used as a base image. I work with Kubeflow, and given the outdated state of those images I build all my own using the jupyter stack images (a notebook working group has been formed to improve the situation). These GPU images could then also serve as a good basis for Kubeflow users (who just need to change the CMD to CMD ["sh","-c", "jupyter lab --notebook-dir=/home/${NB_USER} --ip=0.0.0.0 --no-browser --allow-root --port=8888 --NotebookApp.token='' --NotebookApp.password='' --NotebookApp.allow_origin='*' --NotebookApp.base_url=${NB_PREFIX}"]).
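As a rough sketch of what I mean (the BASE_IMAGE build argument is hypothetical, not something the current build script supports):

# Rough sketch: one Dockerfile serves both the CPU and GPU variants by
# parameterizing the parent image at build time
ARG BASE_IMAGE=jupyter/base-notebook
FROM ${BASE_IMAGE}
# ... the rest of the minimal-notebook steps stay unchanged ...

# The GPU flavour would then be built by overriding the argument:
#   docker build --build-arg BASE_IMAGE=jupyter/base-notebook-nvidia -t jupyter/minimal-notebook-nvidia .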

base-notebook is 2.82 GB
minimal-notebook is 2.94 GB
scipy-notebook is 3.14 GB
tensorflow-notebook is 3.49 GB

@Bidek56 (Contributor) commented Dec 7, 2020

@davidspek Do you know how big the *-nvidia images are? Thx

@parente (Member) commented Dec 7, 2020

@davidspek Thank you for sharing your work! We've been requesting that all new image definitions take the community images path, primarily because this repository is unwieldy and we're already finding it difficult to maintain the images that already exist here. It sounds like you already have a nice project setup for building the images. Would you be open to submitting a pull request linking to it from the documentation instead of this PR?

We've also discussed including closed source software in the images maintained here in the past and decided against it (#706, #516). While I've heard the CUDA EULA has become more lenient, I believe it's still closed-source.

@parente added the tag:Community Stack label (We recommend this contribution become a community stack) on Dec 7, 2020
@davidspek (Contributor Author)

@Bidek56 Sorry for the confusing naming scheme above; I had not changed the names of the images to *-nvidia, but those sizes are for the nvidia-based images.

@parente Thank you for informing me about the community image path. While I understand the reasoning behind it, I personally think an official source for GPU based images would be handy. There already is a community image path for GPU docker images, but these are somewhat outdated and I think users are more likely to submit pull requests for the official jupyter repository. Regarding the maintenance, would it be manageable to have the base-notebook-nvidia added and have the other images edited with a script? Any changes that happen to the minimal, scipy, and tensorflow notebook Dockerfiles would be the same for the nvidia images.

Regarding the closed source software discussion. While CUDA itself is closed source, there should be no issue with having these dockerfiles and the resulting images on dockerhub. As discussed here:

The docker scripts contained in this repo are licensed BSD 3-Clause. You can publish Dockerfiles but the license must accompany the modified scripts.
You are allowed to publish the images to Docker Hub and nvcr.io (NGC). Publishing to other public/private image repositories is not allowed.
The NVIDIA CUDA Driver Libraries are only distributable in applications that meet this criteria:



The application was developed starting from a NVIDIA CUDA container obtained from Docker Hub or the NVIDIA GPU Cloud, and
The resulting application is packaged as a Docker container and distributed to users on Docker Hub or the NVIDIA GPU Cloud only.

It seems the only thing that would need to be done is to include https://docs.nvidia.com/cuda/eula/index.html and https://gitlab.com/nvidia/container-images/cuda/blob/master/LICENSE with the Dockerfile (and maybe also in the image itself), which is an easy thing to add. Also, given that I based base-notebook-nvidia on the official tensorflow Dockerfile, I would not expect any licensing issues (though I do think the name should be changed from *-nvidia to *-gpu because of the license).
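As a rough sketch (the local file names here are hypothetical, assuming the two license texts get checked into the repo next to the Dockerfile):

# Rough sketch: ship the NVIDIA EULA and the CUDA container LICENSE inside the image
COPY cuda-eula.txt cuda-license.txt /usr/share/licenses/cuda/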

Given all this, would it be possible to add official GPU images using the nvidia base image? What would the requirements be for adding these images?

@Bidek56 (Contributor) commented Dec 8, 2020

@davidspek
The difference between the existing Dockerfiles and the *-gpu ones is just the base image; the design pattern for the *-gpu Dockerfiles makes no sense, it's not maintainable, and it's bad practice to have copies of the same code.

One idea, (legal stuff aside) is to add the CUDA libraries at the end with a single code base.

I am not sure where you are getting the image sizes from, but when I build them, the GPU images are 10x the size:

REPOSITORY                 TAG     CREATED              SIZE
jupyter/base-notebook      latest  About a minute ago   639MB
jupyter/base-notebook-gpu  latest  8 minutes ago        6.39GB

@davidspek (Contributor Author)

@Bidek56 I agree that it would be better to handle the downstream images differently, as only the base image needs to be changed (except for the tensorflow Dockerfile, since the GPU version of TF needs to be installed). This is also what I suggested doing above, and I wanted some input on how best to achieve it before spending time figuring it out.

Regarding adding the CUDA libraries at the end: I haven't had success with just installing the apt packages; there seem to be some other prerequisites in the CUDA base image that also need to be installed. I could look into that, but it brings with it the maintenance of whatever goes on in the CUDA base image, rather than using the image directly.

I was grabbing the sizes from Docker Hub as that was what I had at hand at the time. Given the large amount of software that is installed, I am not surprised that the image size is 6.39 GB, but that seems fairly unavoidable.

Another option would be to add just a tensorflow-notebook-gpu which uses the CUDA image as a base and installs everything from the minimal and scipy images directly as well. But this would increase build times if, for example, the TensorFlow version is updated. It also wouldn't allow for a hypothetical PyTorch notebook that uses scipy-notebook-gpu as a base.

@Bidek56 (Contributor) commented Dec 8, 2020

@davidspek I think you are spending a lot of time on these images, but a few GPU images already exist: joehoeller, rlan.

@davidspek (Contributor Author) commented Dec 8, 2020

@Bidek56 I know images exist, but as you can see, the repo from joehoeller pulls from the repo from rlan, and the rlan base images are not kept very up to date. Also, I don't think there will be as much of a community effort to keep those repos up to date as there is for the official jupyter-stacks repo. Furthermore, with the security of docker images frequently in the news lately, it is important to use images from a trusted source. With the two repos you mention, there are multiple hops through other maintainers before you get to the actual base image that is used.

I have just added a tensorflow-notebook-gpu-test which uses the existing scipy notebook and adds the necessary steps from the CUDA and official tensorflow images. It actually seems the CUDA image does little more than add some apt repositories to be able to install the CUDA tools and such, so there is little more to maintain compared to using nvidia/cuda as a base image. What do you think of this? Locally the image shows as 8.59GB, btw.

@Bidek56 (Contributor) commented Dec 8, 2020

I think extending the existing images like christianscharr or radix-ai should be sufficient.

@davidspek (Contributor Author)

Do you mean using those as a base image? While their Dockerfiles might always pull the most recent version of jupyter/scipy, the images on Docker Hub (if they are on there, as I don't see a link) will not be up to date unless they manually build and push a new image for each update to the official jupyter stack. Also, that doesn't address the community effort to update packages, or docker image security (you are still pulling from an unknown entity; who says the Dockerfile on GitHub matches the actual image?).

tensorflow-notebook-gpu-test essentially does the same as their dockerfiles, but up to date. It will also have the benefit of the automated tagging scheme, updates whenever the underlying stack is updated, and the build history in the wiki.

@snickell

I have no idea, but I suspect a majority of jupyter users at this point would use GPU access if they had it?

"Scientists / data scientists that don't have flows / libraries which can benefit from GPU acceleration" is a shrinking set.

That said, it /is/ a horrifying package size just to add "gpu drivers", which makes all aspects of the build system more painful. We're very much experiencing that with our own jupyter-notebook+nvidia builds.

@davidspek (Contributor Author)

@snickell This might be my lack of experience talking, but what exactly makes the build system more painful just because of image size? I'm happy to try and find a solution if I know what the problems are.

@Bidek56 (Contributor) commented Dec 14, 2020

@davidspek The problems are:

  1. Your PR duplicates code, which is a maintenance headache
  2. Large image sizes have resulted in time-outs in the build process in the past

I would suggest taking the community images path after you figure out how to add the CUDA libraries to each of the Dockerfiles.
Thanks

@davidspek (Contributor Author)

@Bidek56
As we discussed above, to minimize code duplication I added a Dockerfile that adds the GPU repositories and drivers at the same stage as installing tensorflow. To make the PR clearer, I have removed the previous method, which used the NVIDIA base image. At this point there is no duplication of code.
Regarding the time-out during the build process, do you have log files or information about what exactly caused it?

@Bidek56 (Contributor) commented Dec 14, 2020

I see that you changed the PR to just one extra image?
If we knew what was causing the build failures, we would have fixed it, but the only solution @parente found was to reduce the size of the images. Granted, back then we were using Travis CI; now with GH Actions it should be better, but I would not be surprised if GH puts a time or size limit on the builds any day now.

@davidspek (Contributor Author)

@Bidek56 Indeed, it is just one extra image now. It can't use tensorflow-notebook as a base, as tensorflow-gpu needs to be installed. Unless further increasing the image size by installing tensorflow-gpu on top of tensorflow-cpu, or uninstalling tensorflow-cpu and then installing tensorflow-gpu during the build, is acceptable.

If the issue is a time limit, I can see that what costs by far the most time is resolving conda packages. A potential fix is to have conda install mamba in the base or minimal notebook (or whichever notebook installs conda) and have mamba resolve the later packages. This should speed up package resolution and, I believe, still allow users to use conda to install packages in a notebook.
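A rough sketch of what that could look like (the package list is just illustrative):

# Rough sketch: install mamba once, then use its faster solver for the heavy installs
RUN conda install --quiet --yes -c conda-forge mamba && \
    mamba install --quiet --yes tensorflow-gpu && \
    conda clean --all -f -y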

@davidspek (Contributor Author)

@Bidek56 The build has passed both on my own repo and for the check here. The build time seems to be about 10 minutes for the image as it stands. Does that seem acceptable?

@Bidek56 (Contributor) commented Dec 14, 2020

GH Actions says: 1h 0m 10s vs. ~40m for the other images. @romainx what do you think?

@davidspek (Contributor Author)

I can still try to optimize the image further, both in build time and size, and I will do some final testing with the image to make sure the GPU is properly initialized. But first I'd like to discuss in what way and with what requirements it would be possible to add this. Honestly, I think most people using tensorflow are probably using it with a GPU and it would be nice to be able to use the tagging infrastructure and trusted source of the official repo for people to use as base images.

@mathbunnyru (Member) left a comment


Great job!

I will not comment on the question of merging or not merging this into the official docker images, because I'm not a maintainer and I do not know all the policies here.

But I made some comments, which I hope will make your image a bit better.
And I know nothing about GPUs, sorry for that :)

Comment on lines 22 to 24
curl -fsSL https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub | apt-key add - && \
echo "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 /" > /etc/apt/sources.list.d/cuda.list && \
echo "deb https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 /" > /etc/apt/sources.list.d/nvidia-ml.list && \
@mathbunnyru (Member)


We have been using 20.04 for quite a while. And, as I see here, an ubuntu2004 repo is available.
The same goes for the other repos as well.

@davidspek (Contributor Author)


The reason the 18.04 repo is used is that not all the packages seem to be present in the 20.04 repo. I am using the tensorflow Dockerfiles (which use 18.04) as an example for this Dockerfile, and if I remember correctly the following packages cannot be found in that repository.

libcudnn8=${CUDNN}+cuda${CUDA}
libnvinfer${LIBNVINFER_MAJOR_VERSION}=${LIBNVINFER}+cuda${CUDA}
libnvinfer-plugin${LIBNVINFER_MAJOR_VERSION}=${LIBNVINFER}+cuda${CUDA}

@mathbunnyru (Member)


At least https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/libcudnn8_8.0.5.39-1+cuda11.1_amd64.deb exists.
Could you please check whether your info is up-to-date?

If something is missing, then I guess it's ok to use the old repos.
But it's worth adding a comment listing the packages missing in the newer ubuntu repo.

curl -fsSL https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub | apt-key add - && \
echo "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 /" > /etc/apt/sources.list.d/cuda.list && \
echo "deb https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 /" > /etc/apt/sources.list.d/nvidia-ml.list && \
apt-get purge --autoremove -y curl \
@mathbunnyru (Member)


This is nice!
We should probably delete so-called "build requirements" in the other packages as well.
Could you create an issue, please?

@davidspek (Contributor Author)


#1199
Actually, I just noticed this line is stupid here, as curl gets installed again on line 61 haha.

@mathbunnyru (Member)


But you can delete it there as well 👍

Comment on lines +32 to +34
cuda-cudart-11-1=11.1.74-1 \
cuda-compat-11-1 \
&& ln -s cuda-11.1 /usr/local/cuda && \
@mathbunnyru (Member)


Let's create an environment variable for the versions here or use the existing ones.
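Something like this, as a rough sketch (the variable names are made up):

# Rough sketch: define the pinned versions once
ENV CUDA_VERSION=11.1 \
    CUDA_CUDART_VERSION=11.1.74-1
RUN apt-get update && apt-get install -y --no-install-recommends \
        cuda-cudart-11-1=${CUDA_CUDART_VERSION} \
        cuda-compat-11-1 \
    && ln -s cuda-${CUDA_VERSION} /usr/local/cuda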

@davidspek (Contributor Author)


Would it then not also be a good idea to use environment variables for the tensorflow version?

@mathbunnyru (Member)


I think not, because it is not duplicated many times.
And creating variables for each conda/pip package would make some Dockerfiles harder to read.

```
$(docker run --rm ${IMAGE_NAME} apt list --installed)
```
EOF
@mathbunnyru (Member) Dec 14, 2020


We should probably add some gpu-specific info here.
I think several images which do have some specific stuff have slightly different run_hook files.

@davidspek (Contributor Author)


I was thinking about this as well. It would probably be good to have an image tag with the CUDA version as well.

@mathbunnyru (Member)


Agree.

@@ -0,0 +1,30 @@
# Copyright (c) Jupyter Development Team.
@mathbunnyru (Member)


Is it possible to check some gpu-specific stuff here?

As far as I understand, this file is the same as tensorflow-notebook/test/test_tensorflow.py, so it might be worth checking something else.

@davidspek (Contributor Author)


While I agree it would be good to test GPU stuff here, I'm not sure it will be possible to test functional stuff. For example, what I noticed when I created a non-functional GPU docker image is that nvidia-smi ran without a problem and I could see the GPU info; however, the CUDA version showed as N/A, and as such the image was not usable. Doing a test like that here, or something like torch.cuda.is_available() in pytorch, would be great, but it would require a GPU on the machine running the tests.
Do you have an idea of what type of tests could be done without requiring a GPU?

@mathbunnyru (Member)


That's tricky, yes.
To be honest, I'm not familiar with GPUs at all, so no ideas so far.

@mathbunnyru (Member) Dec 15, 2020


For now, I say we just run nvidia-smi, as you did, and check that the exit code is zero.
That's not perfect, but better than nothing.
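A rough sketch of such a test (the image name and the --gpus flag are assumptions on my part, and it can only pass on a host with an NVIDIA GPU and the container toolkit):

import subprocess

def test_nvidia_smi():
    # Rough sketch: run nvidia-smi inside the (hypothetical) GPU image
    # and require a zero exit code. This fails on hosts without an
    # NVIDIA GPU, so it must not be part of the default CI run.
    result = subprocess.run(
        ["docker", "run", "--rm", "--gpus", "all",
         "jupyter/tensorflow-notebook-gpu", "nvidia-smi"],
        capture_output=True,
    )
    assert result.returncode == 0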


- [Jupyter Docker Stacks on ReadTheDocs](http://jupyter-docker-stacks.readthedocs.io/en/latest/index.html)
- [Selecting an Image :: Core Stacks :: jupyter/tensorflow-notebook](http://jupyter-docker-stacks.readthedocs.io/en/latest/using/selecting.html#jupyter-tensorflow-notebook)
- [Image Specifics :: Tensorflow](http://jupyter-docker-stacks.readthedocs.io/en/latest/using/specifics.html#tensorflow)
@mathbunnyru (Member)


We should definitely add which GPUs are supported.
It should be either a list of GPUs or some kind of requirements.

@davidspek (Contributor Author)


I think the main requirement is that CUDA and the NVIDIA drivers must be installed on the host. As for GPUs, I think anything released in the last 10 years will work; if I look at the cuda-gpus page, the lowest entry is the GeForce GT 430. Arguably, anybody looking to use tensorflow with GPUs will have a more modern GPU and (hopefully) will have spent the time to make a good decision about which GPU to get for their ML work.

@mathbunnyru (Member)


Having the first link with the comment about the host driver is fine for me.

@davidspek (Contributor Author)

Regarding the image size, I have used Dive (an amazing tool I just discovered) to try to figure out how to reduce it. The addition of CUDA, cuDNN, libnvinfer (TensorRT), etc. is 5.2 GB. Exploring the actual files added in the layer, 5.1 GB comes from the packages mentioned earlier. Looking even deeper, that size is made up mostly of a few packages which seem to be essential to CUDA, cuDNN and TensorRT. Given that I used the official tensorflow image as a template for this image, I am not sure what parts of the installation can be omitted, but I also wonder whether removing something such as TensorRT (which is not strictly necessary for using GPUs, but is for using the tensor cores, I believe) is worth it, as it makes the image less useful.

@mathbunnyru (Member)

Dive is an amazing tool indeed. I also used it to optimize docker images a while ago.

@romainx (Collaborator) commented Dec 21, 2020

Hello,

I'm jumping in the conversation just to summarize my understanding of the situation regarding this PR.

  • The closed-source problem and the NVIDIA CUDA EULA mentioned by @parente:
    This point needs to be checked carefully and documented by someone with a good understanding of this topic.
  • Host / hardware requirements:
    We have to document those requirements like in the TF page, or put a link to that page as suggested by @mathbunnyru.
    We also need a simple way to check that the requirements are met, like a test; without that we may have to deal with issues caused by hardware prerequisites. The suggestion from @mathbunnyru of using nvidia-smi is good and should be enough.
  • Image size: this point cannot be improved beyond what you've already done (a separate image and efforts to reduce the size as much as we can).

I don't have a clear opinion on this subject. However, I'm guessing demand for this kind of out-of-the-box setup is growing, so it might make sense to take a step in that direction if everything is clear and aligned with Jupyter's policy. Sorry it's not a clear black or white answer, but I don't know the subject well enough.

@romainx (Collaborator) left a comment


Hello,

Thanks for this contribution. A great review has already been done by @mathbunnyru; I've nothing to add except some minor points. For how to go further with this PR, see my comment.

Best.

@@ -0,0 +1,2 @@
# Documentation
README.md
@romainx (Collaborator)


Add hooks and test dir like in PR #1201


RUN apt-get update && apt-get install -y --no-install-recommends \
gnupg2 curl ca-certificates && \
curl -fsSL https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub | apt-key add - && \
@romainx (Collaborator)


You could use wget, which is already available, instead of installing curl; it should be easier.

@davidspek (Contributor Author)

@romainx Thank you for your comments. At the moment I am busy with a kubernetes cluster migration, but I hope to have some time to look into your points soon. Regarding the use of nvidia-smi for testing, I assume that is while building the images; in that case, I don't think the GitHub action will have access to an NVIDIA GPU, which would make that form of testing difficult.

@romainx (Collaborator) commented Dec 21, 2020

@davidspek you're right, but we could decorate it with a special @pytest.mark.info marker, like it's done here.

@pytest.mark.info
def test_outdated_packages(container, specifications_only=True):

This marker is skipped during normal test execution (pytest -m "not info" test) but can be run if needed. It's a way to have the test ready in the code.

@parente (Member) commented Dec 26, 2020

@romainx Thanks for your summary and @mathbunnyru for contributing a review. I'd like to provide a bit more background on one of the points above and add another.

@romainx said:

The closed-source problem and the NVIDIA CUDA EULA mentioned by @parente:
This point needs to be checked carefully and documented by someone with a good understanding of this topic.

The last we, meaning various Project Jupyter maintainers, discussed this subject, we decided against distributing non-open source software as a matter of principle, not because of the NVIDIA EULA per se: #516 (comment)

If there's interest and an argument for revisiting that decision, we should find an appropriate venue to make the decision-making process more transparent than the last time, perhaps relying on jupyter/governance#87.

I mentioned:

We've been requesting that all new image definitions take the community images path, primarily because this repository is unwieldy and we're already finding it difficult to maintain the images that already exist here.

I've personally viewed the images here as being in "maintenance mode" for some time now: sustainable as-is but not something worth extending. My comment (jupyterhub/mybinder.org-deploy#1474 (comment)) on an issue in the Binder project explains why I view the project this way. My recommendation for making this PR a separate community stack stems from this perspective and experience. (For what it's worth, @minrk's take is similar: jupyterhub/mybinder.org-deploy#1474 (comment))

All of this is my opinion. If people feel strongly that this project has a future beyond the status quo, want to revisit the de facto approach of recommending community stacks, and are ready to take on the mantle of maintenance and extension here, I don't want to be a blocker. In full disclosure, I have not used the stacks images for at least 4 years now, and it's probably past time for me to move on.

@romainx (Collaborator) commented Dec 27, 2020

Thank you @parente for your -- as always -- wise, meaningful and very carefully written comment.

According to this comment -- and I share the same opinion on this point -- the closed-source problem is a show stopper, and the conclusion is to take the community images path. That said, despite all this valuable work -- thanks again @davidspek -- we will not merge this PR into this repo. We hope that the work done here and the reviews (thanks again @mathbunnyru) will be valuable and will contribute to improving the outcome.


On the last part, which more generally concerns the objectives and the future of this project, I would like to give my vision which is a little different.

I've personally viewed the images here as being in "maintenance mode" for some time now: sustainable as-is but not something worth extending

I agree they are sustainable as-is but not something worth extending. Still, we did a lot of work to make the images well written, linted, tested and documented. The proof is that the images are up to date. I think the main blocker now -- even if it's quite robust -- is the build, which is too monolithic. If we add additional stacks, the build will grow and become error-prone, and it will become a pain to publish all the stacks. The build takes between 40 min and 1 hour to complete, and if a single thing fails (a test, a package installation) everything fails. I've created an issue for that: #1203.

My conviction is that a lot of people are happy with these stacks:

  • We have almost 6k stars on the repo and counting
  • We do not have that many issues to deal with
  • The stack seems to be well used on Docker Hub, see below

Based on the notebook shared in #693, I've made a -- quick and dirty -- comparison showing the average number of pulls per day for each image; it seems obvious that they are used:

[chart: average number of pulls per day for each stack image]

So I will be happy to continue my effort to maintain and improve the stack in a reasonable and measured way, as we have always tried to do. @parente I hope you will be a part of it, because you have done so much to achieve this success -- because despite some negative points you highlight, it is a success -- and because it is really a pleasure to collaborate with you and learn from you 😍.

@romainx closed this Dec 27, 2020
@parente (Member) commented Dec 29, 2020

@romainx I don't want to eject and leave you as the sole maintainer here either. I will 100% support you with reviews and merges when @-tagged for as long as I can. I just don't think I'm a good source for inspiration on "what comes next?" as a non-user.

On that note, we should definitely keep eyes on expanding the pool of maintainers, especially if you intend to grow the number of stacks and architectures through build and automation improvements.

@romainx (Collaborator) commented Dec 30, 2020

@parente cool, this is very good news! I am relieved 😌.

On that note, we should definitely keep eyes on expanding the pool of maintainers, especially if you intend to grow the number of stacks and architectures through build and automation improvements.

I'm 100% for expanding the pool of maintainers, but my intention is not to grow the number of stacks by changing the orientation of the community stack path. I agree that we cannot maintain notebooks here for a bunch of different technologies; I think we need to stick with this core stack. But even within this scope, some extensions could make sense: supporting other CPU architectures or providing light Sparkmagic images to connect to existing Spark clusters.

@davidspek (Contributor Author)

@romainx I have no experience with being a maintainer on any project, but what type of work/commitment does it take and what are the responsibilities? I am mostly busy with Kubeflow, and the Notebooks working group is just getting started, so this PR is potentially useful there instead. However, depending on how things go, it might be an idea to further discuss an "official" tensorflow GPU notebook repo, if you are open to it.

@romainx (Collaborator) commented Jan 2, 2021

@davidspek Like I said, I still think that a tensorflow GPU notebook is a good idea -- it seems to be confirmed by the pull figures: the tensorflow-notebook image is pulled ~25k times / day. If your intention is to set up a community stack for this, I can give you a hand. We have some preliminary work to perform on the cookie-cutter (#1139); it could be an opportunity to do so.

@Atralb commented Jan 19, 2021

@davidspek Hi there, first of all I want to thank you a lot for your work on this.

I have built the image from the dockerfile in your last commit

sudo docker build -t test .

Here is the full build output

However, when running the container, like this:

sudo nvidia-docker run --rm -it -p 8888:8888 --name test test

And trying to run code from a Keras MNIST example, I get the following errors (in the container's stdout):

2021-01-19 03:23:00.528412: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-01-19 03:23:00.528438: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-01-19 03:23:01.507122: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2021-01-19 03:23:01.552140: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-19 03:23:01.552726: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: 
pciBusID: 0000:42:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2021-01-19 03:23:01.552805: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-01-19 03:23:01.552921: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcublas.so.10'; dlerror: libcublas.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-01-19 03:23:01.553650: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2021-01-19 03:23:01.553856: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2021-01-19 03:23:01.553951: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcusolver.so.10'; dlerror: libcusolver.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-01-19 03:23:01.554028: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcusparse.so.10'; dlerror: libcusparse.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-01-19 03:23:01.554102: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-01-19 03:23:01.554114: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1753] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2021-01-19 03:23:02.362750: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-01-19 03:23:02.369782: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 3499605000 Hz
2021-01-19 03:23:02.370729: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x558eafa33890 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-01-19 03:23:02.370771: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2021-01-19 03:23:02.372324: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-01-19 03:23:02.372352: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263]      

And the model.fit is run on the CPU.

Also, running tf.test.is_gpu_available() in a notebook cell prints False and gives these errors in the container stdout:

2021-01-19 03:25:58.263465: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-19 03:25:58.264526: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: 
pciBusID: 0000:42:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2021-01-19 03:25:58.264708: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-01-19 03:25:58.264810: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcublas.so.10'; dlerror: libcublas.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-01-19 03:25:58.264856: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2021-01-19 03:25:58.264893: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2021-01-19 03:25:58.264994: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcusolver.so.10'; dlerror: libcusolver.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-01-19 03:25:58.265088: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcusparse.so.10'; dlerror: libcusparse.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-01-19 03:25:58.265187: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-01-19 03:25:58.265209: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1753] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2021-01-19 03:25:58.265235: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-01-19 03:25:58.265247: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263]      0 
2021-01-19 03:25:58.265263: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0:   N 

Indeed, I checked that the directory /usr/local/nvidia doesn't even exist inside the container.

I tried adding /usr/local/cuda-11.1/targets/x86_64-linux/lib/ to LD_LIBRARY_PATH which didn't solve the issue.

It seems to be looking for files for CUDA 10 (?).

Running other gpu-enabled images works as expected:

sudo nvidia-docker run --rm -it nvidia/cuda:11.2.0-base nvidia-smi
       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:42:00.0  On |                  N/A |
|  0%   55C    P8    21W / 250W |    207MiB / 11016MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Do you know what could be the cause?

PS: Sorry if this isn't the proper channel to ask this, I don't know where else to do so.

@davidspek (Contributor Author)

@Atralb I believe I might know what the problem is. However, as this PR is not merged, I am not yet sure what will happen with my repo. It also depends on what direction we go with the Kubeflow notebook images, which I hope could also serve as a general GPU image for others. To fix the problem with your dockerfile, try adding this code:

# For CUDA profiling, TensorFlow requires CUPTI.
ENV LD_LIBRARY_PATH /usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:$LD_LIBRARY_PATH

# Link the libcuda stub to the location where tensorflow is searching for it and reconfigure
# dynamic linker run-time bindings
RUN ln -s /usr/local/cuda/lib64/stubs/libcuda.so /usr/local/cuda/lib64/stubs/libcuda.so.1 \
    && echo "/usr/local/cuda/lib64/stubs" > /etc/ld.so.conf.d/z-cuda-stubs.conf \
    && ldconfig
