Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add nvidia based notebooks to stack #1196

Closed
wants to merge 7 commits into from
Closed
Show file tree
Hide file tree
Changes from 6 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,8 @@ ALL_STACKS:=base-notebook \
tensorflow-notebook \
datascience-notebook \
pyspark-notebook \
all-spark-notebook
all-spark-notebook \
tensorflow-notebook-gpu
endif

ALL_IMAGES:=$(ALL_STACKS)
Expand Down
2 changes: 2 additions & 0 deletions tensorflow-notebook-gpu/.dockerignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
# Documentation
README.md
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add hooks and test dir like in the PR #1201

81 changes: 81 additions & 0 deletions tensorflow-notebook-gpu/Dockerfile
Original file line number Diff line number Diff line change
@@ -0,0 +1,81 @@
# Copyright (c) Jupyter Development Team.
# Distributed under the terms of the Modified BSD License.
ARG BASE_CONTAINER=jupyter/scipy-notebook
FROM $BASE_CONTAINER

LABEL maintainer="Jupyter Project <jupyter@googlegroups.com>"

ARG CUDA=11.1
ARG CUDNN=8.0.4.30-1
ARG CUDNN_MAJOR_VERSION=8
ARG LIB_DIR_PREFIX=x86_64
ARG LIBNVINFER=7.2.1-1
ARG LIBNVINFER_MAJOR_VERSION=7

# Fix DL4006
SHELL ["/bin/bash", "-o", "pipefail", "-c"]

USER root

RUN apt-get update && apt-get install -y --no-install-recommends \
gnupg2 curl ca-certificates && \
curl -fsSL https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub | apt-key add - && \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could use wget that is already available instead of installing curl, it should be easier.

echo "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 /" > /etc/apt/sources.list.d/cuda.list && \
echo "deb https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 /" > /etc/apt/sources.list.d/nvidia-ml.list && \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use 20.04 for quite a while. And, as I see here, ubuntu2004 repo is available.
The same goes about other repos as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason the 18.04 repo is used is because not all the packages seem to be present in the 20.04 repo. I am using the tensorflow Dockerfiles as an example (which use 18.04) for this Dockerfile, and if I remember correctly the following package cannot be found on that repository.

libcudnn8=${CUDNN}+cuda${CUDA}
libnvinfer${LIBNVINFER_MAJOR_VERSION}=${LIBNVINFER}+cuda${CUDA}
libnvinfer-plugin${LIBNVINFER_MAJOR_VERSION}=${LIBNVINFER}+cuda${CUDA}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/libcudnn8_8.0.5.39-1+cuda11.1_amd64.deb exists.
Could you please check, if your info is up-to-date?

If something is missing, then I guess it's ok to use old repos.
But it's worth adding a comment with the packages missing in newer ubuntu version repo.

apt-get purge --autoremove -y curl \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is nice!
We should probably delete so called "build requirements" in other packages as well.
Could you create an issue please?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#1199
Actually, I just noticed this line is stupid here as on line 61 curl gets installed again haha.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But you can delete it there as well 👍

&& rm -rf /var/lib/apt/lists/*

ENV CUDA_VERSION 11.1.1

# For libraries in the cuda-compat-* package: https://docs.nvidia.com/cuda/eula/index.html#attachment-a
RUN apt-get update && apt-get install -y --no-install-recommends \
cuda-cudart-11-1=11.1.74-1 \
cuda-compat-11-1 \
&& ln -s cuda-11.1 /usr/local/cuda && \
Comment on lines +32 to +34
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's create an environment variable for the versions here or use the existing ones.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it then not also be a good idea to use environment variables for the tensorflow version?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think not, because it is not duplicated many times.
And creating variables for each conan/pip package would make some Dockerfiles harder to read.

rm -rf /var/lib/apt/lists/*

# Required for nvidia-docker v1
RUN echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf && \
echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf

ENV PATH /usr/local/nvidia/bin:/usr/local/cuda/bin:${PATH}
ENV LD_LIBRARY_PATH /usr/local/nvidia/lib:/usr/local/nvidia/lib64

# nvidia-container-runtime
ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility
ENV NVIDIA_REQUIRE_CUDA "cuda>=11.1 brand=tesla,driver>=418,driver<419 brand=tesla,driver>=440,driver<441 brand=tesla,driver>=450,driver<451"

# Install all OS dependencies for notebook server that starts but lacks all
# features (e.g., download as all possible file formats)
ENV DEBIAN_FRONTEND noninteractive
RUN apt-get update \
&& apt-get install -yq --no-install-recommends \
cuda-command-line-tools-${CUDA/./-} \
libcublas-${CUDA/./-} \
cuda-nvrtc-${CUDA/./-} \
libcufft-${CUDA/./-} \
libcurand-${CUDA/./-} \
libcusolver-${CUDA/./-} \
libcusparse-${CUDA/./-} \
curl \
libcudnn8=${CUDNN}+cuda${CUDA} \
libfreetype6-dev \
libhdf5-serial-dev \
libzmq3-dev \
pkg-config \
software-properties-common \
cm-super \
libnvinfer${LIBNVINFER_MAJOR_VERSION}=${LIBNVINFER}+cuda${CUDA} \
libnvinfer-plugin${LIBNVINFER_MAJOR_VERSION}=${LIBNVINFER}+cuda${CUDA} \
&& apt-get clean && rm -rf /var/lib/apt/lists/*

USER $NB_UID

WORKDIR $HOME

# Install Tensorflow
RUN pip install --quiet --no-cache-dir \
'tensorflow-gpu==2.3.1' && \
fix-permissions "${CONDA_DIR}" && \
fix-permissions "/home/${NB_USER}"
15 changes: 15 additions & 0 deletions tensorflow-notebook-gpu/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,15 @@
[![docker pulls](https://img.shields.io/docker/pulls/jupyter/tensorflow-notebook.svg)](https://hub.docker.com/r/jupyter/tensorflow-notebook/)
[![docker stars](https://img.shields.io/docker/stars/jupyter/tensorflow-notebook.svg)](https://hub.docker.com/r/jupyter/tensorflow-notebook/)
[![image metadata](https://images.microbadger.com/badges/image/jupyter/tensorflow-notebook.svg)](https://microbadger.com/images/jupyter/tensorflow-notebook "jupyter/tensorflow-notebook image metadata")

# Jupyter Notebook Deep Learning Stack

GitHub Actions in the https://github.com/jupyter/docker-stacks project builds and pushes this image
to Docker Hub.

Please visit the project documentation site for help using and contributing to this image and
others.

- [Jupyter Docker Stacks on ReadTheDocs](http://jupyter-docker-stacks.readthedocs.io/en/latest/index.html)
- [Selecting an Image :: Core Stacks :: jupyter/tensorflow-notebook](http://jupyter-docker-stacks.readthedocs.io/en/latest/using/selecting.html#jupyter-tensorflow-notebook)
- [Image Specifics :: Tensorflow](http://jupyter-docker-stacks.readthedocs.io/en/latest/using/specifics.html#tensorflow)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should definitely add, which GPUs are supported.
It should be either list of GPUs or some kind of requirements.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the main requirement is CUDA and NVIDIA drivers must be installed on the host. As for GPUs, I think anything released in the last 10 years will work. If I look at the cuda-gpus page the lowest entry is the GeForce GT 430. Arguably, I think anybody looking to use tensorflow with GPUs will have a more modern GPU and (hopefully) have spent the time to make a good decision what GPU to get for their ML work.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having first link with the comment about host driver is fine for me.

45 changes: 45 additions & 0 deletions tensorflow-notebook-gpu/hooks/run_hook
Original file line number Diff line number Diff line change
@@ -0,0 +1,45 @@
#!/bin/bash
set -e

# Apply tags
GIT_SHA_TAG=${GITHUB_SHA:0:12}
docker tag $IMAGE_NAME "$DOCKER_REPO:$GIT_SHA_TAG"

# Update index
INDEX_ROW="|\`${BUILD_TIMESTAMP}\`|\`jupyter/${IMAGE_SHORT_NAME}:${GIT_SHA_TAG}\`|[Git diff](https://github.com/jupyter/docker-stacks/commit/${GITHUB_SHA})<br />[Dockerfile](https://github.com/jupyter/docker-stacks/blob/${GITHUB_SHA}/${IMAGE_SHORT_NAME}/Dockerfile)<br />[Build manifest](./${IMAGE_SHORT_NAME}-${GIT_SHA_TAG})|"
sed "/|-|/a ${INDEX_ROW}" -i "${WIKI_PATH}/Home.md"

# Build manifest
MANIFEST_FILE="${WIKI_PATH}/manifests/${IMAGE_SHORT_NAME}-${GIT_SHA_TAG}.md"
mkdir -p $(dirname "$MANIFEST_FILE")

cat << EOF > "$MANIFEST_FILE"
* Build datetime: ${BUILD_TIMESTAMP}
* Docker image: ${DOCKER_REPO}:${GIT_SHA_TAG}
* Docker image size: $(docker images ${IMAGE_NAME} --format "{{.Size}}")
* Git commit SHA: [${GITHUB_SHA}](https://github.com/jupyter/docker-stacks/commit/${GITHUB_SHA})
* Git commit message:
\`\`\`
${COMMIT_MSG}
\`\`\`

## Python Packages

\`\`\`
$(docker run --rm ${IMAGE_NAME} python --version)
\`\`\`

\`\`\`
$(docker run --rm ${IMAGE_NAME} conda info)
\`\`\`

\`\`\`
$(docker run --rm ${IMAGE_NAME} conda list)
\`\`\`

## Apt Packages

\`\`\`
$(docker run --rm ${IMAGE_NAME} apt list --installed)
\`\`\`
EOF
Copy link
Member

@mathbunnyru mathbunnyru Dec 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shold probably add some gpu specific info here.
I think several images, which do have some specific stuff, have a little bit different run_hook files.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about this as well. Probably would be good to have an image tag with the CUDA version as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree.

30 changes: 30 additions & 0 deletions tensorflow-notebook-gpu/test/test_tensorflow.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,30 @@
# Copyright (c) Jupyter Development Team.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to check some gpu specific stuff here?

As far as I understand, this file is the same as tensorflow-notebook/test/test_tensorflow.py, so it might be worth to check something else.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I agree it would be good to test GPU stuff here, I'm not sure it will be possible to test functional stuff. For example, what I noticed when I created a non-functional GPU docker image is that nvidia-smi ran without a problem and I could see the GPU info, however, for CUDA version it would say N/A and as such the image was not usable. Doing a test like that here, or something like torch.cuda.is_available() in pytorch would be great, but it would require a GPU on the machine running the tests.
Do you have an idea of what type of tests could be done without requiring a GPU?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's tricky, yes.
To be honest, I'm not familiar with GPUs at all, so no ideas so far.

Copy link
Member

@mathbunnyru mathbunnyru Dec 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, I say we just run nvidia-smi, as you did and check that error code is zero.
That's not perfect, but better than nothing.

# Distributed under the terms of the Modified BSD License.
import logging

import pytest

LOGGER = logging.getLogger(__name__)


@pytest.mark.parametrize(
"name,command",
[
(
"Hello world",
"import tensorflow as tf;print(tf.constant('Hello, TensorFlow'))",
),
(
"Sum",
"import tensorflow as tf;print(tf.reduce_sum(tf.random.normal([1000, 1000])))",
),
],
)
def test_tensorflow(container, name, command):
"""Basic tensorflow tests"""
LOGGER.info(f"Testing tensorflow: {name} ...")
c = container.run(tty=True, command=["start.sh", "python", "-c", command])
rv = c.wait(timeout=30)
assert rv == 0 or rv["StatusCode"] == 0, f"Command {command} failed"
logs = c.logs(stdout=True).decode("utf-8")
LOGGER.debug(logs)