[Docs] simplify cuda guide #669

Merged
merged 1 commit on Jul 9, 2021
61 changes: 23 additions & 38 deletions docs/usage/guides/cuda.md
@@ -11,22 +11,15 @@ To get the NVIDIA container runtime in the K3s image you need to build your own
The native K3s image is based on Alpine but the NVIDIA container runtime is not supported on Alpine yet.
To get around this we need to build the image with a supported base image.

### Dockerfiles

[Dockerfile.base](cuda/Dockerfile.base):
### Dockerfile

```Dockerfile
{% include "cuda/Dockerfile.base" %}

```

[Dockerfile.k3d-gpu](cuda/Dockerfile.k3d-gpu):
[Dockerfile](cuda/Dockerfile):

```Dockerfile
{% include "cuda/Dockerfile.k3d-gpu" %}
{% include "cuda/Dockerfile" %}
```

These Dockerfiles are based on the [K3s Dockerfile](https://github.com/rancher/k3s/blob/master/package/Dockerfile)
This Dockerfile is based on the [K3s Dockerfile](https://github.com/rancher/k3s/blob/master/package/Dockerfile).
The following changes are applied:

1. Change the base images to nvidia/cuda:11.2.0-base-ubuntu18.04 so the NVIDIA Container Runtime can be installed. The version of `cuda:xx.x.x` must match the one you're planning to use.
@@ -50,7 +43,7 @@ To enable NVIDIA GPU support on Kubernetes you also need to install the [NVIDIA
* Run GPU enabled containers in your Kubernetes cluster.

```yaml
{% include "cuda/gpu.yaml" %}
{% include "cuda/device-plugin-daemonset.yaml" %}
```

### Build the K3s image
@@ -59,57 +52,49 @@ To build the custom image we need to build K3s because we need the generated out

Put the following files in a directory:

* [Dockerfile.base](cuda/Dockerfile.base)
* [Dockerfile.k3d-gpu](cuda/Dockerfile.k3d-gpu)
* [Dockerfile](cuda/Dockerfile)
* [config.toml.tmpl](cuda/config.toml.tmpl)
* [gpu.yaml](cuda/gpu.yaml)
* [device-plugin-daemonset.yaml](cuda/device-plugin-daemonset.yaml)
* [build.sh](cuda/build.sh)
* [cuda-vector-add.yaml](cuda/cuda-vector-add.yaml)

The `build.sh` script is configured using exports & defaults to `v1.21.2+k3s1`. Please set your CI_REGISTRY_IMAGE! The script performs the following steps:

* pulls K3s
* builds K3s
* build the custom K3D Docker image

The resulting image is tagged as k3s-gpu:<version tag>. The version tag is the git tag but the '+' sign is replaced with a '-'.
The `build.sh` script is configured using exports & defaults to `v1.21.2+k3s1`. Please set at least the `IMAGE_REGISTRY` variable! The script builds the custom K3s image including the NVIDIA container runtime.

[build.sh](cuda/build.sh):

```bash
{% include "cuda/build.sh" %}
```
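
For example, one possible invocation (not part of the original guide) overrides the defaults via environment variables; `registry.example.com/yourname` is only a placeholder for your own registry:

```bash
# Override the script's defaults via environment variables before running it
IMAGE_REGISTRY=registry.example.com/yourname \
K3S_TAG=v1.21.2-k3s1 \
bash build.sh
```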

## Run and test the custom image with Docker
## Run and test the custom image with k3d

You can run a container based on the new image with Docker:
You can use the image with k3d:

```bash
docker run --name k3s-gpu -d --privileged --gpus all $CI_REGISTRY_IMAGE:$IMAGE_TAG
k3d cluster create gputest --image=$IMAGE --gpus=1
```
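
As an optional sanity check (not part of the original guide), you can confirm that the device plugin came up and that the node advertises the `nvidia.com/gpu` resource:

```bash
# The device plugin is auto-deployed from the manifests baked into the image
kubectl get pods -n kube-system | grep -i nvidia
# The node should list nvidia.com/gpu under its allocatable resources
kubectl describe nodes | grep -i "nvidia.com/gpu"
```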

Deploy a [test pod](cuda/cuda-vector-add.yaml):

```bash
docker cp cuda-vector-add.yaml k3s-gpu:/cuda-vector-add.yaml
docker exec k3s-gpu kubectl apply -f /cuda-vector-add.yaml
docker exec k3s-gpu kubectl logs cuda-vector-add
kubectl apply -f cuda-vector-add.yaml
kubectl logs cuda-vector-add
```

## Run and test the custom image with k3d

You can use the image with k3d:
This should output something like the following:

```bash
k3d cluster create local --image=$CI_REGISTRY_IMAGE:$IMAGE_TAG --gpus=1
$ kubectl logs cuda-vector-add

[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
```

Deploy a [test pod](cuda/cuda-vector-add.yaml):

```bash
kubectl apply -f cuda-vector-add.yaml
kubectl logs cuda-vector-add
```
If the `cuda-vector-add` pod is stuck in `Pending` state, the device-plugin daemonset probably didn't get deployed correctly from the auto-deploy manifests. In that case, you can apply it manually via `#!bash kubectl apply -f device-plugin-daemonset.yaml`.
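
To narrow down what went wrong, a couple of generic kubectl checks (not part of the original guide) can help:

```bash
# Shows scheduling events, e.g. "Insufficient nvidia.com/gpu"
kubectl describe pod cuda-vector-add
# Verify that the device plugin daemonset exists and its pod is running
kubectl get daemonsets,pods -n kube-system | grep -i nvidia
```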

## Known issues

47 changes: 47 additions & 0 deletions docs/usage/guides/cuda/Dockerfile
@@ -0,0 +1,47 @@
ARG K3S_TAG="v1.21.2-k3s1"
FROM rancher/k3s:$K3S_TAG as k3s

FROM nvidia/cuda:11.2.0-base-ubuntu18.04

ARG NVIDIA_CONTAINER_RUNTIME_VERSION
ENV NVIDIA_CONTAINER_RUNTIME_VERSION=$NVIDIA_CONTAINER_RUNTIME_VERSION

RUN echo 'debconf debconf/frontend select Noninteractive' | debconf-set-selections

RUN apt-get update && \
    apt-get -y install gnupg2 curl

# Install NVIDIA Container Runtime
RUN curl -s -L https://nvidia.github.io/nvidia-container-runtime/gpgkey | apt-key add -

RUN curl -s -L https://nvidia.github.io/nvidia-container-runtime/ubuntu18.04/nvidia-container-runtime.list | tee /etc/apt/sources.list.d/nvidia-container-runtime.list

RUN apt-get update && \
    apt-get -y install nvidia-container-runtime=${NVIDIA_CONTAINER_RUNTIME_VERSION}

# Copy the K3s binaries and supporting files from the official K3s image
COPY --from=k3s / /

RUN mkdir -p /etc && \
    echo 'hosts: files dns' > /etc/nsswitch.conf

RUN chmod 1777 /tmp

# Provide custom containerd configuration to configure the nvidia-container-runtime
RUN mkdir -p /var/lib/rancher/k3s/agent/etc/containerd/

COPY config.toml.tmpl /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl

# Deploy the nvidia driver plugin on startup
RUN mkdir -p /var/lib/rancher/k3s/server/manifests

COPY device-plugin-daemonset.yaml /var/lib/rancher/k3s/server/manifests/nvidia-device-plugin-daemonset.yaml

VOLUME /var/lib/kubelet
VOLUME /var/lib/rancher/k3s
VOLUME /var/lib/cni
VOLUME /var/log

ENV PATH="$PATH:/bin/aux"

ENTRYPOINT ["/bin/k3s"]
CMD ["agent"]
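
If you prefer to build this Dockerfile by hand rather than via `build.sh`, a minimal sketch of the invocation could look like the following; the image tag is only a placeholder:

```bash
# Build args mirror the ARGs declared at the top of the Dockerfile;
# BuildKit is disabled for the same symlink-copy reason noted in build.sh
DOCKER_BUILDKIT=0 docker build \
  --build-arg K3S_TAG=v1.21.2-k3s1 \
  --build-arg NVIDIA_CONTAINER_RUNTIME_VERSION=3.5.0-1 \
  -t registry.example.com/k3s:v1.21.2-k3s1-cuda .
```
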
32 changes: 0 additions & 32 deletions docs/usage/guides/cuda/Dockerfile.base

This file was deleted.

72 changes: 0 additions & 72 deletions docs/usage/guides/cuda/Dockerfile.k3d-gpu

This file was deleted.

39 changes: 15 additions & 24 deletions docs/usage/guides/cuda/build.sh
@@ -1,30 +1,21 @@
#!/bin/bash

export CI_REGISTRY_IMAGE="YOUR_REGISTRY_IMAGE_URL"
export VERSION="1.0"
export K3S_TAG="v1.21.2+k3s1"
export DOCKER_VERSION="20.10.7"
export IMAGE_TAG="v1.21.2-k3s1"
export NVIDIA_CONTAINER_RUNTIME_VERSION="3.5.0-1"
set -euxo pipefail

docker build -f Dockerfile.base --build-arg DOCKER_VERSION=$DOCKER_VERSION -t $CI_REGISTRY_IMAGE/base:$VERSION . && \
docker push $CI_REGISTRY_IMAGE/base:$VERSION
K3S_TAG=${K3S_TAG:="v1.21.2-k3s1"} # replace + with -, if needed
IMAGE_REGISTRY=${IMAGE_REGISTRY:="MY_REGISTRY"}
IMAGE_REPOSITORY=${IMAGE_REPOSITORY:="rancher/k3s"}
IMAGE_TAG="$K3S_TAG-cuda"
IMAGE=${IMAGE:="$IMAGE_REGISTRY/$IMAGE_REPOSITORY:$IMAGE_TAG"}

rm -rf ./k3s && \
git clone --depth 1 https://github.com/rancher/k3s.git -b "$K3S_TAG" && \
docker run -ti -v ${PWD}/k3s:/k3s -v /var/run/docker.sock:/var/run/docker.sock $CI_REGISTRY_IMAGE/base:1.0 sh -c "cd /k3s && make" && \
ls -al k3s/build/out/data.tar.zst
NVIDIA_CONTAINER_RUNTIME_VERSION=${NVIDIA_CONTAINER_RUNTIME_VERSION:="3.5.0-1"}

if [ -f k3s/build/out/data.tar.zst ]; then
echo "File exists! Building!"
docker build -f Dockerfile.k3d-gpu \
--build-arg NVIDIA_CONTAINER_RUNTIME_VERSION=$NVIDIA_CONTAINER_RUNTIME_VERSION\
-t $CI_REGISTRY_IMAGE:$IMAGE_TAG . && \
docker push $CI_REGISTRY_IMAGE:$IMAGE_TAG
echo "Done!"
else
echo "Error, file does not exist!"
exit 1
fi
echo "IMAGE=$IMAGE"

docker build -t $CI_REGISTRY_IMAGE:$IMAGE_TAG .
# due to some unknown reason, copying symlinks fails with buildkit enabled
DOCKER_BUILDKIT=0 docker build \
  --build-arg K3S_TAG=$K3S_TAG \
  --build-arg NVIDIA_CONTAINER_RUNTIME_VERSION=$NVIDIA_CONTAINER_RUNTIME_VERSION \
  -t $IMAGE .
docker push $IMAGE
echo "Done!"