
enable gpu fails on 20.04 (focal) and MicroK8s 1.21 - failed to get sandbox runtime: no runtime for "nvidia" is configured #3226

Closed · nobuto-m opened this issue Jun 8, 2022 · 13 comments
Labels: inactive, version/1.21 (affects microk8s version 1.21)

Comments

nobuto-m (Contributor) commented Jun 8, 2022

Summary

Now that the operator version has been bumped by canonical/microk8s-core-addons#44, I tested the equivalent change against MicroK8s 1.21, but it is still failing. 1.21 is still required for the Kubeflow use case, and Nvidia's documentation states that K8s 1.21 is supported by GPU Operator release 1.10.

https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/platform-support.html#kubernetes-platforms

GPU Operator Release | Kubernetes
1.10                 | v1.21+
microk8s (1.21 *=)$ git diff
diff --git a/microk8s-resources/actions/enable.gpu.sh b/microk8s-resources/actions/enable.gpu.sh
index 0da168a..3a6965f 100755
--- a/microk8s-resources/actions/enable.gpu.sh
+++ b/microk8s-resources/actions/enable.gpu.sh
@@ -17,11 +17,16 @@ sudo touch ${SNAP_DATA}/var/lock/gpu
 
 echo "Installing NVIDIA Operator"
 
+ENABLE_INTERNAL_DRIVER="true"
+
 "$SNAP/microk8s-helm3.wrapper" repo add nvidia https://nvidia.github.io/gpu-operator
 "$SNAP/microk8s-helm3.wrapper" repo update
 "$SNAP/microk8s-helm3.wrapper" install gpu-operator nvidia/gpu-operator \
+  --create-namespace \
+  --namespace gpu-operator-resources \
+  --version=v1.10.1 \
   --set operator.defaultRuntime=containerd \
-  --set toolkit.version=1.4.4-ubuntu18.04 \
+  --set driver.enabled=$ENABLE_INTERNAL_DRIVER \
   --set toolkit.env[0].name=CONTAINERD_CONFIG \
   --set toolkit.env[0].value=$CONFIG \
   --set toolkit.env[1].name=CONTAINERD_SOCKET \
$ microk8s kubectl -n gpu-operator-resources get pod
NAME                                                          READY   STATUS     RESTARTS   AGE
nvidia-operator-validator-hkmtq                               0/1     Init:0/4   0          4m55s
nvidia-device-plugin-daemonset-xjfvl                          0/1     Init:0/1   0          4m55s
nvidia-dcgm-exporter-pxclq                                    0/1     Init:0/1   0          4m55s
gpu-feature-discovery-42qmn                                   0/1     Init:0/1   0          4m54s
nvidia-container-toolkit-daemonset-k75gp                      0/1     Init:0/1   1          4m55s
gpu-operator-node-feature-discovery-worker-shpx8              1/1     Running    1          5m27s
nvidia-driver-daemonset-5kb5h                                 1/1     Running    1          4m55s
gpu-operator-794b8c8ddc-t9gwf                                 1/1     Running    1          5m27s
gpu-operator-node-feature-discovery-master-57f77d46f9-5c4dl   1/1     Running    1          5m27s
  Warning  FailedCreatePodSandBox  5s (x14 over 3m)  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
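
(For context, this error means containerd's CRI configuration has no "nvidia" runtime entry. On MicroK8s, the rendered config can be checked with something like the command below; the template path follows the default snap layout.)

# Look for an nvidia runtime section in MicroK8s's containerd config
grep -A3 'runtimes.nvidia' /var/snap/microk8s/current/args/containerd-template.toml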

What Should Happen Instead?

All operator-related pods are up and running.

Reproduction Steps

  1. Prepare an Ubuntu 20.04 LTS installation with an Nvidia GPU
  2. Install MicroK8s 1.21
  3. Run a patched enable.gpu.sh:
    snap run --shell microk8s
    bash -eux ./enable.gpu.sh
$ snap version
snap    2.54.4
snapd   2.54.4
series  16
ubuntu  20.04
kernel  5.13.0-1022-aws

$ snap list microk8s
Name      Version   Rev   Tracking     Publisher   Notes
microk8s  v1.21.12  3202  1.21/stable  canonical✓  classic

$ lspci | grep NVIDIA
00:1e.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)

Introspection Report

inspection-report-20220608_072258.tar.gz

Can you suggest a fix?

Are you interested in contributing a fix?

nobuto-m (Contributor, Author) commented Jun 8, 2022

I think the previously opened issue #2575 might have the same root cause, but at that point the solution was to use 1.22, which doesn't support Kubeflow in this case.

debimishra89 commented Jun 8, 2022

I'm facing the same issue. Is there any workaround for 1.21? I can't upgrade to 1.22 because Kubeflow is not supported there.

debimishra89 commented

@ktsakalozos, is any workaround available for 1.21?

AlexsJones added the version/1.21 (affects microk8s version 1.21) label Jun 8, 2022
ktsakalozos (Member) commented

The NVIDIA operator will not deploy cleanly on 1.21. One of the reasons I am aware of is that the containerd daemon is configured to be of type simple [1]. Considering how close 1.21 is to end of support, and the risk involved in changing the daemon's behavior, it is unlikely this issue will be addressed.

[1] https://snapcraft.io/docs/services-and-daemons
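
(For context, a snap's daemon type is declared in its metadata; on an installed system it can be inspected with something like the command below. The path follows the standard snap layout, and the exact output is illustrative.)

# Show how MicroK8s declares its services; the type in question is 'daemon: simple'
grep -B2 'daemon:' /snap/microk8s/current/meta/snap.yaml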

nobuto-m (Contributor, Author) commented Jun 8, 2022

Ah, okay. You are talking about this change.

ktsakalozos (Member) commented

> Ah, okay. You are talking about this change.

Yes, this is the top-level patch.

nobuto-m (Contributor, Author) commented Jun 9, 2022

Okay, at least the theory has been confirmed; the patchset below makes the GPU enablement step work:
1.21...nobuto-m:1.21-gpu
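
For reference, a rough sketch of building a snap from that branch (repository URL and workflow assumed, not verified):

# Fetch the patched branch and build the snap with snapcraft
git clone https://github.com/nobuto-m/microk8s -b 1.21-gpu
cd microk8s
snapcraft   # produces microk8s_*.snap
sudo snap install --dangerous --classic ./microk8s_*.snap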

debimishra89 commented

> Okay, at least the theory has been confirmed; the patchset below makes the GPU enablement step work: 1.21...nobuto-m:1.21-gpu

@nobuto-m, can you please help me with how to use this patch? I am not able to modify files inside the snap directory.

nobuto-m (Contributor, Author) commented

> Okay, at least the theory has been confirmed; the patchset below makes the GPU enablement step work: 1.21...nobuto-m:1.21-gpu
>
> @nobuto-m, can you please help me with how to use this patch? I am not able to modify files inside the snap directory.

You can grab a custom-built snap here (only for testing):
https://people.canonical.com/~nobuto/microk8s/microk8s_v1.21.13_amd64_60d55f8.snap
and install it with snap install --dangerous --classic ./microk8s_v1.21.13_amd64_60d55f8.snap. However, please note that the build was made only to confirm the theory; it is not for general consumption and is not supported. Also, as @ktsakalozos stated above, it is unlikely to be released for 1.21.

debimishra89 commented

The build is failing with the error below:

ubuntu@dasec-node2:~/microk8s-master$ sudo snapcraft
Running with 'sudo' may cause permission errors and is discouraged. Use 'sudo' when cleaning.
Launching a VM.
Skipping pull bash-utils (already ran)
Skipping pull cluster-agent (already ran)
Skipping pull cni (already ran)
Skipping pull libmnl (already ran)
Skipping pull libnftnl (already ran)
Skipping pull iptables (already ran)
Skipping pull runc (already ran)
'containerd' has dependencies that need to be staged: runc
Skipping pull runc (already ran)
Building runc

+ set -eux
+ . /root/parts/runc/src/set-env-variables.sh
+ set -eu
/bin/sh: 2: set: Illegal option -
Failed to run 'override-build': Exit code was 2.
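
(This class of failure usually indicates that the part's build script ran under /bin/sh, which is dash on Ubuntu, rather than bash; dash rejects bash-only set options. A minimal illustration, not the confirmed root cause here:)

sh -c 'set -o pipefail'    # dash: "sh: 1: set: Illegal option -o pipefail"
bash -c 'set -o pipefail'  # bash: succeeds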

neoaggelos (Contributor) commented

@debimishra89 Note that the GPU addon is broken in 1.21. MicroK8s 1.21 is also out of support, so we have no immediate plans to change it.

However, it is possible to enable GPU support using the host's drivers and container runtime. On a fresh installation of MicroK8s 1.21, you can follow the instructions below (also in this gist):


# 1. install microk8s and enable required addons
sudo snap install microk8s --classic --channel 1.21
microk8s enable dns
microk8s enable helm3

# 2. install nvidia drivers
sudo apt-get install nvidia-headless-510-server nvidia-utils-510-server

# 3. ensure nvidia drivers are loaded
if ! nvidia-smi -L; then
  echo "No compatible GPU found"
fi

# 4. install nvidia-container-runtime
curl -s -L https://nvidia.github.io/nvidia-container-runtime/gpgkey | \
  sudo apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-runtime.list
sudo apt-get update
sudo apt-get install -y nvidia-container-runtime

# 5. configure and restart containerd
echo '
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
' | sudo tee -a /var/snap/microk8s/current/args/containerd-template.toml
sudo snap restart microk8s.daemon-containerd

# 6. install GPU operator
sudo microk8s helm3 repo add nvidia https://nvidia.github.io/gpu-operator
sudo microk8s helm3 install gpu-operator nvidia/gpu-operator \
  --create-namespace -n gpu-operator-resources \
  --set driver.enabled=false,toolkit.enabled=false
  
# 7. wait for validations to complete
while ! sudo microk8s kubectl logs -n gpu-operator-resources -lapp=nvidia-operator-validator | grep "all validations are successful"
do
  echo "Waiting for GPU addon"
  sleep 5
done
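
Once the validator reports success, a minimal smoke test is to schedule a pod that requests a GPU (the image tag and pod name below are illustrative):

# Run a throwaway pod that asks for one GPU and lists the visible devices
microk8s kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:11.6.2-base-ubuntu20.04
    command: ["nvidia-smi", "-L"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
# Once the pod has run, its log should list the GPU
microk8s kubectl logs gpu-smoke-test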

PS. We are in the process of updating the documentation page accordingly.

stale bot commented May 25, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the inactive label May 25, 2023
neoaggelos (Contributor) commented

A documentation note has been added at https://microk8s.io/docs/addon-gpu#microk8s-121-12; closing the issue.
