Latest NVIDIA Container Runtime Support not working anymore with K3S #8248
Comments
I had this same issue and I was able to fix it by applying the changes from https://github.com/NVIDIA/k8s-device-plugin#configure-containerd in a config.toml.tmpl based on the format here: https://github.com/k3s-io/k3s/blob/master/pkg/agent/templates/templates_linux.go. That also included removing the default nvidia plugin detection in the template (which could probably be brought back to fit with the correct config). Here's the diff:
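A rough sketch (not necessarily the commenter's exact diff) of the runtime sections from the NVIDIA device-plugin README; these would be merged into a complete config.toml.tmpl built from the linked k3s template, not used as a standalone file:

```bash
# write the snippet somewhere for reference, then merge it into the full config.toml.tmpl
cat <<'EOF' > /tmp/nvidia-containerd-snippet.toml
[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  privileged_without_host_devices = false
  runtime_type = "io.containerd.runc.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
    BinaryName = "/usr/bin/nvidia-container-runtime"
EOF
```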
I restarted k3s and also had to delete the nvidia-device-plugin-daemonset pod; after that the earlier error stopped appearing and the plugin logs looked healthy.
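For reference, roughly the commands involved (the pod label and namespace assume the upstream static device-plugin manifest; adjust if it was installed differently):

```bash
# restart the agent so the regenerated containerd config is picked up,
# then bounce the device-plugin pod so the DaemonSet recreates it
sudo systemctl restart k3s-agent
kubectl -n kube-system delete pod -l name=nvidia-device-plugin-ds
```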
One thing to be aware of (that I'm still checking on) is that after a reboot, all of my kube-system pods started failing with CrashLoopBackOff. Other people hit a related issue with the cgroup configuration in #5454. I confirmed that removing the nvidia config from the config.toml.tmpl file stops the CrashLoopBackOff condition, but I'm still not entirely sure why. Edit: after adding the SystemdCgroup line to the nvidia runtime options section, my containers stopped crashing:
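The assumed shape of that fix, plus a way to confirm it reached the config k3s actually renders:

```bash
# assumed shape of the fix: the nvidia runtime options table in config.toml.tmpl gains
#     SystemdCgroup = true
# next to BinaryName; after editing, restart and check the rendered config
sudo systemctl restart k3s-agent
sudo grep -A3 'runtimes.nvidia.options' /var/lib/rancher/k3s/agent/etc/containerd/config.toml
```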
It sounds like the main difference here is just that we need to set SystemdCgroup in the runtime options. Do you know which release of the nvidia container runtime started requiring this?
Relevant issue: NVIDIA/k8s-device-plugin#406
After trying out all the suggestions from here and other issues, I got it working by following this blog: https://medium.com/sparque-labs/serving-ai-models-on-the-edge-using-nvidia-gpu-with-k3s-on-aws-part-4-dd48f8699116
That link gives me HTTP 404. However, I have solved the underlying problem. The reason is that k3s detects the nvidia container runtime, but it does not make it the default one. The Helm chart, or the workload spec, therefore has to request the nvidia runtime explicitly.
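A sketch of a workload that requests the nvidia runtime explicitly instead of relying on the default runtime (assumes a "nvidia" RuntimeClass exists, see the next comment; the image tag is illustrative):

```bash
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  runtimeClassName: nvidia
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
```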
It still does not work for me.
@xinmans Try applying this manifest:
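A sketch of what that manifest likely amounts to, following NVIDIA/k8s-device-plugin#406: define a nvidia RuntimeClass and make the device plugin use it (DaemonSet name and namespace assume the upstream static manifest used earlier in this thread):

```bash
# 1) RuntimeClass that maps to the nvidia runtime k3s configured in containerd
kubectl apply -f - <<'EOF'
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
EOF

# 2) make the device-plugin pods run with that runtime
kubectl -n kube-system patch daemonset nvidia-device-plugin-daemonset \
  --type merge -p '{"spec":{"template":{"spec":{"runtimeClassName":"nvidia"}}}}'
```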
And re-create the nvidia plugin. Relevant: NVIDIA/k8s-device-plugin#406 (comment)
There's a dot at the end of the URL for some reason; that needs to be removed. In any case, the mentioned article uses the GPU Operator, which in turn uses the operator framework and automates this whole process. It immediately worked for me, ymmv. https://github.com/NVIDIA/gpu-operator
Using helm:
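Roughly the standard GPU Operator install (repo and chart names per NVIDIA's docs); the extra values pointing the toolkit at k3s's containerd paths are commonly needed on k3s, so verify them against the operator documentation for your version:

```bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --set 'toolkit.env[0].name=CONTAINERD_CONFIG' \
  --set 'toolkit.env[0].value=/var/lib/rancher/k3s/agent/etc/containerd/config.toml' \
  --set 'toolkit.env[1].name=CONTAINERD_SOCKET' \
  --set 'toolkit.env[1].value=/run/k3s/containerd/containerd.sock'
```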
@henryford my bad, I updated the medium article link. Good to see that you got it working.
I cannot get k3s to recognize my GPU. I have followed the official docs, and my config.toml lists the expected nvidia runtime entries.
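Two quick sanity checks for this situation (the pod label assumes the upstream static device-plugin manifest):

```bash
nvidia-smi                                                              # does the host driver work?
kubectl -n kube-system logs -l name=nvidia-device-plugin-ds --tail=20   # does the plugin see the GPU?
```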
But checking for GPU availability on my node shows no GPU resource:
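A typical way to run that check (node name is a placeholder); a healthy node lists nvidia.com/gpu under allocatable:

```bash
kubectl get node <node-name> -o jsonpath='{.status.allocatable}'
```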
And any pod initialized with a GPU request remains in Pending.
Notes / additional questions:
I'm going to convert this to a discussion, as it seems like a K8s/NVIDIA related issue rather than a k3s bug.
This issue was moved to a discussion.
You can continue the conversation there. Go to discussion →
Environmental Info:
K3s Version: v1.27.4+k3s1
Node(s) CPU architecture, OS, and Version:
169092810522.04~d567a38 SMP PREEMPT_DYNAMIC Tue A x86_64 x86_64 x86_64 GNU/Linux
Cluster Configuration:
1 Server, 1 agent
Describe the bug:
Nvidia device plugin pod is in CrashLoopBackOff and unable to detect the GPU.
The documentation for enabling GPU workloads (https://docs.k3s.io/advanced?_highlight=nvidia#nvidia-container-runtime-support) no longer works when using the latest nvidia driver (535) and the Nvidia container toolkit (1.13.5).
Steps To Reproduce:
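For context, roughly the documented flow this report follows (a sketch; package names and the device-plugin manifest tag are illustrative for Ubuntu 22.04):

```bash
sudo apt-get install -y nvidia-driver-535 nvidia-container-toolkit   # host driver + toolkit
curl -sfL https://get.k3s.io | sh -                                  # (re)install k3s so it detects the runtime
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml
```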
Note: I installed the toolkit both with and without the base package, because I wasn't sure how to proceed regarding CDI support in K3S.
Note: I have restarted k3s-agent, just in case.
Note: there are additional containerd configuration instructions (https://github.com/NVIDIA/k8s-device-plugin#configure-containerd) which I didn't follow.
Expected behavior:
Expected kubectl describe node gpu1 to show the detected GPU specification and the corresponding annotations.
Actual behavior:
The node gpu1 is not showing any GPU-related component. I didn't run the nbody-gpu-benchmark pod to test, given that its GPU resource limit specification could not be satisfied anyway.
Additional context / logs:
The K3S documentation for the Nvidia runtime (https://docs.k3s.io/advanced?_highlight=nvidia#nvidia-container-runtime-support) describes a working solution using driver 515.
I used this approach successfully until now (with k3s v1.24, NFD v0.13 and gpu-feature-discovery), but I recently upgraded my GPU and installed the newer driver version 535 for compatibility. I also reinstalled k3s v1.27.4+k3s1 in the process.
Ideas for resolution
It could be a regression caused by the latest nvidia driver (535), but I haven't tested that yet, given how long it would take to downgrade and re-test.
There are additional instructions for containerd runtime configuration described in the Nvidia device plugin docs which I didn't follow: https://github.com/NVIDIA/k8s-device-plugin#configure-containerd
Shall I define them in config.toml.tmpl?
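For reference, the usual (unofficial) way people handle this on k3s, since k3s rewrites config.toml on every start:

```bash
# seed the template from the generated config, then add the nvidia runtime sections
cd /var/lib/rancher/k3s/agent/etc/containerd
sudo cp config.toml config.toml.tmpl
# ...edit config.toml.tmpl to add/adjust the nvidia runtime entries...
sudo systemctl restart k3s-agent
```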
There is now CDI (https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#step-2-generate-a-cdi-specification), but there are no instructions for containerd, let alone for k3s.
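The CDI spec generation step from that guide looks like this; whether and how k3s's embedded containerd consumes it is exactly the open question here:

```bash
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
nvidia-ctk cdi list   # list the devices the generated spec exposes
```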
Not sure if this is on the K3S or Nvidia side; looking forward to hearing your feedback.
Thank you in advance
Jean-Paul