When trying to set up nvshare on GKE, installation goes fine and scheduling pods (e.g. the test pods from the README or a simple CUDA pod that runs nvidia-smi) goes fine; nvshare.com/gpu gets consumed. However, pods error with "nvidia-smi: not found", or, in the case of e.g. the PyTorch small test pod:

Traceback (most recent call last):
  File "/pytorch-add-small.py", line 29, in <module>
    device = torch.cuda.current_device()
  File "/opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py", line 479, in current_device
    _lazy_init()
  File "/opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py", line 214, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
When scheduling the pod by requesting nvidia.com/gpu instead, the GPU is visible and the drivers + nvidia-smi are available.

Setup:
- GKE k8s version: 1.25.10-gke.2700
- nvidia-gpu-device-plugin: GKE's own GPU device plugin

How to reproduce:
1. Add the following env var to the nvshare-device-plugin daemonset, as GKE's gpu-device-plugin does not expose this env var and nvshare-device-plugin depends on it:

   - name: NVIDIA_VISIBLE_DEVICES
     value: "0"
2. Optional: Add an affinity to the nvshare-device-plugin daemonset so that its pods only get scheduled on GPU nodes (one possible affinity is sketched right after this list).
3. Schedule a pod requesting the nvshare.com/gpu resource and check its logs (a sketch of such a pod follows the expected/actual output below).
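A minimal sketch of an affinity for step 2, assuming GKE's cloud.google.com/gke-accelerator node label (adjust the key if your GPU nodes are labelled differently):

  # Goes into the DaemonSet's pod template (spec.template.spec.affinity).
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: cloud.google.com/gke-accelerator
                operator: Exists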
Expected output when checking the pod's logs (this is what you see when scheduling with a nvidia.com/gpu request):

Get GPU information
GPU 0: NVIDIA L4 (UUID: GPU-7e0c893c-3254-dfa8-db40-73942c3de761)
Actual output:

bash: /usr/local/nvidia/bin/nvidia-smi: No such file or directory
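For reference, a pod of the kind used in step 3 might look roughly like the following sketch (the pod name, image tag, and command are illustrative, not taken from the README's test pods):

  apiVersion: v1
  kind: Pod
  metadata:
    name: nvshare-smi-test
  spec:
    restartPolicy: Never
    containers:
      - name: cuda
        image: nvidia/cuda:12.2.0-base-ubuntu22.04   # illustrative CUDA base image
        command: ["bash", "-c", "echo 'Get GPU information'; nvidia-smi -L"]
        resources:
          limits:
            nvshare.com/gpu: 1                       # consumes the shared-GPU resource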
Unfortunately, nvshare currently strictly depends on NVIDIA's upstream K8s device plugin [1].
This is because nvshare's implementation is tightly coupled to NVIDIA's container runtime.
When I have some time next week, I will elaborate on this fully.
A short summary is that nvshare-device-plugin sets the NVIDIA_VISIBLE_DEVICES environment variable (or its symbolic /dev/null mount alternative) in containers that request nvshare.com/gpu. NVIDIA's container runtime, a runc hook that runs the containers on the node, reads this environment variable and mounts the necessary files (libraries, device nodes, and binaries such as nvidia-smi) into the container.
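To illustrate the mechanism described above, here is a hedged sketch of what a container that was allocated a nvshare.com/gpu effectively looks like to the runtime; the pod name, image, and command are illustrative, and the env var shown is the one the runtime hook reacts to:

  apiVersion: v1
  kind: Pod
  metadata:
    name: runtime-hook-illustration
  spec:
    containers:
      - name: main
        image: ubuntu:22.04              # plain image: no driver files baked in
        command: ["nvidia-smi"]          # only present if the runtime hook injected it
        env:
          - name: NVIDIA_VISIBLE_DEVICES # effectively set by nvshare-device-plugin at allocation time
            value: "0"
        resources:
          limits:
            nvshare.com/gpu: 1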
Without NVIDIA's device plugin, containers requesting a nvshare.com/gpu device will not have the device exposed at runtime and will fail.
TL;DR:
For the time being, nvidia-device-plugin [1] is a strict prerequisite for operating nvshare on Kubernetes.