When trying to set up nvshare on GKE, installation goes fine and scheduling pods (e.g. the test pods from the README or a simple CUDA pod that runs nvidia-smi) goes fine; nvshare.com/gpu gets consumed. However, pods error with "nvidia-smi: not found", or, in the case of e.g. the PyTorch small test pod:

Traceback (most recent call last):
  File "/pytorch-add-small.py", line 29, in <module>
    device = torch.cuda.current_device()
  File "/opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py", line 479, in current_device
    _lazy_init()
  File "/opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py", line 214, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
When scheduling the pod by requesting nvidia.com/gpu instead, the GPU is visible and the drivers + nvidia-smi are available.

Setup:
- GKE k8s version: 1.25.10-gke.2700
- nvidia-gpu-device-plugin: GKE's own GPU device plugin

How to reproduce:
1. Add the following env var to the nvshare-device-plugin daemonset, as GKE's gpu-device-plugin does not expose this env var and nvshare-device-plugin depends on it:

   - name: NVIDIA_VISIBLE_DEVICES
     value: "0"
2. Optional: Add an affinity to the nvshare-device-plugin daemonset so that its pods only get scheduled on GPU nodes (one possible affinity is sketched right after this list).
3. Schedule a pod requesting the nvshare.com/gpu resource and check its logs (a sketch of such a pod follows the expected/actual output below).
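A minimal sketch of an affinity for step 2, assuming GKE's cloud.google.com/gke-accelerator node label (adjust the key if your GPU nodes are labelled differently):

  # Goes into the DaemonSet's pod template (spec.template.spec.affinity).
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: cloud.google.com/gke-accelerator
                operator: Exists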
Expected output when checking the pod's logs (this is what you see when scheduling with a nvidia.com/gpu request):

Get GPU information
GPU 0: NVIDIA L4 (UUID: GPU-7e0c893c-3254-dfa8-db40-73942c3de761)
Actual output:

bash: /usr/local/nvidia/bin/nvidia-smi: No such file or directory
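For reference, a pod of the kind used in step 3 might look roughly like the following sketch (the pod name, image tag, and command are illustrative, not taken from the README's test pods):

  apiVersion: v1
  kind: Pod
  metadata:
    name: nvshare-smi-test
  spec:
    restartPolicy: Never
    containers:
      - name: cuda
        image: nvidia/cuda:12.2.0-base-ubuntu22.04   # illustrative CUDA base image
        command: ["bash", "-c", "echo 'Get GPU information'; nvidia-smi -L"]
        resources:
          limits:
            nvshare.com/gpu: 1                       # consumes the shared-GPU resource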
Unfortunately, nvshare currently strictly depends on NVIDIA's upstream K8s device plugin [1].
This is because nvshare's implementation is tightly coupled to NVIDIA's container runtime.
When I have some time next week, I will elaborate on this fully.
A short summary is that nvshare-device-plugin sets the NVIDIA_VISIBLE_DEVICES environment variable (or its symbolic /dev/null mount alternative) in containers that request nvshare.com/gpu. NVIDIA's container runtime, a runc hook that runs the containers on the node, reads this environment variable and mounts the necessary files (libraries, device nodes, and binaries such as nvidia-smi) into the container.
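To illustrate the mechanism described above, here is a hedged sketch of what a container that was allocated a nvshare.com/gpu effectively looks like to the runtime; the pod name, image, and command are illustrative, and the env var shown is the one the runtime hook reacts to:

  apiVersion: v1
  kind: Pod
  metadata:
    name: runtime-hook-illustration
  spec:
    containers:
      - name: main
        image: ubuntu:22.04              # plain image: no driver files baked in
        command: ["nvidia-smi"]          # only present if the runtime hook injected it
        env:
          - name: NVIDIA_VISIBLE_DEVICES # effectively set by nvshare-device-plugin at allocation time
            value: "0"
        resources:
          limits:
            nvshare.com/gpu: 1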
Without NVIDIA's device plugin, containers requesting a nvshare.com/gpu device will not have the device exposed at runtime and will fail.
TL;DR:
For the time being, nvidia-device-plugin [1] is a strict prerequisite for operating nvshare on Kubernetes.