GKE: Pods cannot access/detect GPU device + driver on GPU nodes #6

Open
cjidboon94 opened this issue Jul 27, 2023 · 2 comments

cjidboon94 commented Jul 27, 2023

When trying to set up nvshare on GKE, installation goes fine, scheduling pods works (e.g. the test pods from the README or a simple CUDA pod that runs nvidia-smi), and nvshare.com/gpu gets consumed. However, the pods fail with nvidia-smi not being found or, in the case of the small PyTorch test pod, with:
Traceback (most recent call last):
  File "/pytorch-add-small.py", line 29, in <module>
    device = torch.cuda.current_device()
  File "/opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py", line 479, in current_device
    _lazy_init()
  File "/opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py", line 214, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx

When scheduling the pod by requesting nvidia.com/gpu, the GPU is visible and the drivers + nvidia-smi are available.
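For comparison, a minimal manifest for the working nvidia.com/gpu case might look like the following (illustrative only; it mirrors the nvshare test pod further below, with the resource request swapped):

apiVersion: v1
kind: Pod
metadata:
  name: test-nvidia-gpu
spec:
  restartPolicy: OnFailure
  tolerations:
  - key: nvidia.com/gpu
    effect: NoSchedule
    operator: Exists
  containers:
  - name: test-nvidia-gpu
    image: nvidia/cuda:11.0.3-base-ubi7
    command:
    - bash
    - -c
    - |
      /usr/local/nvidia/bin/nvidia-smi -L; sleep 300
    resources:
      limits:
        nvidia.com/gpu: 1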

Setup:
GKE k8s version: 1.25.10-gke.2700
nvidia-gpu-device-plugin: GKE's own GPU device plugin
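(To double-check which device plugin is actually serving the GPU resource, one can inspect the DaemonSets and the node's advertised resources; the commands below are a sketch and <gpu-node-name> is a placeholder:)

  kubectl get daemonsets -n kube-system | grep -i nvidia
  kubectl describe node <gpu-node-name> | grep -E 'nvidia.com/gpu|nvshare.com/gpu'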

How to reproduce:

       - name: NVIDIA_VISIBLE_DEVICES
         value: "0"
  • Optional: Add the following affinity to the nvshare-device-plugin DaemonSet so that its pods only get scheduled on GPU nodes:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: cloud.google.com/gke-accelerator
                  operator: Exists
  • Deploy a pod that requests an nvshare.com/gpu resource:
apiVersion: v1
kind: Pod
metadata:
  name: test-nvshare
spec:
  restartPolicy: OnFailure
  tolerations:
  - key: nvidia.com/gpu
    effect: NoSchedule
    operator: Exists
  containers:
  - name: test-nvshare
    env:
    - name: NVSHARE_DEBUG
      value: "1"
    image: nvidia/cuda:11.0.3-base-ubi7
    command:
    - bash
    - -c
    - |
      /usr/local/nvidia/bin/nvidia-smi -L; sleep 300
    resources:
      limits:
        nvshare.com/gpu: 1
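
The pod can then be created and its logs checked with the usual commands (the filename is illustrative):

  kubectl apply -f test-nvshare-pod.yaml
  kubectl logs test-nvshare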

Expected output in the logs: GPU 0: NVIDIA L4 (UUID: GPU-7e0c893c-3254-dfa8-db40-73942c3de761) (this is what you get when scheduling with an nvidia.com/gpu request).
Actual output: bash: /usr/local/nvidia/bin/nvidia-smi: No such file or directory

grgalex (Owner) commented Jul 27, 2023

@cjidboon94

Unfortunately, nvshare currently strictly depends on NVIDIA's upstream K8s device plugin [1].

This is because nvshare's implementation is strictly coupled with NVIDIA's container runtime.

When I have some time next week, I will elaborate on this fully.

A short summary is that nvshare-device-plugin sets the NVIDIA_VISIBLE_DEVICES environment variable (or its symbolic /dev/null mount alternative) in containers that request nvshare.com/gpu. NVIDIA's container runtime, which is a runc hook that runs the containers on the node, reads this environment variable and mounts the necessary files (libraries, device nodes, and binaries such as nvidia-smi) into the container.

Without NVIDIA's device plugin, containers requesting a nvshare.com/gpu device will not see the device exposed when running and will fail.
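
A quick way to see this from inside a failing container (a debugging sketch, not part of the original report) is to confirm that the variable is set while the driver files were never mounted:

  env | grep NVIDIA_VISIBLE_DEVICES   # should be set by nvshare-device-plugin
  ls /dev/nvidia* 2>&1                # device nodes: absent without the runtime hook
  ls /usr/local/nvidia/bin 2>&1       # driver binaries such as nvidia-smi
  ldconfig -p | grep -i libcuda       # driver libraries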

TL;DR:

For the time being, nvidia-device-plugin [1] is a strict prerequisite for operating nvshare on Kubernetes.

[1] https://github.com/NVIDIA/k8s-device-plugin
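
For example, the upstream plugin is usually deployed via its static manifest or Helm chart; the version tag and manifest path below follow the pattern in its README and may differ for the release you use, and the nodes must additionally be configured to use NVIDIA's container runtime:

  kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml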

cjidboon94 (Author) commented:

Thanks for clarifying. I'll see if I can easily switch GKE's device plugin to NVIDIA's upstream one and then get the rest to work.
