# Tutorial

We will deploy two types of DaemonSets via the Helm chart: one to support sharing of GPUs and one to support exclusive (regular, non-shared) GPUs.

Label GPU nodes as either shared or exclusive.

```shell
kubectl label node nodename1 gpu-node-type=exclusive
...
kubectl label node nodename2 gpu-node-type=shared
...
```

Deploy the device plugin twice: once for the exclusive GPU nodes and once for the shared GPU nodes, based on the node labels above.

```shell
helm install nvidia-exclusive deployments/helm/nvidia-device-plugin --set nodeSelector='gpu-node-type=exclusive'
helm install nvidia-shared deployments/helm/nvidia-device-plugin --set nodeSelector='gpu-node-type=shared' --set resourceConfig='gpu:sharedgpu:4'
```
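If you prefer a values file over `--set` flags, the same options can be kept in a file. The snippet below is a hypothetical `values-shared.yaml` (the filename is ours), and it assumes the chart accepts `nodeSelector` and `resourceConfig` as the same plain strings the `--set` flags pass:

```yaml
# Hypothetical values-shared.yaml mirroring the --set flags above.
# Assumes the chart reads both values as plain strings, exactly as passed
# on the command line; adjust if your chart version expects a different shape.
nodeSelector: "gpu-node-type=shared"
resourceConfig: "gpu:sharedgpu:4"
```

You would then install the shared variant with `helm install nvidia-shared deployments/helm/nvidia-device-plugin -f values-shared.yaml`.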

You can use `kubectl-view-allocations` to see that you now have two extended resource types; the tool also shows the quantity of each that is in use and available:

- `nvidia.com/gpu`
- `nvidia.com/sharedgpu`

Create two pods that each request a shared GPU and do some work on it:

```shell
kubectl create -f pods/pod-shared-pytorch.yml
kubectl create -f pods/pod-shared-pytorch.yml
```
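For reference, the sketch below shows the general shape of a pod that requests the shared resource. It is not the repository's `pods/pod-shared-pytorch.yml`: the image and command are placeholders, and `generateName` is an assumption so the same manifest can be created twice.

```yaml
# Minimal sketch of a pod requesting a share of a GPU (illustrative only;
# the real pods/pod-shared-pytorch.yml trains a PyTorch model on MNIST).
apiVersion: v1
kind: Pod
metadata:
  generateName: shared-gpu-demo-   # assumed, so `kubectl create` can run twice
spec:
  restartPolicy: Never
  containers:
    - name: trainer
      image: nvcr.io/nvidia/pytorch:23.10-py3   # placeholder CUDA-enabled image
      command: ["python", "-c", "import torch; print(torch.cuda.get_device_name(0))"]
      resources:
        limits:
          nvidia.com/sharedgpu: 1   # one share of a GPU, not a whole GPU
```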

Inspect the logs of the pods to see that both are using PyTorch with CUDA support to train a DNN model on MNIST.
You can also run `nvidia-smi -L` in each pod to see that both report the same GPU (confirming that they are actually sharing one GPU and did not land on different GPUs, if your system has more than one).

Note that inside a pod, `nvidia-smi` lists only that pod's own processes (although the utilization percentage reflects all processes on the GPU), whereas from the host it shows everything running on a given GPU.

You can also create pods with exclusive ownership of a GPU by replacing `nvidia.com/sharedgpu` with `nvidia.com/gpu`, as shown in this example pod.
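As an illustration only (not the repository's example file), an exclusive pod differs from the shared sketch above just in the resource name it requests; the image here is again a placeholder.

```yaml
# Illustrative exclusive-GPU pod; not the repository's example manifest.
apiVersion: v1
kind: Pod
metadata:
  generateName: exclusive-gpu-demo-
spec:
  restartPolicy: Never
  containers:
    - name: trainer
      image: nvcr.io/nvidia/pytorch:23.10-py3   # placeholder CUDA-enabled image
      command: ["nvidia-smi", "-L"]
      resources:
        limits:
          nvidia.com/gpu: 1   # exclusive ownership of one whole GPU
```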