NVIDIA GPUs on CoreOS Container Linux

Leveraging NVIDIA GPUs on Container Linux involves the following steps:

  • compiling the NVIDIA kernel modules;
  • loading the kernel modules on demand;
  • creating NVIDIA device files; and
  • loading NVIDIA libraries.

Compounding this complexity, these steps must be repeated every time the Container Linux system updates, since the modules may no longer be compatible with the new kernel.

Modulus takes care of automating all of these steps and ensures that kernel modules are up-to-date for the host's kernel.
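As a rough illustration of what this automation covers — not Modulus's actual implementation; the module paths are assumptions, though the major numbers are the conventional NVIDIA ones — the manual steps look roughly like:

```shell
# Hedged sketch of the manual steps Modulus automates; module paths are
# illustrative and these commands require root on a host with an NVIDIA GPU.
# 1. Build the kernel modules against the running kernel (omitted here).
# 2. Load the modules:
insmod /opt/drivers/nvidia/nvidia.ko
insmod /opt/drivers/nvidia/nvidia-uvm.ko
# 3. Create the device files; the core NVIDIA devices use major number 195:
mknod -m 0666 /dev/nvidiactl c 195 255
mknod -m 0666 /dev/nvidia0 c 195 0
# nvidia-uvm is assigned a dynamic major number, readable from /proc/devices:
uvm_major=$(awk '$2 == "nvidia-uvm" {print $1}' /proc/devices)
mknod -m 0666 /dev/nvidia-uvm c "$uvm_major" 0
# 4. Make the userspace libraries visible, e.g. by extending the loader path:
export LD_LIBRARY_PATH=/opt/drivers/nvidia/lib:${LD_LIBRARY_PATH:-}
```

Doing this by hand on every OS update is exactly the toil Modulus removes.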

Installation for Kubernetes

Requirements

You will need a running Kubernetes cluster and the kubectl command to deploy Modulus.

Getting Started

Edit the provided Modulus DaemonSet to specify the version of the NVIDIA driver you would like to compile, e.g. 470.103.01. Then create the deployment:

kubectl apply -f https://raw.githubusercontent.com/squat/modulus/main/nvidia/daemonset.yaml

This DaemonSet will run a Modulus pod on all of the Kubernetes nodes. You may choose to add a nodeSelector to schedule Modulus exclusively to nodes with GPUs.
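For example, a nodeSelector like the following could be added to the DaemonSet's pod template; note that the `gpu: "true"` label is an assumption — use whatever label already identifies your GPU nodes, or apply one first:

```yaml
# Illustrative fragment for the Modulus DaemonSet's pod template spec.
# The "gpu" label is hypothetical; apply it to your GPU nodes first, e.g.:
#   kubectl label node <node-name> gpu=true
spec:
  template:
    spec:
      nodeSelector:
        gpu: "true"
```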

Installation for Systemd

Requirements

First, make sure you have the Modulus code available on your Container Linux machine and that the modulus service is installed.

Getting Started

Enable and start the modulus template unit file with the desired NVIDIA version, e.g. 470.103.01:

sudo systemctl enable modulus@nvidia-470.103.01
sudo systemctl start modulus@nvidia-470.103.01

This service takes care of automatically compiling, installing, backing up, and loading the NVIDIA kernel modules as well as creating the NVIDIA device files.

Compiling the NVIDIA kernel modules can take 10 to 15 minutes, depending on your Internet speed, CPU, and RAM. To check the progress of the compilation, run:

journalctl -fu modulus@nvidia-470.103.01

Verify

Once Modulus has successfully run, the host should have NVIDIA device files and kernel modules loaded. To verify that the kernel modules were loaded, run:

lsmod | grep nvidia

This should return something like:

nvidia_uvm            626688  2
nvidia              12267520  35 nvidia_uvm
...

Verify that the devices were created with:

ls /dev/nvidia*

This should produce output like:

/dev/nvidia-uvm  /dev/nvidia0  /dev/nvidiactl

Finally, try running the NVIDIA system monitoring interface (SMI) command, nvidia-smi, to check the status of the connected GPU:

/opt/drivers/nvidia/bin/nvidia-smi

If your GPU is connected, this command will return information about the model, temperature, memory usage, GPU utilization, etc.

Leveraging NVIDIA GPUs in Containers

Now that the kernel modules are loaded, devices are present, and libraries have been created, the connected GPU can be utilized in containerized applications.

In order to give the container access to the GPU, the device files must be explicitly mapped into the container's namespace, and the NVIDIA libraries and binaries must be mounted into the container. Consider the following command, which runs the nvidia-smi command inside of a Docker container:

docker run -it \
--device=/dev/nvidiactl \
--device=/dev/nvidia-uvm \
--device=/dev/nvidia0 \
--volume=/opt/drivers/nvidia:/usr/local/nvidia:ro \
--entrypoint=nvidia-smi \
nvidia/cuda:9.1-devel

There exist plugins that help automate the loading of GPU devices in Docker containers; for more information, check out the NVIDIA-Docker repository.

Leveraging NVIDIA GPUs in Kubernetes

In order to make use of the NVIDIA drivers and devices in your Kubernetes workloads, you will need to deploy a Kubernetes device plugin for NVIDIA GPUs. Drivers compiled with Modulus work seamlessly with the Kubernetes device plugin provided upstream in the addons directory as well as the official NVIDIA device plugin.

Deploying the former requires no special NVIDIA container runtime and can be done with one command:

kubectl apply -f https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/device-plugins/nvidia-gpu/daemonset.yaml

Once the device plugin is running, verify that the desired nodes have allocatable GPUs:

kubectl describe node <node-name>
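Once GPUs are allocatable, workloads can request them through the nvidia.com/gpu extended resource. The manifest below is a hedged sketch: the pod and volume names are illustrative, and the hostPath volume mount is only needed when, as with the upstream addons plugin, the drivers live on the host rather than being injected by a container runtime:

```yaml
# Illustrative pod requesting one GPU via the device plugin; names are
# hypothetical, and the image tag mirrors the earlier Docker example.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:9.1-devel
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
    volumeMounts:
    - name: nvidia-drivers
      mountPath: /usr/local/nvidia
      readOnly: true
  volumes:
  - name: nvidia-drivers
    hostPath:
      path: /opt/drivers/nvidia
```

If the pod completes successfully, its logs (kubectl logs gpu-test) should show the same nvidia-smi output seen on the host.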

Compatibility Matrix

Here we list combinations of Flatcar Linux, Linux kernel, and NVIDIA driver versions for which the NVIDIA modules are known to build successfully using Modulus. This table was created after the NVIDIA modules failed to build on some recent versions of Flatcar Linux:

Flatcar    Kernel    NVIDIA
2605.12.0  5.4.92    440.64
3510.2.8   5.15.129  470.103.01
3602.1.6   5.15.132  470.103.01
3510.2.7   5.15.125  535.104.05
3745.0.0   6.1.55    535.104.05