Notes on upgrading cuda version on host #28

Closed
cboettig opened this issue Apr 5, 2020 · 1 comment

cboettig commented Apr 5, 2020

After upgrading the nvidia drivers on the host (e.g. with apt-get upgrade), nvidia tasks will fail to run due to a driver mismatch, e.g.:

 $ nvidia-smi
Failed to initialize NVML: Driver/library version mismatch

Rebooting the machine resolves this, but if that is not convenient for a server, it can be done manually: stop any running Xorg instances (e.g. sudo service gdm3 stop) and then sudo rmmod nvidia. The latter may fail and list submodules that are still in use, so remove those first as well, e.g. sudo rmmod nvidia_uvm. Then run sudo nvidia-smi, which reloads the driver modules and confirms the GPU is back up and running.
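A minimal sketch of that sequence, assuming gdm3 is the display manager in use (substitute lightdm or whatever is actually running):

sudo service gdm3 stop    # stop the display manager / Xorg sessions holding the GPU
sudo rmmod nvidia_uvm     # unload in-use submodules first
sudo rmmod nvidia         # then the main driver module
sudo nvidia-smi           # reloads the modules and should list the GPU again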

Running nvidia-docker instances, e.g. with docker run --gpus all ..., should now work again as before. We should add this to the user docs when we get around to writing more about the CUDA images...
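A quick way to verify from inside a container, assuming a CUDA base image such as nvidia/cuda:10.2-base is available (the image tag here is illustrative):

docker run --rm --gpus all nvidia/cuda:10.2-base nvidia-smi   # should print the same GPU table as on the host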

@cboettig

nvidia driver mismatch

Restart the NVIDIA drivers without rebooting the machine:

https://stackoverflow.com/a/45319156/258662

Find and stop all tasks holding the GPU devices:

sudo lsof /dev/nvidia*
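A sketch of how one might stop them, assuming the processes can simply be killed (the PIDs come from the lsof output; stopping the owning service cleanly, e.g. gdm3, is preferable where possible):

sudo lsof -t /dev/nvidia* | sort -u | xargs -r sudo kill   # kill every process with a GPU device file open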

Then check which nvidia modules are loaded and remove them all:

lsmod | grep nvidia

e.g.

sudo rmmod nvidia_drm
sudo rmmod nvidia_modeset
sudo rmmod nvidia_uvm
sudo rmmod nvidia

Confirm nvidia-smi now works again.
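A rough one-shot version of the above, assuming nothing is still holding the devices (rmmod will simply complain and fail for any module that is still in use):

for m in nvidia_drm nvidia_modeset nvidia_uvm nvidia; do
  sudo rmmod "$m" 2>/dev/null || true   # skip modules that are not loaded
done
sudo nvidia-smi                          # reloads the driver and should show the GPU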
