Notes on upgrading cuda version on host #28

Closed
cboettig opened this issue Apr 5, 2020 · 1 comment

cboettig commented Apr 5, 2020

After upgrading the nvidia drivers on the host (e.g. with apt-get upgrade), nvidia tasks will fail to run due to a driver mismatch, e.g.:

 $ nvidia-smi
Failed to initialize NVML: Driver/library version mismatch

Rebooting the machine resolves this, but if that is not convenient for a server, it can be done manually: stop any running Xorg instances (e.g. sudo service gdm3 stop) and then sudo rmmod nvidia. The latter may fail and list submodules that are still in use, so remove those first as well, e.g. sudo rmmod nvidia_uvm. Then run sudo nvidia-smi, which reloads the driver modules and confirms the GPU is back up and running.
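A minimal sketch of that sequence, assuming gdm3 is the display manager in use (substitute lightdm or whatever is actually running):

sudo service gdm3 stop    # stop the display manager / Xorg sessions holding the GPU
sudo rmmod nvidia_uvm     # unload in-use submodules first
sudo rmmod nvidia         # then the main driver module
sudo nvidia-smi           # reloads the modules and should list the GPU again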

Running nvidia-docker instances, e.g. with docker run --gpus all ..., should now work again as before. We should add this to the user docs when we get around to writing more about the CUDA images...
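A quick way to verify from inside a container, assuming a CUDA base image such as nvidia/cuda:10.2-base is available (the image tag here is illustrative):

docker run --rm --gpus all nvidia/cuda:10.2-base nvidia-smi   # should print the same GPU table as on the host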

@cboettig

nvidia driver mismatch

Restart the NVIDIA drivers without rebooting the machine:

https://stackoverflow.com/a/45319156/258662

Find and stop all tasks holding the GPU devices:

sudo lsof /dev/nvidia*
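A sketch of how one might stop them, assuming the processes can simply be killed (the PIDs come from the lsof output; stopping the owning service cleanly, e.g. gdm3, is preferable where possible):

sudo lsof -t /dev/nvidia* | sort -u | xargs -r sudo kill   # kill every process with a GPU device file open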

Then check which nvidia modules are loaded and remove them all:

lsmod | grep nvidia

e.g.

sudo rmmod nvidia_drm
sudo rmmod nvidia_modeset
sudo rmmod nvidia_uvm
sudo rmmod nvidia

Confirm nvidia-smi now works again.
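A rough one-shot version of the above, assuming nothing is still holding the devices (rmmod will simply complain and fail for any module that is still in use):

for m in nvidia_drm nvidia_modeset nvidia_uvm nvidia; do
  sudo rmmod "$m" 2>/dev/null || true   # skip modules that are not loaded
done
sudo nvidia-smi                          # reloads the driver and should show the GPU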
