start docker nvidia fail could not select device driver "" with capabilities: [[gpu]] #1682
Comments
Sorry I can't help, but I have the exact same issue: everything was working before, then after an update (Driver Version: 515.65.01) and a reboot the GPU no longer works in Docker. I'm running a Quadro P400 on RHEL 8.6.
@ywangwxd / @c-patrick could you provide the docker commands that you are running? We have seen reports of issues with the NVIDIA Container Toolkit v1.11.0, so this may indicate a regression in those components. Could you:
@elezar thanks for looking into this. The command I'm running is: … I've uncommented the …
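(The command itself was lost in formatting; the error in the title typically comes from a GPU-enabled docker run along these lines, where the image tag is only an example.)

```bash
# request all GPUs for the container; the CUDA image tag is illustrative
docker run --rm --gpus all nvidia/cuda:11.4.3-base-ubuntu20.04 nvidia-smi
```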
@c-patrick could you provide the output for:
@elezar Sure, please find the output below:
OK, I would expect a /usr/bin/nvidia-container-runtime-hook executable (or symlink) to be present on the system.
In the v1.11.0 release we switched these around, as we want to use nvidia-container-runtime-hook as the actual executable going forward, with nvidia-container-toolkit provided as a symlink to it.
However, due to the way the RPM packages are defined, the symlink is (unconditionally) removed in the post-uninstall step. For 1.11.0 we have:
For 1.10.0 we had:
What this means is that when upgrading from v1.10.0 to v1.11.0, the old package's post-uninstall step removes the freshly installed hook. The workaround is to remove the nvidia-container-toolkit package and install it again.
And then confirm the following:
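(The exact snippet was lost in formatting. On an RPM-based system, the workaround and the follow-up check would look roughly like this; the package manager and paths are my assumptions.)

```bash
# remove and reinstall the toolkit so that the hook binary is laid down again
sudo yum remove -y nvidia-container-toolkit
sudo yum install -y nvidia-container-toolkit

# confirm the hook is present again
which nvidia-container-runtime-hook
ls -l /usr/bin/nvidia-container-runtime-hook /usr/bin/nvidia-container-toolkit
```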
@ywangwxd since you're using Ubuntu and not RHEL, I would have to check the packages there a bit more closely, but I can see a similar situation occurring there.
@elezar Thanks very much for your help. I removed and then reinstalled the NVIDIA Container Toolkit and all is working well. Running the check above now gives the expected output.
Now running containers with GPU support works again. Thank you very much again for your help.
Thank you, although I have solved the issue in another way. I searched on Google and another post said it was because Docker was installed as a snap package (I do not know what that actually is) and suggested reinstalling it. I found that this solved my problem. Anyway, I will keep your response in mind; I may encounter the same problem again in the future, who knows.
Thanks for this; following your advice fixed the problem for me. I found countless bits of info on Ubuntu for similar problems, but this was just what I needed for CentOS Stream. Ta.
Hi @elezar, thanks a lot for your comments and detailed description. The solution below worked for me:
I'm experiencing this issue as well at the moment on Flatcar Linux, using Docker and the …
Update: I just fixed this. https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/user-guide.html says that you can use an environment variable instead. Using this environment variable, it does work! Is there a bug in the --gpus all code?
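For anyone curious, the environment-variable route from that user guide looks something like this; it goes through the NVIDIA runtime directly rather than the --gpus flag (the image tag is just an example):

```bash
# select the nvidia runtime and expose all GPUs via NVIDIA_VISIBLE_DEVICES
docker run --rm --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all \
    nvidia/cuda:11.4.3-base-ubuntu20.04 nvidia-smi
```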
The …
Looking through this problem again, note that reinstalling the nvidia-container-toolkit package should be sufficient.
Running:
ensures that this file (the /usr/bin/nvidia-container-runtime-hook executable) is installed correctly:
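For reference, the expected layout after reinstalling v1.11.0 should be something like the following, with the hook as the real executable and nvidia-container-toolkit as a symlink to it:

```bash
ls -l /usr/bin/nvidia-container-runtime-hook /usr/bin/nvidia-container-toolkit
# -rwxr-xr-x ... /usr/bin/nvidia-container-runtime-hook
# lrwxrwxrwx ... /usr/bin/nvidia-container-toolkit -> nvidia-container-runtime-hook
```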
I'm running this as the Docker container on Flatcar Linux, since you cannot install anything on Flatcar.
I can confirm this 1.10 --> 1.11 upgrade breaks on Red Hat / RPM-based OSes too.
Reproduce
Here are the steps to reproduce:
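(The steps themselves were lost in formatting; a sketch on an RPM-based host would be along these lines, with version strings approximate.)

```bash
# install the 1.10.0 toolkit, then upgrade to 1.11.0
sudo yum install -y nvidia-container-toolkit-1.10.0
sudo yum update -y nvidia-container-toolkit

# the hook removed by the old package's %postun is now missing,
# so GPU containers fail with the error from this issue
ls -l /usr/bin/nvidia-container-runtime-hook    # No such file or directory
docker run --rm --gpus all nvidia/cuda:11.4.3-base-ubuntu20.04 nvidia-smi
```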
Fix
Manual fix
Simply reinstalling it fixed it. Confirmed on two hosts at least.
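The reinstall itself is a one-liner (yum shown; dnf behaves the same way):

```bash
sudo yum reinstall -y nvidia-container-toolkit
ls -l /usr/bin/nvidia-container-runtime-hook    # should exist again
```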
Chef fix
Here is a Chef recipe I used to fix it, for anyone using Chef. One must do a FULL uninstall and reinstall; there is no 'reinstall' action in Chef. Here is how I implemented it in Chef currently:
The first Chef run removes the package:
The second (and subsequent) Chef runs should do nothing:
I have come across the same issue and can confirm that it also happens on CentOS 7 here. After upgrading from 1.10.0 to 1.11.0, the hook file is missing and GPU containers fail to start.
Ideally, I'm looking for a solution where the RPM upgrade would resolve this problem automatically. In the company I work for, we provide software updates by deploying RPMs to target machines, where they get updated automatically. It is difficult for us to apply the workaround of first uninstalling 1.10.0 before updating.
May I suggest the following solution: for testing, I added a post scriptlet to the package's spec file.
In the %posttrans scriptlet, I added a few lines that restore the file later if it got deleted by 1.10.0 during uninstall:
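To illustrate the idea only (this is a sketch, not the actual patch; the backup path is made up, and the scriptlet bodies are plain shell), a %post/%posttrans pair could look like this:

```
%post
# keep a spare copy of the hook so it survives the 1.10.0 %postun that runs later in the upgrade
cp -f /usr/bin/nvidia-container-runtime-hook /usr/bin/.nvidia-container-runtime-hook.bak || true

%posttrans
# if the old package's %postun deleted the hook, restore it; otherwise drop the spare copy
if [ ! -e /usr/bin/nvidia-container-runtime-hook ] && [ -e /usr/bin/.nvidia-container-runtime-hook.bak ]; then
    mv /usr/bin/.nvidia-container-runtime-hook.bak /usr/bin/nvidia-container-runtime-hook
else
    rm -f /usr/bin/.nvidia-container-runtime-hook.bak
fi
```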
I believe I saw that for the downgrade case (back to 1.10.0) you have already added a fix (don't remove the file if it isn't a symlink). I am not sure about the Debian/Ubuntu package, as I am not familiar with deb packaging, but if it is affected by this issue too, then there could be a similar solution. I think this would also be beneficial for other users who might not be aware of this issue and the workaround; this change would fix it automatically.
@cvolz thanks for the detailed investigation. Would you be up to creating a merge request against https://gitlab.com/nvidia/container-toolkit/container-toolkit with your proposed changes so that these could be reviewed and included in the next release?
Hi @elezar, I'm open to contributing a merge request, but the question is when I will get to it, as I am currently tied up at work. I have not used your build environment or gitlab.com yet, so I will probably need some extra time to get set up. When are you planning the next release?
Hi @elezar, I have just opened the merge request for the above patch: GitLab !263. I have succeeded in building the RPM package and testing the upgrade and downgrade from/to 1.10.0, and it seems that the fix works as intended.
Sorry for the delay. I had a look at the MR yesterday. One small question / comment. The next non-RC release should go out by the end of the month.
This problem has come up again: Ubuntu 20.04, NVIDIA driver 535.86.05. The driver works on the host.
Cannot get the GPU to work with Docker. I have reinstalled Docker and reinstalled the nvidia-container-toolkit. No change.
The hooks are all in place.
Everything is at the latest versions:
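(The version listing itself was lost in formatting; for reference, it can be gathered with something like:)

```bash
docker --version
nvidia-smi --query-gpu=driver_version --format=csv,noheader
nvidia-container-cli --version
dpkg -l | grep -E 'nvidia-container|nvidia-docker'
```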
And let this be a lesson in proper use of docker context.
The error message could have been more helpful. Then again, if someone can set the context to something else, they can keep track of it.
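For anyone else landing here: if the Docker CLI is pointed at a different context (for example a rootless or remote daemon that does not have the NVIDIA runtime configured), --gpus can fail with exactly this error. Checking and resetting the context is quick:

```bash
# see which daemon the CLI is actually talking to
docker context ls

# switch back to the default (system) daemon if that is what you intended
docker context use default
```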
Very good.
I am following the official instructions to install the latest nvidia-docker2 and nvidia-container-toolkit packages.
OS: Ubuntu 18.04
But I cannot start Docker with the NVIDIA driver. The error message is:
##############################################
could not select device driver "" with capabilities: [[gpu]]
#############################################
On the host, I have already installed the NVIDIA driver and I can see the device using the nvidia-smi command:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03 Driver Version: 470.141.03 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 Off | N/A |
| 0% 45C P8 16W / 220W | 233MiB / 7973MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 3943 G /usr/lib/xorg/Xorg 18MiB |
| 0 N/A N/A 3976 G /usr/bin/gnome-shell 71MiB |
| 0 N/A N/A 4160 G /usr/lib/xorg/Xorg 112MiB |
| 0 N/A N/A 4297 G /usr/bin/gnome-shell 27MiB |
+-----------------------------------------------------------------------------+
I can also see the devices under /dev as follows:
/dev/nvidia0 /dev/nvidiactl /dev/nvidia-modeset /dev/nvidia-uvm /dev/nvidia-uvm-tools
I checked the log of nvidia-container-cli, and I can see the following warning messages:
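(The exact invocation was not captured; a debug log like the one below typically comes from the libnvidia-container CLI, e.g.:)

```bash
# -k loads the kernel modules if needed, -d writes debug logging to the given path
nvidia-container-cli -k -d /dev/tty info
```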
-- WARNING, the following logs are for debugging purposes only --
I0919 09:04:42.269104 28911 nvc.c:376] initializing library context (version=1.11.0, build=c8f267be0bac1c654d59ad4ea5df907141149977)
I0919 09:04:42.269187 28911 nvc.c:350] using root /
I0919 09:04:42.269210 28911 nvc.c:351] using ldcache /etc/ld.so.cache
I0919 09:04:42.269240 28911 nvc.c:352] using unprivileged user 1001:1001
I0919 09:04:42.269299 28911 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0919 09:04:42.269596 28911 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment
W0919 09:04:42.270954 28913 nvc.c:273] failed to set inheritable capabilities
W0919 09:04:42.271053 28913 nvc.c:274] skipping kernel modules load due to failure
I0919 09:04:42.271560 28914 rpc.c:71] starting driver rpc service
I0919 09:04:42.276809 28915 rpc.c:71] starting nvcgo rpc service
I0919 09:04:42.277317 28911 nvc_info.c:766] requesting driver information with ''
I0919 09:04:42.278263 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvoptix.so.470.141.03
I0919 09:04:42.278294 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.470.141.03
I0919 09:04:42.278310 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.470.141.03
I0919 09:04:42.278330 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.470.141.03
I0919 09:04:42.278346 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.470.141.03
I0919 09:04:42.278362 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.470.141.03
I0919 09:04:42.278380 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.470.141.03
I0919 09:04:42.278397 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.470.141.03
I0919 09:04:42.278412 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ifr.so.470.141.03
I0919 09:04:42.278426 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.470.141.03
I0919 09:04:42.278440 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.470.141.03
I0919 09:04:42.278455 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.470.141.03
I0919 09:04:42.278471 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.470.141.03
I0919 09:04:42.278487 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.470.141.03
I0919 09:04:42.278504 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.470.141.03
I0919 09:04:42.278522 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.470.141.03
I0919 09:04:42.278544 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.470.141.03
I0919 09:04:42.278563 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cbl.so.470.141.03
I0919 09:04:42.278583 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.470.141.03
I0919 09:04:42.278604 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.470.141.03
I0919 09:04:42.278728 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.470.141.03
I0919 09:04:42.278790 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.470.141.03
I0919 09:04:42.278813 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.470.141.03
I0919 09:04:42.278833 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.470.141.03
I0919 09:04:42.278854 28911 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.470.141.03
I0919 09:04:42.278887 28911 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-tls.so.470.141.03
I0919 09:04:42.278905 28911 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-ptxjitcompiler.so.470.141.03
I0919 09:04:42.278931 28911 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-opticalflow.so.470.141.03
I0919 09:04:42.278959 28911 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-opencl.so.470.141.03
I0919 09:04:42.278979 28911 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-ml.so.470.141.03
I0919 09:04:42.279006 28911 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-ifr.so.470.141.03
I0919 09:04:42.279033 28911 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-glvkspirv.so.470.141.03
I0919 09:04:42.279053 28911 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-glsi.so.470.141.03
I0919 09:04:42.279074 28911 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-glcore.so.470.141.03
I0919 09:04:42.279094 28911 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-fbc.so.470.141.03
I0919 09:04:42.279120 28911 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-encode.so.470.141.03
I0919 09:04:42.279145 28911 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-eglcore.so.470.141.03
I0919 09:04:42.279165 28911 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-compiler.so.470.141.03
I0919 09:04:42.279186 28911 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvcuvid.so.470.141.03
I0919 09:04:42.279222 28911 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libcuda.so.470.141.03
I0919 09:04:42.279255 28911 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libGLX_nvidia.so.470.141.03
I0919 09:04:42.279277 28911 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libGLESv2_nvidia.so.470.141.03
I0919 09:04:42.279297 28911 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libGLESv1_CM_nvidia.so.470.141.03
I0919 09:04:42.279318 28911 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libEGL_nvidia.so.470.141.03
W0919 09:04:42.279332 28911 nvc_info.c:399] missing library libnvidia-nscq.so
W0919 09:04:42.279337 28911 nvc_info.c:399] missing library libcudadebugger.so
W0919 09:04:42.279340 28911 nvc_info.c:399] missing library libnvidia-fatbinaryloader.so
W0919 09:04:42.279344 28911 nvc_info.c:399] missing library libnvidia-pkcs11.so
W0919 09:04:42.279349 28911 nvc_info.c:399] missing library libvdpau_nvidia.so
W0919 09:04:42.279354 28911 nvc_info.c:403] missing compat32 library libnvidia-cfg.so
W0919 09:04:42.279358 28911 nvc_info.c:403] missing compat32 library libnvidia-nscq.so
W0919 09:04:42.279362 28911 nvc_info.c:403] missing compat32 library libcudadebugger.so
W0919 09:04:42.279367 28911 nvc_info.c:403] missing compat32 library libnvidia-fatbinaryloader.so
W0919 09:04:42.279371 28911 nvc_info.c:403] missing compat32 library libnvidia-allocator.so
W0919 09:04:42.279376 28911 nvc_info.c:403] missing compat32 library libnvidia-pkcs11.so
W0919 09:04:42.279380 28911 nvc_info.c:403] missing compat32 library libnvidia-ngx.so
W0919 09:04:42.279384 28911 nvc_info.c:403] missing compat32 library libvdpau_nvidia.so
W0919 09:04:42.279388 28911 nvc_info.c:403] missing compat32 library libnvidia-rtcore.so
W0919 09:04:42.279391 28911 nvc_info.c:403] missing compat32 library libnvoptix.so
W0919 09:04:42.279395 28911 nvc_info.c:403] missing compat32 library libnvidia-cbl.so
I0919 09:04:42.279667 28911 nvc_info.c:299] selecting /usr/bin/nvidia-smi
I0919 09:04:42.279678 28911 nvc_info.c:299] selecting /usr/bin/nvidia-debugdump
I0919 09:04:42.279690 28911 nvc_info.c:299] selecting /usr/bin/nvidia-persistenced
I0919 09:04:42.279703 28911 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-control
W0919 09:04:42.279756 28911 nvc_info.c:425] missing binary nv-fabricmanager
W0919 09:04:42.279760 28911 nvc_info.c:425] missing binary nvidia-cuda-mps-server
I0919 09:04:42.279775 28911 nvc_info.c:343] listing firmware path /lib/firmware/nvidia/470.141.03/gsp.bin
I0919 09:04:42.279789 28911 nvc_info.c:529] listing device /dev/nvidiactl
I0919 09:04:42.279792 28911 nvc_info.c:529] listing device /dev/nvidia-uvm
I0919 09:04:42.279797 28911 nvc_info.c:529] listing device /dev/nvidia-uvm-tools
I0919 09:04:42.279800 28911 nvc_info.c:529] listing device /dev/nvidia-modeset
I0919 09:04:42.279814 28911 nvc_info.c:343] listing ipc path /run/nvidia-persistenced/socket
W0919 09:04:42.279828 28911 nvc_info.c:349] missing ipc path /var/run/nvidia-fabricmanager/socket
W0919 09:04:42.279837 28911 nvc_info.c:349] missing ipc path /tmp/nvidia-mps
I0919 09:04:42.279842 28911 nvc_info.c:822] requesting device information with ''
I0919 09:04:42.285437 28911 nvc_info.c:713] listing device /dev/nvidia0 (GPU-661838a0-fb69-bf82-164a-6c9ae0dcc7f6 at 00000000:01:00.0)
I0919 09:04:42.285446 28911 nvc.c:434] shutting down library context
I0919 09:04:42.285493 28915 rpc.c:95] terminating nvcgo rpc service
I0919 09:04:42.285765 28911 rpc.c:135] nvcgo rpc service terminated successfully
I0919 09:04:42.286026 28914 rpc.c:95] terminating driver rpc service
I0919 09:04:42.286086 28911 rpc.c:135] driver rpc service terminated successfully
NVRM version: 470.141.03
CUDA version: 11.4
Device Index: 0
Device Minor: 0
Model: NVIDIA GeForce RTX 3070
Brand: GeForce
GPU UUID: GPU-661838a0-fb69-bf82-164a-6c9ae0dcc7f6
Bus Location: 00000000:01:00.0
Architecture: 8.6
The strange thing is that I could successfully use Docker with the NVIDIA GPU before; it failed just after a reboot.
Nothing has changed, if my memory is correct. I have also tried reinstalling nvidia-container-toolkit and nvidia-docker2.
What can I do now?