nvidia-container-cli not detecting mig devices #86

Closed
sargreal opened this issue Aug 3, 2023 · 7 comments

sargreal commented Aug 3, 2023

The title says most of it: MIG devices are set up and work perfectly on the host, but nvidia-container-cli (and everything that uses it) does not find them.

The cause is most likely something that went wrong during installation, but I have not been able to find it, even after reinstalling all NVIDIA drivers many times. Other people with a very similar setup have gotten it to work without this issue.

Although I have already posted on several other forums, I am listing all of my installed versions, logs, and everything else I could find here, in the hope that somebody can spot why this does not work.

Running nvidia-smi on bare metal:

root@gpu1:~# nvidia-smi 
Thu Aug  3 14:07:25 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10              Driver Version: 535.86.10    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          On  | 00000000:CA:00.0 Off |                   On |
| N/A   35C    P0              42W / 300W |     87MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  0    3   0   0  |              25MiB / 19968MiB  | 28      0 |  2   0    1    0    0 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0    4   0   1  |              12MiB / 19968MiB  | 14      0 |  1   0    1    0    0 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0   11   0   2  |              12MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0   12   0   3  |              12MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0   13   0   4  |              12MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0   14   0   5  |              12MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
root@gpu1:~# nvidia-smi -L
GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-595a802e-9268-e1f8-cad5-9a69202e4cd5)
  MIG 2g.20gb     Device  0: (UUID: MIG-5a62f918-fd78-5a32-9614-008a1471bf61)
  MIG 1g.20gb     Device  1: (UUID: MIG-4759842f-f141-5b89-8f18-a8fc14133926)
  MIG 1g.10gb     Device  2: (UUID: MIG-c28f1ba1-5c54-5f04-a3ce-8b4eabb6d542)
  MIG 1g.10gb     Device  3: (UUID: MIG-566c026a-0826-5fd6-847a-b32e59131bdb)
  MIG 1g.10gb     Device  4: (UUID: MIG-c64a168b-dba3-5bc5-b5d4-0be7af011017)
  MIG 1g.10gb     Device  5: (UUID: MIG-ea6e7ab3-3a7e-5a6b-a133-2ed92d29bd97)
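
For reference, the `nvidia-smi -L` listing above can be turned into device identifiers mechanically. The following is only a sketch: the function names `parse_smi_list` and `device_specs` are made up for illustration, the sample text is a shortened copy of the output above, and the regular expressions assume the line layout shown in this post.

```python
import re

# Shortened copy of the `nvidia-smi -L` output from this post.
SAMPLE = """\
GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-595a802e-9268-e1f8-cad5-9a69202e4cd5)
  MIG 2g.20gb     Device  0: (UUID: MIG-5a62f918-fd78-5a32-9614-008a1471bf61)
  MIG 1g.20gb     Device  1: (UUID: MIG-4759842f-f141-5b89-8f18-a8fc14133926)
"""

# Patterns assume the exact layout printed above.
GPU_RE = re.compile(r"^GPU\s+(\d+):\s+(.*?)\s+\(UUID:\s+(GPU-[0-9a-f-]+)\)")
MIG_RE = re.compile(r"^\s+MIG\s+(\S+)\s+Device\s+(\d+):\s+\(UUID:\s+(MIG-[0-9a-f-]+)\)")


def parse_smi_list(text):
    """Return (gpu_index, mig_index, profile, uuid) for every MIG line."""
    devices = []
    gpu_index = None
    for line in text.splitlines():
        gpu = GPU_RE.match(line)
        if gpu:
            # Remember which physical GPU the following MIG lines belong to.
            gpu_index = int(gpu.group(1))
            continue
        mig = MIG_RE.match(line)
        if mig and gpu_index is not None:
            devices.append((gpu_index, int(mig.group(2)), mig.group(1), mig.group(3)))
    return devices


def device_specs(devices):
    """Build "<gpu index>:<mig index>" strings, e.g. "0:0"."""
    return [f"{gpu}:{mig}" for gpu, mig, _profile, _uuid in devices]


if __name__ == "__main__":
    for spec, dev in zip(device_specs(parse_smi_list(SAMPLE)), parse_smi_list(SAMPLE)):
        print(spec, dev[2], dev[3])
```

The `0:0` form produced by `device_specs` is the same form passed to `--gpus` in the single-device attempt further down.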

Running nvidia-smi in docker with all gpus:

root@gpu1:~# docker run --rm --gpus all nvidia/cuda:12.1.1-base-ubuntu20.04 nvidia-smi
Thu Aug  3 12:06:58 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10              Driver Version: 535.86.10    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          On  | 00000000:CA:00.0 Off |                   On |
| N/A   36C    P0              42W / 300W |                  N/A |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  No MIG devices found                                                                 |
+---------------------------------------------------------------------------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Trying to run with a single MIG device:

root@gpu1:~# docker run --rm --gpus '"device=0:0"' nvidia/cuda:12.1.1-base-ubuntu20.04 nvidia-smi
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: device error: 0:0: unknown device: unknown.
Logs of nvidia-container-cli when doing this:

I0803 12:09:42.262179 119924 nvc.c:376] initializing library context (version=1.13.5, build=66607bd046341f7aad7de80a9f022f122d1f2fce)
I0803 12:09:42.262266 119924 nvc.c:350] using root /
I0803 12:09:42.262273 119924 nvc.c:351] using ldcache /etc/ld.so.cache
I0803 12:09:42.262279 119924 nvc.c:352] using unprivileged user 65534:65534
I0803 12:09:42.262303 119924 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0803 12:09:42.262504 119924 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment
I0803 12:09:42.269532 119930 nvc.c:278] loading kernel module nvidia
I0803 12:09:42.270117 119930 nvc.c:282] running mknod for /dev/nvidiactl
I0803 12:09:42.270205 119930 nvc.c:286] running mknod for /dev/nvidia0
I0803 12:09:42.270257 119930 nvc.c:290] running mknod for all nvcaps in /dev/nvidia-caps
I0803 12:09:42.281597 119930 nvc.c:218] running mknod for /dev/nvidia-caps/nvidia-cap1 from /proc/driver/nvidia/capabilities/mig/config
E0803 12:09:42.281675 119930 nvc.c:292] could not create kernel module device nodes: error running mknod for nvcap: /proc/driver/nvidia/capabilities/mig/config
I0803 12:09:42.281680 119930 nvc.c:296] loading kernel module nvidia_uvm
I0803 12:09:42.281738 119930 nvc.c:300] running mknod for /dev/nvidia-uvm
I0803 12:09:42.281777 119930 nvc.c:305] loading kernel module nvidia_modeset
I0803 12:09:42.282086 119930 nvc.c:309] running mknod for /dev/nvidia-modeset
I0803 12:09:42.282502 119931 rpc.c:71] starting driver rpc service
I0803 12:09:42.320364 119934 rpc.c:71] starting nvcgo rpc service
I0803 12:09:42.322423 119924 nvc_container.c:240] configuring container with 'compute utility supervised'
I0803 12:09:42.322949 119924 nvc_container.c:88] selecting /var/lib/docker/overlay2/98edd5569298673e04e5f59dbdfd3db78c016d447d1da39ff6354436cba0246e/merged/usr/local/cuda-12.1/compat/libcuda.so.530.30.02
I0803 12:09:42.323076 119924 nvc_container.c:88] selecting /var/lib/docker/overlay2/98edd5569298673e04e5f59dbdfd3db78c016d447d1da39ff6354436cba0246e/merged/usr/local/cuda-12.1/compat/libcudadebugger.so.530.30.02
I0803 12:09:42.323169 119924 nvc_container.c:88] selecting /var/lib/docker/overlay2/98edd5569298673e04e5f59dbdfd3db78c016d447d1da39ff6354436cba0246e/merged/usr/local/cuda-12.1/compat/libnvidia-nvvm.so.530.30.02
I0803 12:09:42.323265 119924 nvc_container.c:88] selecting /var/lib/docker/overlay2/98edd5569298673e04e5f59dbdfd3db78c016d447d1da39ff6354436cba0246e/merged/usr/local/cuda-12.1/compat/libnvidia-ptxjitcompiler.so.530.30.02
I0803 12:09:42.326708 119924 nvc_container.c:262] setting pid to 119918
I0803 12:09:42.326752 119924 nvc_container.c:263] setting rootfs to /var/lib/docker/overlay2/98edd5569298673e04e5f59dbdfd3db78c016d447d1da39ff6354436cba0246e/merged
I0803 12:09:42.326767 119924 nvc_container.c:264] setting owner to 0:0
I0803 12:09:42.326781 119924 nvc_container.c:265] setting bins directory to /usr/bin
I0803 12:09:42.326794 119924 nvc_container.c:266] setting libs directory to /usr/lib/x86_64-linux-gnu
I0803 12:09:42.326808 119924 nvc_container.c:267] setting libs32 directory to /usr/lib/i386-linux-gnu
I0803 12:09:42.326821 119924 nvc_container.c:268] setting cudart directory to /usr/local/cuda
I0803 12:09:42.326834 119924 nvc_container.c:269] setting ldconfig to @/sbin/ldconfig (host relative)
I0803 12:09:42.326848 119924 nvc_container.c:270] setting mount namespace to /proc/119918/ns/mnt
I0803 12:09:42.326861 119924 nvc_container.c:272] detected cgroupv2
I0803 12:09:42.326874 119924 nvc_container.c:273] setting devices cgroup to /sys/fs/cgroup/system.slice/docker-892a41be29de3012114ed676afc08c8da3782304a93ec868c6f27fbddaf24588.scope
I0803 12:09:42.326899 119924 nvc_info.c:798] requesting driver information with ''
I0803 12:09:42.328916 119924 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libnvoptix.so.535.86.10
I0803 12:09:42.329072 119924 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.535.86.10
I0803 12:09:42.329153 119924 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.535.86.10
I0803 12:09:42.329288 119924 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ptxjitcompiler.so.535.86.10
I0803 12:09:42.329359 119924 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvidia-pkcs11.so.535.86.10
I0803 12:09:42.329422 119924 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvidia-pkcs11-openssl3.so.535.86.10
I0803 12:09:42.329539 119924 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-opticalflow.so.535.86.10
I0803 12:09:42.329672 119924 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-opencl.so.535.86.10
I0803 12:09:42.329834 119924 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-nvvm.so.535.86.10
I0803 12:09:42.330010 119924 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.535.86.10
I0803 12:09:42.330137 119924 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so.535.86.10
I0803 12:09:42.330216 119924 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.535.86.10
I0803 12:09:42.330293 119924 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.535.86.10
I0803 12:09:42.330367 119924 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.535.86.10
I0803 12:09:42.330499 119924 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-fbc.so.535.86.10
I0803 12:09:42.330627 119924 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-encode.so.535.86.10
I0803 12:09:42.330706 119924 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.535.86.10
I0803 12:09:42.330886 119924 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-cfg.so.535.86.10
I0803 12:09:42.331014 119924 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-allocator.so.535.86.10
I0803 12:09:42.331158 119924 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libnvcuvid.so.535.86.10
I0803 12:09:42.331704 119924 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libcudadebugger.so.535.86.10
I0803 12:09:42.331826 119924 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libcuda.so.535.86.10
I0803 12:09:42.332154 119924 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libGLX_nvidia.so.535.86.10
I0803 12:09:42.332287 119924 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libGLESv2_nvidia.so.535.86.10
I0803 12:09:42.332419 119924 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libGLESv1_CM_nvidia.so.535.86.10
I0803 12:09:42.332553 119924 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libEGL_nvidia.so.535.86.10
W0803 12:09:42.332603 119924 nvc_info.c:402] missing library libnvidia-nscq.so
W0803 12:09:42.332619 119924 nvc_info.c:402] missing library libnvidia-gpucomp.so
W0803 12:09:42.332632 119924 nvc_info.c:402] missing library libnvidia-fatbinaryloader.so
W0803 12:09:42.332645 119924 nvc_info.c:402] missing library libnvidia-compiler.so
W0803 12:09:42.332658 119924 nvc_info.c:402] missing library libvdpau_nvidia.so
W0803 12:09:42.332671 119924 nvc_info.c:402] missing library libnvidia-ifr.so
W0803 12:09:42.332684 119924 nvc_info.c:402] missing library libnvidia-cbl.so
W0803 12:09:42.332697 119924 nvc_info.c:406] missing compat32 library libnvidia-ml.so
W0803 12:09:42.332710 119924 nvc_info.c:406] missing compat32 library libnvidia-cfg.so
W0803 12:09:42.332723 119924 nvc_info.c:406] missing compat32 library libnvidia-nscq.so
W0803 12:09:42.332736 119924 nvc_info.c:406] missing compat32 library libcuda.so
W0803 12:09:42.332749 119924 nvc_info.c:406] missing compat32 library libcudadebugger.so
W0803 12:09:42.332762 119924 nvc_info.c:406] missing compat32 library libnvidia-opencl.so
W0803 12:09:42.332774 119924 nvc_info.c:406] missing compat32 library libnvidia-gpucomp.so
W0803 12:09:42.332787 119924 nvc_info.c:406] missing compat32 library libnvidia-ptxjitcompiler.so
W0803 12:09:42.332800 119924 nvc_info.c:406] missing compat32 library libnvidia-fatbinaryloader.so
W0803 12:09:42.332813 119924 nvc_info.c:406] missing compat32 library libnvidia-allocator.so
W0803 12:09:42.332825 119924 nvc_info.c:406] missing compat32 library libnvidia-compiler.so
W0803 12:09:42.332838 119924 nvc_info.c:406] missing compat32 library libnvidia-pkcs11.so
W0803 12:09:42.332851 119924 nvc_info.c:406] missing compat32 library libnvidia-pkcs11-openssl3.so
W0803 12:09:42.332864 119924 nvc_info.c:406] missing compat32 library libnvidia-nvvm.so
W0803 12:09:42.332877 119924 nvc_info.c:406] missing compat32 library libnvidia-ngx.so
W0803 12:09:42.332889 119924 nvc_info.c:406] missing compat32 library libvdpau_nvidia.so
W0803 12:09:42.332902 119924 nvc_info.c:406] missing compat32 library libnvidia-encode.so
W0803 12:09:42.332915 119924 nvc_info.c:406] missing compat32 library libnvidia-opticalflow.so
W0803 12:09:42.332928 119924 nvc_info.c:406] missing compat32 library libnvcuvid.so
W0803 12:09:42.332941 119924 nvc_info.c:406] missing compat32 library libnvidia-eglcore.so
W0803 12:09:42.332953 119924 nvc_info.c:406] missing compat32 library libnvidia-glcore.so
W0803 12:09:42.332966 119924 nvc_info.c:406] missing compat32 library libnvidia-tls.so
W0803 12:09:42.332979 119924 nvc_info.c:406] missing compat32 library libnvidia-glsi.so
W0803 12:09:42.332992 119924 nvc_info.c:406] missing compat32 library libnvidia-fbc.so
W0803 12:09:42.333005 119924 nvc_info.c:406] missing compat32 library libnvidia-ifr.so
W0803 12:09:42.333017 119924 nvc_info.c:406] missing compat32 library libnvidia-rtcore.so
W0803 12:09:42.333030 119924 nvc_info.c:406] missing compat32 library libnvoptix.so
W0803 12:09:42.333043 119924 nvc_info.c:406] missing compat32 library libGLX_nvidia.so
W0803 12:09:42.333056 119924 nvc_info.c:406] missing compat32 library libEGL_nvidia.so
W0803 12:09:42.333076 119924 nvc_info.c:406] missing compat32 library libGLESv2_nvidia.so
W0803 12:09:42.333089 119924 nvc_info.c:406] missing compat32 library libGLESv1_CM_nvidia.so
W0803 12:09:42.333102 119924 nvc_info.c:406] missing compat32 library libnvidia-glvkspirv.so
W0803 12:09:42.333115 119924 nvc_info.c:406] missing compat32 library libnvidia-cbl.so
I0803 12:09:42.333712 119924 nvc_info.c:302] selecting /usr/lib/nvidia/current/nvidia-smi
I0803 12:09:42.333811 119924 nvc_info.c:302] selecting /usr/lib/nvidia/current/nvidia-debugdump
I0803 12:09:42.333854 119924 nvc_info.c:302] selecting /usr/bin/nvidia-persistenced
I0803 12:09:42.333897 119924 nvc_info.c:302] selecting /usr/bin/nv-fabricmanager
I0803 12:09:42.333938 119924 nvc_info.c:302] selecting /usr/bin/nvidia-cuda-mps-control
I0803 12:09:42.333978 119924 nvc_info.c:302] selecting /usr/bin/nvidia-cuda-mps-server
I0803 12:09:42.334081 119924 nvc_info.c:488] listing firmware path /lib/firmware/nvidia/535.86.10/gsp_ga10x.bin
I0803 12:09:42.334096 119924 nvc_info.c:488] listing firmware path /lib/firmware/nvidia/535.86.10/gsp_tu10x.bin
I0803 12:09:42.334151 119924 nvc_info.c:561] listing device /dev/nvidiactl
I0803 12:09:42.334164 119924 nvc_info.c:561] listing device /dev/nvidia-uvm
I0803 12:09:42.334177 119924 nvc_info.c:561] listing device /dev/nvidia-uvm-tools
I0803 12:09:42.334190 119924 nvc_info.c:561] listing device /dev/nvidia-modeset
I0803 12:09:42.334249 119924 nvc_info.c:346] listing ipc path /run/nvidia-persistenced/socket
W0803 12:09:42.334300 119924 nvc_info.c:352] missing ipc path /var/run/nvidia-fabricmanager/socket
W0803 12:09:42.334336 119924 nvc_info.c:352] missing ipc path /tmp/nvidia-mps
I0803 12:09:42.334350 119924 nvc_info.c:854] requesting device information with ''
I0803 12:09:42.389707 119924 nvc_info.c:745] listing device /dev/nvidia0 (GPU-595a802e-9268-e1f8-cad5-9a69202e4cd5 at 00000000:ca:00.0)
I0803 12:09:42.389816 119924 nvc.c:434] shutting down library context
I0803 12:09:42.390011 119934 rpc.c:95] terminating nvcgo rpc service
I0803 12:09:42.391123 119924 rpc.c:135] nvcgo rpc service terminated successfully
I0803 12:09:42.399791 119931 rpc.c:95] terminating driver rpc service
I0803 12:09:42.400042 119924 rpc.c:135] driver rpc service terminated successfully
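
Since the log is long and mostly informational, here is a small sketch for pulling out only the error lines. It assumes every line has the shape seen above (severity letter, MMDD date, time, PID, `file:line]`, message); the names `parse_line` and `errors_only` are hypothetical helpers, not part of any NVIDIA tool.

```python
import re

# Severity letter, MMDD, HH:MM:SS.micros, pid, "file:line]", message.
LOG_RE = re.compile(
    r"^([IWEF])(\d{4}) (\d{2}:\d{2}:\d{2}\.\d+)\s+(\d+) ([^\]]+)\] (.*)$"
)

# Three lines copied from the log in this post.
SAMPLE = """\
I0803 12:09:42.281597 119930 nvc.c:218] running mknod for /dev/nvidia-caps/nvidia-cap1 from /proc/driver/nvidia/capabilities/mig/config
E0803 12:09:42.281675 119930 nvc.c:292] could not create kernel module device nodes: error running mknod for nvcap: /proc/driver/nvidia/capabilities/mig/config
W0803 12:09:42.332603 119924 nvc_info.c:402] missing library libnvidia-nscq.so
"""


def parse_line(line):
    """Split one log line into its fields, or return None if it does not match."""
    match = LOG_RE.match(line)
    if match is None:
        return None
    severity, date, time, pid, source, message = match.groups()
    return {
        "severity": severity,
        "date": date,
        "time": time,
        "pid": int(pid),
        "source": source,
        "message": message,
    }


def errors_only(text):
    """Return the parsed records whose severity is E (error) or F (fatal)."""
    records = (parse_line(line) for line in text.splitlines())
    return [r for r in records if r is not None and r["severity"] in "EF"]


if __name__ == "__main__":
    for record in errors_only(SAMPLE):
        print(record["source"], record["message"])
```

Applied to the full log above, the only `E` line is the failed mknod for `/proc/driver/nvidia/capabilities/mig/config` at `nvc.c:292`; everything else is `I` or `W`.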

Running nvidia-container-cli list:

/dev/nvidiactl
/dev/nvidia-uvm
/dev/nvidia-uvm-tools
/dev/nvidia-modeset
/dev/nvidia0
/usr/lib/nvidia/current/nvidia-smi
/usr/lib/nvidia/current/nvidia-debugdump
/usr/bin/nvidia-persistenced
/usr/bin/nv-fabricmanager
/usr/bin/nvidia-cuda-mps-control
/usr/bin/nvidia-cuda-mps-server
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so.530.30.02
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-cfg.so.530.30.02
/usr/lib/x86_64-linux-gnu/nvidia/current/libcuda.so.530.30.02
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-opencl.so.530.30.02
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ptxjitcompiler.so.530.30.02
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-allocator.so.530.30.02
/usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.530.30.02
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-nvvm.so.530.30.02
/usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.530.30.02
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-encode.so.530.30.02
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-opticalflow.so.530.30.02
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvcuvid.so.530.30.02
/usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.530.30.02
/usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.530.30.02
/usr/lib/x86_64-linux-gnu/libnvidia-tls.so.530.30.02
/usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.530.30.02
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-fbc.so.530.30.02
/usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.530.30.02
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvoptix.so.530.30.02
/usr/lib/x86_64-linux-gnu/nvidia/current/libGLX_nvidia.so.530.30.02
/usr/lib/x86_64-linux-gnu/nvidia/current/libEGL_nvidia.so.530.30.02
/usr/lib/x86_64-linux-gnu/nvidia/current/libGLESv2_nvidia.so.530.30.02
/usr/lib/x86_64-linux-gnu/nvidia/current/libGLESv1_CM_nvidia.so.530.30.02
/usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.530.30.02
/run/nvidia-persistenced/socket
/lib/firmware/nvidia/530.30.02/gsp_ga10x.bin
/lib/firmware/nvidia/530.30.02/gsp_tu10x.bin
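
One detail in this listing: the library and firmware paths end in 530.30.02, while nvidia-smi and dpkg report 535.86.10. I do not know whether that is related to the MIG problem, or whether the listing was simply captured at a different time. A sketch for spotting such a difference automatically follows; `driver_versions` and `mismatched` are names made up for this example, and the version pattern only covers the two path shapes that appear in this post.

```python
import re
from collections import Counter

# Matches "libfoo.so.530.30.02" and "/lib/firmware/nvidia/530.30.02/...".
VERSION_RE = re.compile(r"(?:\.so\.|/firmware/nvidia/)(\d+(?:\.\d+)+)")

# A few paths copied from the `nvidia-container-cli list` output above.
SAMPLE_PATHS = [
    "/dev/nvidia0",
    "/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so.530.30.02",
    "/usr/lib/x86_64-linux-gnu/nvidia/current/libcuda.so.530.30.02",
    "/lib/firmware/nvidia/530.30.02/gsp_ga10x.bin",
]


def driver_versions(paths):
    """Count how often each driver version appears in the given paths."""
    counts = Counter()
    for path in paths:
        match = VERSION_RE.search(path)
        if match:
            counts[match.group(1)] += 1
    return counts


def mismatched(paths, expected):
    """Return the sorted versions found in paths that differ from expected."""
    return sorted(v for v in driver_versions(paths) if v != expected)


if __name__ == "__main__":
    # 535.86.10 is the driver version reported by nvidia-smi in this post.
    print(driver_versions(SAMPLE_PATHS))
    print(mismatched(SAMPLE_PATHS, "535.86.10"))
```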

Actual NVIDIA device nodes on the host:

root@gpu1:~# ls -al /dev/nvidia*
crw-rw-rw- 1 root root 195,   0 Aug  3 10:48 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 Aug  3 10:48 /dev/nvidiactl
crw-rw-rw- 1 root root 195, 254 Aug  3 10:48 /dev/nvidia-modeset
crw-rw-rw- 1 root root 508,   0 Aug  3 10:48 /dev/nvidia-uvm
crw-rw-rw- 1 root root 508,   1 Aug  3 10:48 /dev/nvidia-uvm-tools

/dev/nvidia-caps:
total 0
drw-rw-rw-  2 root root      320 Aug  3 10:48 .
drwxr-xr-x 21 root root     4560 Aug  3 10:53 ..
cr--------  1 root root 238,   1 Aug  3 10:48 nvidia-cap1
cr--r--r--  1 root root 238, 102 Aug  3 10:48 nvidia-cap102
cr--r--r--  1 root root 238, 103 Aug  3 10:48 nvidia-cap103
cr--r--r--  1 root root 238, 111 Aug  3 10:48 nvidia-cap111
cr--r--r--  1 root root 238, 112 Aug  3 10:48 nvidia-cap112
cr--r--r--  1 root root 238, 120 Aug  3 10:48 nvidia-cap120
cr--r--r--  1 root root 238, 121 Aug  3 10:48 nvidia-cap121
cr--r--r--  1 root root 238, 129 Aug  3 10:48 nvidia-cap129
cr--r--r--  1 root root 238, 130 Aug  3 10:48 nvidia-cap130
cr--r--r--  1 root root 238,   2 Aug  3 10:48 nvidia-cap2
cr--r--r--  1 root root 238,  30 Aug  3 10:48 nvidia-cap30
cr--r--r--  1 root root 238,  31 Aug  3 10:48 nvidia-cap31
cr--r--r--  1 root root 238,  39 Aug  3 10:48 nvidia-cap39
cr--r--r--  1 root root 238,  40 Aug  3 10:48 nvidia-cap40
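
The minor numbers in `/dev/nvidia-caps` line up with the GI IDs that nvidia-smi reports (3, 4, 11, 12, 13, 14). The arithmetic below is a pattern inferred only from the numbers in this post, not taken from documentation: each GPU instance appears to get the minor `3 + 9 * GI`, and its compute instance 0 the next one. Under that assumption, a quick consistency check looks like this (`expected_minors` is a hypothetical helper):

```python
def expected_minors(gi_ids, ci_id=0):
    """Minor numbers expected for the given GI IDs.

    Assumption inferred from this post's listing, for GPU 0 only:
    GI access minor = 3 + 9 * gi, CI access minor = GI minor + 1 + ci.
    """
    minors = set()
    for gi in gi_ids:
        gi_minor = 3 + 9 * gi
        minors.add(gi_minor)              # GPU instance access node
        minors.add(gi_minor + 1 + ci_id)  # compute instance access node
    return minors


# GI IDs from the nvidia-smi table at the top of this post.
GI_IDS = [3, 4, 11, 12, 13, 14]

# Minor numbers from the /dev/nvidia-caps listing above.
PRESENT = {1, 2, 30, 31, 39, 40, 102, 103, 111, 112, 120, 121, 129, 130}

if __name__ == "__main__":
    expected = expected_minors(GI_IDS)
    print("missing:", sorted(expected - PRESENT))
    # Minors 1 and 2 are the two nodes not tied to a GPU instance.
    print("not explained by a GI:", sorted(PRESENT - expected))
```

With the values from this post, no expected minor is missing, and the only nodes not explained by a GPU instance are `nvidia-cap1` and `nvidia-cap2`. So the device nodes on the host look consistent with the configured MIG instances.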

Installed versions:

root@gpu1:~# dpkg -l | grep 'nvidia
> cuda'
ii  cuda                                 12.2.1-1                       amd64        CUDA meta-package
ii  cuda-12-2                            12.2.1-1                       amd64        CUDA 12.2 meta-package
ii  cuda-cccl-12-2                       12.2.128-1                     amd64        CUDA CCCL
ii  cuda-command-line-tools-12-2         12.2.1-1                       amd64        CUDA command-line tools
ii  cuda-compiler-12-2                   12.2.1-1                       amd64        CUDA compiler
ii  cuda-crt-12-2                        12.2.128-1                     amd64        CUDA crt
ii  cuda-cudart-12-2                     12.2.128-1                     amd64        CUDA Runtime native Libraries
ii  cuda-cudart-dev-12-2                 12.2.128-1                     amd64        CUDA Runtime native dev links, headers
ii  cuda-cuobjdump-12-2                  12.2.128-1                     amd64        CUDA cuobjdump
ii  cuda-cupti-12-2                      12.2.131-1                     amd64        CUDA profiling tools runtime libs.
ii  cuda-cupti-dev-12-2                  12.2.131-1                     amd64        CUDA profiling tools interface.
ii  cuda-cuxxfilt-12-2                   12.2.128-1                     amd64        CUDA cuxxfilt
ii  cuda-demo-suite-12-2                 12.2.128-1                     amd64        Demo suite for CUDA
ii  cuda-documentation-12-2              12.2.128-1                     amd64        CUDA documentation
ii  cuda-driver-dev-12-2                 12.2.128-1                     amd64        CUDA Driver native dev stub library
ii  cuda-drivers                         535.86.10-1                    amd64        CUDA Driver meta-package, branch-agnostic
ii  cuda-drivers-535                     535.86.10-1                    amd64        CUDA Driver meta-package, branch-specific
ii  cuda-gdb-12-2                        12.2.128-1                     amd64        CUDA-GDB
ii  cuda-keyring                         1.1-1                          all          GPG keyring for the CUDA repository
ii  cuda-libraries-12-2                  12.2.1-1                       amd64        CUDA Libraries 12.2 meta-package
ii  cuda-libraries-dev-12-2              12.2.1-1                       amd64        CUDA Libraries 12.2 development meta-package
ii  cuda-nsight-12-2                     12.2.128-1                     amd64        CUDA nsight
ii  cuda-nsight-compute-12-2             12.2.1-1                       amd64        NVIDIA Nsight Compute
ii  cuda-nsight-systems-12-2             12.2.1-1                       amd64        NVIDIA Nsight Systems
ii  cuda-nvcc-12-2                       12.2.128-1                     amd64        CUDA nvcc
ii  cuda-nvdisasm-12-2                   12.2.128-1                     amd64        CUDA disassembler
ii  cuda-nvml-dev-12-2                   12.2.128-1                     amd64        NVML native dev links, headers
ii  cuda-nvprof-12-2                     12.2.131-1                     amd64        CUDA Profiler tools
ii  cuda-nvprune-12-2                    12.2.128-1                     amd64        CUDA nvprune
ii  cuda-nvrtc-12-2                      12.2.128-1                     amd64        NVRTC native runtime libraries
ii  cuda-nvrtc-dev-12-2                  12.2.128-1                     amd64        NVRTC native dev links, headers
ii  cuda-nvtx-12-2                       12.2.128-1                     amd64        NVIDIA Tools Extension
ii  cuda-nvvm-12-2                       12.2.128-1                     amd64        CUDA nvvm
ii  cuda-nvvp-12-2                       12.2.131-1                     amd64        CUDA Profiler tools
ii  cuda-opencl-12-2                     12.2.128-1                     amd64        CUDA OpenCL native Libraries
ii  cuda-opencl-dev-12-2                 12.2.128-1                     amd64        CUDA OpenCL native dev links, headers
ii  cuda-profiler-api-12-2               12.2.128-1                     amd64        CUDA Profiler API
ii  cuda-runtime-12-2                    12.2.1-1                       amd64        CUDA Runtime 12.2 meta-package
ii  cuda-sanitizer-12-2                  12.2.128-1                     amd64        CUDA Sanitizer
rc  cuda-toolkit-11-8-config-common      11.8.89-1                      all          Common config package for CUDA Toolkit 11.8.
rc  cuda-toolkit-11-config-common        11.8.89-1                      all          Common config package for CUDA Toolkit 11.
rc  cuda-toolkit-12-1-config-common      12.1.105-1                     all          Common config package for CUDA Toolkit 12.1.
ii  cuda-toolkit-12-2                    12.2.1-1                       amd64        CUDA Toolkit 12.2 meta-package
ii  cuda-toolkit-12-2-config-common      12.2.128-1                     all          Common config package for CUDA Toolkit 12.2.
ii  cuda-toolkit-12-config-common        12.2.128-1                     all          Common config package for CUDA Toolkit 12.
ii  cuda-toolkit-config-common           12.2.128-1                     all          Common config package for CUDA Toolkit.
ii  cuda-tools-12-2                      12.2.1-1                       amd64        CUDA Tools meta-package
rc  cuda-visual-tools-12-1               12.1.1-1                       amd64        CUDA visual tools
ii  cuda-visual-tools-12-2               12.2.1-1                       amd64        CUDA visual tools
ii  glx-alternative-nvidia               1.2.1~deb11u1                  amd64        allows the selection of NVIDIA as GLX provider
ii  libcuda1:amd64                       535.86.10-1                    amd64        NVIDIA CUDA Driver Library
ii  libcudadebugger1:amd64               535.86.10-1                    amd64        NVIDIA CUDA Debugger
ii  libegl-nvidia0:amd64                 535.86.10-1                    amd64        NVIDIA binary EGL library
ii  libgl1-nvidia-glvnd-glx:amd64        535.86.10-1                    amd64        NVIDIA binary OpenGL/GLX library (GLVND variant)
ii  libgles-nvidia1:amd64                535.86.10-1                    amd64        NVIDIA binary OpenGL|ES 1.x library
ii  libgles-nvidia2:amd64                535.86.10-1                    amd64        NVIDIA binary OpenGL|ES 2.x library
ii  libglx-nvidia0:amd64                 535.86.10-1                    amd64        NVIDIA binary GLX library
ii  libnvidia-allocator1:amd64           535.86.10-1                    amd64        NVIDIA allocator runtime library
ii  libnvidia-cfg1:amd64                 535.86.10-1                    amd64        NVIDIA binary OpenGL/GLX configuration library
ii  libnvidia-container-tools            1.13.5-1                       amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64           1.13.5-1                       amd64        NVIDIA container runtime library
ii  libnvidia-eglcore:amd64              535.86.10-1                    amd64        NVIDIA binary EGL core libraries
ii  libnvidia-encode1:amd64              535.86.10-1                    amd64        NVENC Video Encoding runtime library
ii  libnvidia-fbc1:amd64                 535.86.10-1                    amd64        NVIDIA OpenGL-based Framebuffer Capture runtime library
ii  libnvidia-glcore:amd64               535.86.10-1                    amd64        NVIDIA binary OpenGL/GLX core libraries
ii  libnvidia-glvkspirv:amd64            535.86.10-1                    amd64        NVIDIA binary Vulkan Spir-V compiler library
ii  libnvidia-ml1:amd64                  535.86.10-1                    amd64        NVIDIA Management Library (NVML) runtime library
ii  libnvidia-nvvm4:amd64                535.86.10-1                    amd64        NVIDIA NVVM
ii  libnvidia-opticalflow1:amd64         535.86.10-1                    amd64        NVIDIA Optical Flow runtime library
ii  libnvidia-pkcs11:amd64               535.86.10-1                    amd64        NVIDIA pkcs runtime library
ii  libnvidia-ptxjitcompiler1:amd64      535.86.10-1                    amd64        NVIDIA PTX JIT Compiler
ii  libnvidia-rtcore:amd64               535.86.10-1                    amd64        NVIDIA binary Vulkan ray tracing (rtcore) library
ii  libnvidia-wayland-client:amd64       535.86.10-1                    amd64        NVIDIA client for wayland library
ii  nvidia-alternative                   535.86.10-1                    amd64        allows the selection of NVIDIA as GLX provider
ii  nvidia-container-toolkit             1.13.5-1                       amd64        NVIDIA Container toolkit
ii  nvidia-container-toolkit-base        1.13.5-1                       amd64        NVIDIA Container Toolkit Base
ii  nvidia-cuda-mps                      535.86.10-1                    amd64        NVIDIA CUDA Multi Process Service (MPS)
rc  nvidia-cuda-toolkit                  11.2.2-3+deb11u3               amd64        NVIDIA CUDA development toolkit
ii  nvidia-detect                        535.86.10-1                    amd64        NVIDIA GPU detection utility
ii  nvidia-driver                        535.86.10-1                    amd64        NVIDIA metapackage
ii  nvidia-driver-bin                    535.86.10-1                    amd64        NVIDIA driver support binaries
ii  nvidia-driver-libs:amd64             535.86.10-1                    amd64        NVIDIA metapackage (OpenGL/GLX/EGL/GLES libraries)
ii  nvidia-egl-common                    535.86.10-1                    amd64        NVIDIA binary EGL driver - common files
ii  nvidia-egl-icd:amd64                 535.86.10-1                    amd64        NVIDIA EGL installable client driver (ICD)
ii  nvidia-fabricmanager-530             530.30.02-1                    amd64        Fabric Manager for NVSwitch based systems.
ii  nvidia-installer-cleanup             20151021+13                    amd64        cleanup after driver installation with the nvidia-installer
ii  nvidia-kernel-common                 20151021+13                    amd64        NVIDIA binary kernel module support files
ii  nvidia-kernel-dkms                   535.86.10-1                    amd64        NVIDIA binary kernel module DKMS source
ii  nvidia-kernel-support                535.86.10-1                    amd64        NVIDIA binary kernel module support files
ii  nvidia-legacy-check                  535.86.10-1                    amd64        check for NVIDIA GPUs requiring a legacy driver
ii  nvidia-libopencl1:amd64              535.86.10-1                    amd64        NVIDIA OpenCL ICD Loader library
ii  nvidia-modprobe                      535.86.10-1                    amd64        utility to load NVIDIA kernel modules and create device nodes
ii  nvidia-opencl-common                 535.86.10-1                    amd64        NVIDIA OpenCL driver - common files
ii  nvidia-opencl-icd:amd64              535.86.10-1                    amd64        NVIDIA OpenCL installable client driver (ICD)
ii  nvidia-persistenced                  535.86.10-1                    amd64        daemon to maintain persistent software state in the NVIDIA driver
ii  nvidia-settings                      535.86.10-1                    amd64        tool for configuring the NVIDIA graphics driver
ii  nvidia-smi                           535.86.10-1                    amd64        NVIDIA System Management Interface
ii  nvidia-support                       20151021+13                    amd64        NVIDIA binary graphics driver support files
ii  nvidia-vdpau-driver:amd64            535.86.10-1                    amd64        Video Decode and Presentation API for Unix - NVIDIA driver
ii  nvidia-vulkan-common                 535.86.10-1                    amd64        NVIDIA Vulkan driver - common files
ii  nvidia-vulkan-icd:amd64              535.86.10-1                    amd64        NVIDIA Vulkan installable client driver (ICD)
ii  nvidia-xconfig                       535.86.10-1                    amd64        deprecated X configuration tool for non-free NVIDIA drivers
ii  xserver-xorg-video-nvidia            535.86.10-1                    amd64        NVIDIA binary Xorg driver

System Information:

  • Operating System: Debian 11 - Proxmox
  • Kernel Version: Linux 5.15.108-1-pve
  • CPU: 32 x Intel(R) Xeon(R) Silver 4309Y CPU @ 2.80GHz (2 Sockets)
  • Nvidia Card: A100 80GB

References:

@klueska
Contributor

klueska commented Aug 3, 2023

I'm not sure why it's happening, but the error stems from:

E0803 12:09:42.281675 119930 nvc.c:292] could not create kernel module device nodes: error running mknod for nvcap: /proc/driver/nvidia/capabilities/mig/config

Once it fails on creating the first nvcap device it will not attempt to create any more (and these nvcap devices are needed to enumerate / access MIG devices).
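To make the nvcap relationship concrete, here is a minimal sketch of how the fields in a capability file map to a device node. It assumes the node for minor N lives at /dev/nvidia-caps/nvidia-cap<N> and that DeviceFileMode is a decimal permission value; the helper names (`parse_proc_kv`, `describe_cap`) are hypothetical and not part of libnvidia-container.

```python
def parse_proc_kv(text):
    """Parse 'Key: value' lines as printed by the /proc/driver/nvidia files."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def describe_cap(text, caps_dir="/dev/nvidia-caps"):
    """Derive the expected device node for one capability file.

    Assumption: the node is named nvidia-cap<minor> under caps_dir.
    """
    fields = parse_proc_kv(text)
    minor = int(fields["DeviceFileMinor"])
    mode = int(fields["DeviceFileMode"])  # decimal in /proc, e.g. 256
    return {
        "path": f"{caps_dir}/nvidia-cap{minor}",
        "minor": minor,
        "mode_octal": oct(mode),
        "modify": fields.get("DeviceFileModify") == "1",
    }


# Sample taken from the mig/config output quoted in this thread.
sample = """DeviceFileMinor: 1
DeviceFileMode: 256
DeviceFileModify: 1
"""
info = describe_cap(sample)
print(info)
```

On a real host the text would come from reading /proc/driver/nvidia/capabilities/mig/config. Comparing the derived path and mode against `ls -l /dev/nvidia-caps/` may show whether the node is missing or has unexpected permissions.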

@klueska
Contributor

klueska commented Aug 3, 2023

Here is the point in the code where this error occurs:
https://github.com/NVIDIA/libnvidia-container/blob/f7fb88c5571e9e6089c5e36982449dac0d774bba/src/nvc.c#L291

@klueska
Contributor

klueska commented Aug 3, 2023

What does the following command show for you:

$ cat /proc/driver/nvidia/capabilities/mig/config
DeviceFileMinor: 1
DeviceFileMode: 256
DeviceFileModify: 1

Also this:

$ cat /proc/driver/nvidia/params
ResmanDebugLevel: 4294967295
RmLogonRC: 1
ModifyDeviceFiles: 1
DeviceFileUID: 0
DeviceFileGID: 0
DeviceFileMode: 438
InitializeSystemMemoryAllocations: 1
UsePageAttributeTable: 4294967295
EnableMSI: 1
EnablePCIeGen3: 0
MemoryPoolSize: 0
KMallocHeapMaxSize: 0
VMallocHeapMaxSize: 0
IgnoreMMIOCheck: 0
TCEBypassMode: 0
EnableStreamMemOPs: 0
EnableUserNUMAManagement: 1
NvLinkDisable: 0
RmProfilingAdminOnly: 1
PreserveVideoMemoryAllocations: 0
EnableS0ixPowerManagement: 0
S0ixPowerManagementVideoMemoryThreshold: 256
DynamicPowerManagement: 3
DynamicPowerManagementVideoMemoryThreshold: 200
RegisterPCIDriver: 1
EnablePCIERelaxedOrderingMode: 1
EnableGpuFirmware: 1
EnableGpuFirmwareLogs: 2
EnableDbgBreakpoint: 0
OpenRmEnableUnsupportedGpus: 0
DmaRemapPeerMmio: 1
RegistryDwords: ""
RegistryDwordsPerDevice: ""
RmMsg: ""
GpuBlacklist: ""
TemporaryFilePath: ""
ExcludedGpus: ""

@sargreal
Author

sargreal commented Aug 3, 2023

This:

cat /proc/driver/nvidia/capabilities/mig/config
DeviceFileMinor: 1
DeviceFileMode: 256
DeviceFileModify: 1

And This:

root@gpu1:~# cat /proc/driver/nvidia/params
ResmanDebugLevel: 4294967295
RmLogonRC: 1
ModifyDeviceFiles: 1
DeviceFileUID: 0
DeviceFileGID: 0
DeviceFileMode: 438
InitializeSystemMemoryAllocations: 1
UsePageAttributeTable: 4294967295
EnableMSI: 1
EnablePCIeGen3: 0
MemoryPoolSize: 0
KMallocHeapMaxSize: 0
VMallocHeapMaxSize: 0
IgnoreMMIOCheck: 0
TCEBypassMode: 0
EnableStreamMemOPs: 0
EnableUserNUMAManagement: 1
NvLinkDisable: 0
RmProfilingAdminOnly: 1
PreserveVideoMemoryAllocations: 0
EnableS0ixPowerManagement: 0
S0ixPowerManagementVideoMemoryThreshold: 256
DynamicPowerManagement: 3
DynamicPowerManagementVideoMemoryThreshold: 200
RegisterPCIDriver: 1
EnablePCIERelaxedOrderingMode: 0
EnableResizableBar: 0
EnableGpuFirmware: 18
EnableGpuFirmwareLogs: 2
EnableDbgBreakpoint: 0
OpenRmEnableUnsupportedGpus: 0
DmaRemapPeerMmio: 1
RegistryDwords: ""
RegistryDwordsPerDevice: ""
RmMsg: ""
GpuBlacklist: ""
TemporaryFilePath: ""
ExcludedGpus: ""

If I am not mistaken, these are the only differences:

EnablePCIERelaxedOrderingMode: 0
EnableResizableBar: 0
EnableGpuFirmware: 18
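The manual comparison above can be double-checked with a short script. This is only a sketch: the function names are hypothetical, and the two samples are trimmed to a few lines from the outputs quoted in this thread rather than the full params files.

```python
def parse_params(text):
    """Parse 'Key: value' lines from /proc/driver/nvidia/params output."""
    params = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            params[key.strip()] = value.strip()
    return params


def diff_params(reference, observed):
    """Return {key: (reference_value, observed_value)} for every differing key.

    A key missing on one side is reported with None for that side.
    """
    ref, obs = parse_params(reference), parse_params(observed)
    diffs = {}
    for key in sorted(set(ref) | set(obs)):
        if ref.get(key) != obs.get(key):
            diffs[key] = (ref.get(key), obs.get(key))
    return diffs


# Trimmed samples: "reference" is the working system, "observed" the failing one.
reference = """EnablePCIERelaxedOrderingMode: 1
EnableGpuFirmware: 1
EnableGpuFirmwareLogs: 2
"""
observed = """EnablePCIERelaxedOrderingMode: 0
EnableResizableBar: 0
EnableGpuFirmware: 18
EnableGpuFirmwareLogs: 2
"""
result = diff_params(reference, observed)
for key, (ref_val, obs_val) in result.items():
    print(f"{key}: reference={ref_val} observed={obs_val}")
```

Feeding it the two complete outputs should report the same three keys, with EnableResizableBar present only on the failing system.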

@Lanyujiex

Any ideas?
I also encountered this error in k8s.

@Lanyujiex

When I reinstalled the driver, the problem was solved.

@sargreal
Author

I found my issue. Following another guide somewhere on the internet, I had added the following udev rule:

KERNEL=="nvidia", RUN+="/bin/bash -c '/usr/bin/nvidia-smi -L && /bin/chmod 666 /dev/nvidia* && /usr/bin/nvidia-modprobe -c0 -u && /bin/chmod 0666 /dev/nvidia-uvm*'"

After removing that rule and restarting, it worked!

So the takeaway is: do not fiddle with the nvidia devices, and don't run nvidia-modprobe yourself.
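For anyone who wants to check their own system for a similar leftover rule, here is a rough sketch that flags udev rule lines which run nvidia-modprobe or chmod the /dev/nvidia* nodes. The pattern and the function name are assumptions chosen for illustration; the heuristic can miss rules or flag harmless ones.

```python
import re

# Heuristic (an assumption, not an official check): a RUN action that calls
# nvidia-modprobe or runs chmod on something under /dev/nvidia.
SUSPECT = re.compile(r"nvidia-modprobe|chmod\s+\S+\s+/dev/nvidia")


def find_suspect_rules(rules_text):
    """Return (line_number, line) pairs for rule lines matching the heuristic."""
    hits = []
    for lineno, line in enumerate(rules_text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        if "RUN" in stripped and SUSPECT.search(stripped):
            hits.append((lineno, stripped))
    return hits


# Sample file content; line 2 is the rule quoted in this thread.
sample = '''# example rules file
KERNEL=="nvidia", RUN+="/bin/bash -c '/usr/bin/nvidia-smi -L && /bin/chmod 666 /dev/nvidia* && /usr/bin/nvidia-modprobe -c0 -u && /bin/chmod 0666 /dev/nvidia-uvm*'"
KERNEL=="sda", SYMLINK+="disk0"
'''
hits = find_suspect_rules(sample)
for lineno, rule in hits:
    print(f"line {lineno}: {rule}")
```

To scan a real machine, the same function could be applied to the contents of each *.rules file under /etc/udev/rules.d and /lib/udev/rules.d.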
