This repository has been archived by the owner on Jan 22, 2024. It is now read-only.

Failures on Debian with ldconfig #1399

Closed
brycelelbach opened this issue Oct 14, 2020 · 47 comments

@brycelelbach

1. Issue or feature description

On Debian 10 and Debian unstable, nvidia-docker fails to run programs that use CUDA inside containers UNLESS ldconfig is run first in the container to rebuild the ldconfig cache.

Example failure:

[17:07:54]:wash@voyager:/home/wash/development/nvidia/cuda_linux_p4/sw/gpgpu/thrust/ci:0:$ sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.
[17:07:57]:wash@voyager:/home/wash/development/nvidia/cuda_linux_p4/sw/gpgpu/thrust/ci:12:$ tail -n 5 /var/log/nvidia-container-toolkit.log
I1014 00:07:56.925518 4001429 nvc_ldcache.c:359] executing /sbin/ldconfig from host at /var/lib/docker/overlay2/1b23287eb935d89df1baab6e66ded34209ac3f6a371ccb11c307a553bd11cff4/merged
E1014 00:07:56.926236 1 nvc_ldcache.c:390] could not start /sbin/ldconfig: process execution failed: no such file or directory
I1014 00:07:56.943973 4001429 nvc.c:337] shutting down library context
I1014 00:07:56.944378 4001435 driver.c:156] terminating driver service
I1014 00:07:56.944613 4001429 driver.c:196] driver service terminated successfully

If I run ldconfig within the container to rebuild ld.so.cache first, everything works:

[17:09:00]:wash@voyager:/home/wash/development/nvidia/cuda_linux_p4/sw/gpgpu/thrust/ci:0:$ sudo docker run --rm --gpus all nvidia/cuda:11.0-base bash -c "ldconfig && nvidia-smi"
Wed Oct 14 00:11:34 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.10       Driver Version: 455.10       CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GT 710      On   | 00000000:04:00.0 N/A |                  N/A |
| 40%   41C    P8    N/A /  N/A |      1MiB /  2002MiB |     N/A      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  TITAN Xp            On   | 00000000:17:00.0 Off |                  N/A |
| 23%   30C    P8     8W / 250W |      1MiB / 12196MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Quadro GV100        On   | 00000000:65:00.0  On |                  Off |
| 32%   44C    P0    26W / 250W |      0MiB / 32505MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
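
One way to persist this workaround until the root cause is fixed is a tiny wrapper entrypoint that refreshes the ld cache before the real command runs. A minimal sketch (entrypoint.sh is a hypothetical name, not something shipped in the CUDA images; it would be set as the image entrypoint or passed via --entrypoint):

#!/bin/sh
# entrypoint.sh: rebuild the ld cache so the driver libraries injected by the
# NVIDIA hook become resolvable, then exec the real command.
ldconfig
exec "$@"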

This seems related to:

2. Steps to reproduce the issue

3. Information to attach (optional if deemed irrelevant)

  • Some nvidia-container information: nvidia-container-cli -k -d /dev/tty info
I1014 00:13:02.642845 4001640 nvc.c:282] initializing library context (version=1.3.0, build=16315ebdf4b9728e899f615e208b50c41d7a5d15)
I1014 00:13:02.642869 4001640 nvc.c:256] using root /
I1014 00:13:02.642873 4001640 nvc.c:257] using ldcache /etc/ld.so.cache
I1014 00:13:02.642876 4001640 nvc.c:258] using unprivileged user 1000:1000
I1014 00:13:02.642887 4001640 nvc.c:299] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I1014 00:13:02.642982 4001640 nvc.c:301] dxcore initialization failed, continuing assuming a non-WSL environment
W1014 00:13:02.644169 4001641 nvc.c:187] failed to set inheritable capabilities
W1014 00:13:02.644192 4001641 nvc.c:188] skipping kernel modules load due to failure
I1014 00:13:02.644291 4001642 driver.c:101] starting driver service
I1014 00:13:02.645319 4001640 nvc_info.c:680] requesting driver information with ''
I1014 00:13:02.646053 4001640 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia.so.455.10
I1014 00:13:02.646118 4001640 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvoptix.so.455.10
I1014 00:13:02.646144 4001640 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.455.10
I1014 00:13:02.646162 4001640 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.455.10
I1014 00:13:02.646182 4001640 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.455.10
I1014 00:13:02.646210 4001640 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.455.10
I1014 00:13:02.646239 4001640 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.455.10
I1014 00:13:02.646258 4001640 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.455.10
I1014 00:13:02.646278 4001640 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.455.10
I1014 00:13:02.646305 4001640 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ifr.so.455.10
I1014 00:13:02.646335 4001640 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.455.10
I1014 00:13:02.646354 4001640 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.455.10
I1014 00:13:02.646374 4001640 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.455.10
I1014 00:13:02.646393 4001640 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.455.10
I1014 00:13:02.646421 4001640 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.455.10
I1014 00:13:02.646448 4001640 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.455.10
I1014 00:13:02.646467 4001640 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.455.10
I1014 00:13:02.646487 4001640 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.455.10
I1014 00:13:02.646514 4001640 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cbl.so.455.10
I1014 00:13:02.646532 4001640 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.455.10
I1014 00:13:02.646562 4001640 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.455.10
I1014 00:13:02.646670 4001640 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.455.10
I1014 00:13:02.646765 4001640 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.455.10
I1014 00:13:02.646786 4001640 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.455.10
I1014 00:13:02.646807 4001640 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.455.10
I1014 00:13:02.646828 4001640 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.455.10
I1014 00:13:02.646853 4001640 nvc_info.c:169] selecting /usr/lib/i386-linux-gnu/vdpau/libvdpau_nvidia.so.455.10
I1014 00:13:02.646880 4001640 nvc_info.c:169] selecting /usr/lib/i386-linux-gnu/libnvidia-tls.so.455.10
I1014 00:13:02.646898 4001640 nvc_info.c:169] selecting /usr/lib/i386-linux-gnu/libnvidia-ptxjitcompiler.so.455.10
I1014 00:13:02.646925 4001640 nvc_info.c:169] selecting /usr/lib/i386-linux-gnu/libnvidia-opticalflow.so.455.10
I1014 00:13:02.646952 4001640 nvc_info.c:169] selecting /usr/lib/i386-linux-gnu/libnvidia-opencl.so.455.10
I1014 00:13:02.646971 4001640 nvc_info.c:169] selecting /usr/lib/i386-linux-gnu/libnvidia-ml.so.455.10
I1014 00:13:02.646998 4001640 nvc_info.c:169] selecting /usr/lib/i386-linux-gnu/libnvidia-ifr.so.455.10
I1014 00:13:02.647027 4001640 nvc_info.c:169] selecting /usr/lib/i386-linux-gnu/libnvidia-glvkspirv.so.455.10
I1014 00:13:02.647045 4001640 nvc_info.c:169] selecting /usr/lib/i386-linux-gnu/libnvidia-glsi.so.455.10
I1014 00:13:02.647064 4001640 nvc_info.c:169] selecting /usr/lib/i386-linux-gnu/libnvidia-glcore.so.455.10
I1014 00:13:02.647083 4001640 nvc_info.c:169] selecting /usr/lib/i386-linux-gnu/libnvidia-fbc.so.455.10
I1014 00:13:02.647110 4001640 nvc_info.c:169] selecting /usr/lib/i386-linux-gnu/libnvidia-encode.so.455.10
I1014 00:13:02.647137 4001640 nvc_info.c:169] selecting /usr/lib/i386-linux-gnu/libnvidia-eglcore.so.455.10
I1014 00:13:02.647156 4001640 nvc_info.c:169] selecting /usr/lib/i386-linux-gnu/libnvidia-compiler.so.455.10
I1014 00:13:02.647175 4001640 nvc_info.c:169] selecting /usr/lib/i386-linux-gnu/libnvidia-allocator.so.455.10
I1014 00:13:02.647204 4001640 nvc_info.c:169] selecting /usr/lib/i386-linux-gnu/libnvcuvid.so.455.10
I1014 00:13:02.647242 4001640 nvc_info.c:169] selecting /usr/lib/i386-linux-gnu/libcuda.so.455.10
I1014 00:13:02.647278 4001640 nvc_info.c:169] selecting /usr/lib/i386-linux-gnu/libGLX_nvidia.so.455.10
I1014 00:13:02.647297 4001640 nvc_info.c:169] selecting /usr/lib/i386-linux-gnu/libGLESv2_nvidia.so.455.10
I1014 00:13:02.647317 4001640 nvc_info.c:169] selecting /usr/lib/i386-linux-gnu/libGLESv1_CM_nvidia.so.455.10
I1014 00:13:02.647337 4001640 nvc_info.c:169] selecting /usr/lib/i386-linux-gnu/libEGL_nvidia.so.455.10
W1014 00:13:02.647351 4001640 nvc_info.c:350] missing library libnvidia-fatbinaryloader.so
W1014 00:13:02.647356 4001640 nvc_info.c:354] missing compat32 library libnvidia-cfg.so
W1014 00:13:02.647360 4001640 nvc_info.c:354] missing compat32 library libnvidia-fatbinaryloader.so
W1014 00:13:02.647363 4001640 nvc_info.c:354] missing compat32 library libnvidia-ngx.so
W1014 00:13:02.647366 4001640 nvc_info.c:354] missing compat32 library libnvidia-rtcore.so
W1014 00:13:02.647370 4001640 nvc_info.c:354] missing compat32 library libnvoptix.so
W1014 00:13:02.647373 4001640 nvc_info.c:354] missing compat32 library libnvidia-cbl.so
I1014 00:13:02.655816 4001640 nvc_info.c:276] selecting /usr/bin/nvidia-smi
I1014 00:13:02.655827 4001640 nvc_info.c:276] selecting /usr/bin/nvidia-debugdump
I1014 00:13:02.655839 4001640 nvc_info.c:276] selecting /usr/bin/nvidia-persistenced
I1014 00:13:02.655849 4001640 nvc_info.c:276] selecting /usr/bin/nvidia-cuda-mps-control
I1014 00:13:02.655859 4001640 nvc_info.c:276] selecting /usr/bin/nvidia-cuda-mps-server
I1014 00:13:02.655874 4001640 nvc_info.c:438] listing device /dev/nvidiactl
I1014 00:13:02.655877 4001640 nvc_info.c:438] listing device /dev/nvidia-uvm
I1014 00:13:02.655882 4001640 nvc_info.c:438] listing device /dev/nvidia-uvm-tools
I1014 00:13:02.655887 4001640 nvc_info.c:438] listing device /dev/nvidia-modeset
I1014 00:13:02.655928 4001640 nvc_info.c:317] listing ipc /run/nvidia-persistenced/socket
W1014 00:13:02.655937 4001640 nvc_info.c:321] missing ipc /tmp/nvidia-mps
I1014 00:13:02.655941 4001640 nvc_info.c:745] requesting device information with ''
I1014 00:13:02.661724 4001640 nvc_info.c:628] listing device /dev/nvidia0 (GPU-858ec672-5669-6e20-d0e8-194029d32d2c at 00000000:04:00.0)
I1014 00:13:02.667422 4001640 nvc_info.c:628] listing device /dev/nvidia1 (GPU-2da062d6-3b80-9750-0af9-85d39d0b010b at 00000000:17:00.0)
I1014 00:13:02.673130 4001640 nvc_info.c:628] listing device /dev/nvidia2 (GPU-58a70c9d-1070-2a96-e5b3-cbee8d19d9e3 at 00000000:65:00.0)
NVRM version:   455.10
CUDA version:   11.1

Device Index:   0
Device Minor:   0
Model:          GeForce GT 710
Brand:          GeForce
GPU UUID:       GPU-858ec672-5669-6e20-d0e8-194029d32d2c
Bus Location:   00000000:04:00.0
Architecture:   3.5

Device Index:   1
Device Minor:   1
Model:          TITAN Xp
Brand:          GeForce
GPU UUID:       GPU-2da062d6-3b80-9750-0af9-85d39d0b010b
Bus Location:   00000000:17:00.0
Architecture:   6.1

Device Index:   2
Device Minor:   2
Model:          Quadro GV100
Brand:          Quadro
GPU UUID:       GPU-58a70c9d-1070-2a96-e5b3-cbee8d19d9e3
Bus Location:   00000000:65:00.0
Architecture:   7.0
I1014 00:13:02.673174 4001640 nvc.c:337] shutting down library context
I1014 00:13:02.673531 4001642 driver.c:156] terminating driver service
I1014 00:13:02.673681 4001640 driver.c:196] driver service terminated successfully
  • Kernel version from uname -a
Linux voyager 5.5.0-1-amd64 #1 SMP Debian 5.5.13-2 (2020-03-30) x86_64 GNU/Linux
  • Driver information from nvidia-smi -a
Timestamp                                 : Tue Oct 13 17:15:06 2020
Driver Version                            : 455.10
CUDA Version                              : 11.1

Attached GPUs                             : 3
GPU 00000000:04:00.0
    Product Name                          : GeForce GT 710
    Product Brand                         : GeForce
    Display Mode                          : N/A
    Display Active                        : N/A
    Persistence Mode                      : Enabled
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : N/A
    Accounting Mode Buffer Size           : N/A
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : N/A
    GPU UUID                              : GPU-858ec672-5669-6e20-d0e8-194029d32d2c
    Minor Number                          : 0
    VBIOS Version                         : 80.28.A6.00.12
    MultiGPU Board                        : N/A
    Board ID                              : N/A
    GPU Part Number                       : N/A
    Inforom Version
        Image Version                     : N/A
        OEM Object                        : N/A
        ECC Object                        : N/A
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GPU Virtualization Mode
        Virtualization Mode               : N/A
        Host VGPU Mode                    : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x04
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x128B10DE
        Bus Id                            : 00000000:04:00.0
        Sub System Id                     : 0x27123842
        GPU Link Info
            PCIe Generation
                Max                       : N/A
                Current                   : N/A
            Link Width
                Max                       : N/A
                Current                   : N/A
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : N/A
        Rx Throughput                     : N/A
    Fan Speed                             : 40 %
    Performance State                     : P8
    Clocks Throttle Reasons               : N/A
    FB Memory Usage
        Total                             : 2002 MiB
        Used                              : 1 MiB
        Free                              : 2001 MiB
    BAR1 Memory Usage
        Total                             : N/A
        Used                              : N/A
        Free                              : N/A
    Compute Mode                          : Default
    Utilization
        Gpu                               : N/A
        Memory                            : N/A
        Encoder                           : N/A
        Decoder                           : N/A
    Encoder Stats
        Active Sessions                   : N/A
        Average FPS                       : N/A
        Average Latency                   : N/A
    FBC Stats
        Active Sessions                   : N/A
        Average FPS                       : N/A
        Average Latency                   : N/A
    Ecc Mode
        Current                           : N/A
        Pending                           : N/A
    ECC Errors
        Volatile
            Single Bit
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
            Double Bit
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
        Aggregate
            Single Bit
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
            Double Bit
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : 41 C
        GPU Shutdown Temp                 : N/A
        GPU Slowdown Temp                 : N/A
        GPU Max Operating Temp            : N/A
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    Power Readings
        Power Management                  : N/A
        Power Draw                        : N/A
        Power Limit                       : N/A
        Default Power Limit               : N/A
        Enforced Power Limit              : N/A
        Min Power Limit                   : N/A
        Max Power Limit                   : N/A
    Clocks
        Graphics                          : N/A
        SM                                : N/A
        Memory                            : N/A
        Video                             : N/A
    Applications Clocks
        Graphics                          : 954 MHz
        Memory                            : 900 MHz
    Default Applications Clocks
        Graphics                          : 954 MHz
        Memory                            : 900 MHz
    Max Clocks
        Graphics                          : N/A
        SM                                : N/A
        Memory                            : N/A
        Video                             : N/A
    Max Customer Boost Clocks
        Graphics                          : N/A
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Processes                             : None

GPU 00000000:17:00.0
    Product Name                          : TITAN Xp
    Product Brand                         : GeForce
    Display Mode                          : Disabled
    Display Active                        : Disabled
    Persistence Mode                      : Enabled
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : N/A
    GPU UUID                              : GPU-2da062d6-3b80-9750-0af9-85d39d0b010b
    Minor Number                          : 1
    VBIOS Version                         : 86.02.49.00.00
    MultiGPU Board                        : No
    Board ID                              : 0x1700
    GPU Part Number                       : N/A
    Inforom Version
        Image Version                     : G001.0000.01.04
        OEM Object                        : 1.1
        ECC Object                        : N/A
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x17
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x1B0210DE
        Bus Id                            : 00000000:17:00.0
        Sub System Id                     : 0x11DF10DE
        GPU Link Info
            PCIe Generation
                Max                       : 3
                Current                   : 1
            Link Width
                Max                       : 16x
                Current                   : 8x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
    Fan Speed                             : 23 %
    Performance State                     : P8
    Clocks Throttle Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 12196 MiB
        Used                              : 1 MiB
        Free                              : 12195 MiB
    BAR1 Memory Usage
        Total                             : 256 MiB
        Used                              : 5 MiB
        Free                              : 251 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : N/A
        Pending                           : N/A
    ECC Errors
        Volatile
            Single Bit
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
            Double Bit
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
        Aggregate
            Single Bit
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
            Double Bit
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : 30 C
        GPU Shutdown Temp                 : 99 C
        GPU Slowdown Temp                 : 96 C
        GPU Max Operating Temp            : N/A
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 9.77 W
        Power Limit                       : 250.00 W
        Default Power Limit               : 250.00 W
        Enforced Power Limit              : 250.00 W
        Min Power Limit                   : 125.00 W
        Max Power Limit                   : 300.00 W
    Clocks
        Graphics                          : 139 MHz
        SM                                : 139 MHz
        Memory                            : 405 MHz
        Video                             : 544 MHz
    Applications Clocks
        Graphics                          : 1404 MHz
        Memory                            : 5705 MHz
    Default Applications Clocks
        Graphics                          : 1404 MHz
        Memory                            : 5705 MHz
    Max Clocks
        Graphics                          : 1911 MHz
        SM                                : 1911 MHz
        Memory                            : 5705 MHz
        Video                             : 1620 MHz
    Max Customer Boost Clocks
        Graphics                          : N/A
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Processes                             : None

GPU 00000000:65:00.0
    Product Name                          : Quadro GV100
    Product Brand                         : Quadro
    Display Mode                          : Enabled
    Display Active                        : Enabled
    Persistence Mode                      : Enabled
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 0320319013520
    GPU UUID                              : GPU-58a70c9d-1070-2a96-e5b3-cbee8d19d9e3
    Minor Number                          : 2
    VBIOS Version                         : 88.00.5A.00.03
    MultiGPU Board                        : No
    Board ID                              : 0x6500
    GPU Part Number                       : 900-5G500-0000-000
    Inforom Version
        Image Version                     : G500.0500.00.05
        OEM Object                        : 1.1
        ECC Object                        : 5.0
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x65
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x1DBA10DE
        Bus Id                            : 00000000:65:00.0
        Sub System Id                     : 0x121A10DE
        GPU Link Info
            PCIe Generation
                Max                       : 3
                Current                   : 3
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
    Fan Speed                             : 32 %
    Performance State                     : P0
    Clocks Throttle Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 32505 MiB
        Used                              : 0 MiB
        Free                              : 32505 MiB
    BAR1 Memory Usage
        Total                             : 256 MiB
        Used                              : 3 MiB
        Free                              : 253 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    Ecc Mode
        Current                           : Disabled
        Pending                           : Disabled
    ECC Errors
        Volatile
            Single Bit
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
            Double Bit
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
        Aggregate
            Single Bit
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
            Double Bit
                Device Memory             : N/A
                Register File             : N/A
                L1 Cache                  : N/A
                L2 Cache                  : N/A
                Texture Memory            : N/A
                Texture Shared            : N/A
                CBU                       : N/A
                Total                     : N/A
    Retired Pages
        Single Bit ECC                    : 0
        Double Bit ECC                    : 0
        Pending Page Blacklist            : No
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : 44 C
        GPU Shutdown Temp                 : 90 C
        GPU Slowdown Temp                 : 88 C
        GPU Max Operating Temp            : 87 C
        Memory Current Temp               : 42 C
        Memory Max Operating Temp         : 95 C
    Power Readings
        Power Management                  : Supported
        Power Draw                        : 26.83 W
        Power Limit                       : 250.00 W
        Default Power Limit               : 250.00 W
        Enforced Power Limit              : 250.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 250.00 W
    Clocks
        Graphics                          : 135 MHz
        SM                                : 135 MHz
        Memory                            : 850 MHz
        Video                             : 555 MHz
    Applications Clocks
        Graphics                          : 1132 MHz
        Memory                            : 850 MHz
    Default Applications Clocks
        Graphics                          : 1132 MHz
        Memory                            : 850 MHz
    Max Clocks
        Graphics                          : 1912 MHz
        SM                                : 1912 MHz
        Memory                            : 850 MHz
        Video                             : 1717 MHz
    Max Customer Boost Clocks
        Graphics                          : 1912 MHz
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Processes                             : None
  • Docker version from docker version
Client: Docker Engine - Community
 Version:           19.03.8
 API version:       1.40
 Go version:        go1.12.17
 Git commit:        afacb8b7f0
 Built:             Wed Mar 11 01:26:02 2020
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.12
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.13.10
  Git commit:       48a66213fe
  Built:            Mon Jun 22 15:44:23 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.13
  GitCommit:        7ad184331fa3e55e52b890ea95e65ba581ae3429
 runc:
  Version:          1.0.0-rc10
  GitCommit:        dc9208a3303feef5b3839f4323d9beb36df0a9dd
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683
  • NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                          Version      Architecture Description
+++-=============================-============-============-=====================================================
un  libgldispatch0-nvidia         <none>       <none>       (no description available)
ii  libnvidia-container-tools     1.3.0-1      amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64    1.3.0-1      amd64        NVIDIA container runtime library
ii  nvidia-container-runtime      3.4.0-1      amd64        NVIDIA container runtime
un  nvidia-container-runtime-hook <none>       <none>       (no description available)
ii  nvidia-container-toolkit      1.3.0-1      amd64        NVIDIA container runtime hook
un  nvidia-docker                 <none>       <none>       (no description available)
ii  nvidia-docker2                2.5.0-1      all          nvidia-docker CLI wrapper
un  nvidia-libopencl1-dev         <none>       <none>       (no description available)

My installation of the display driver and CUDA is a local debug build from source and is roughly CUDA 11.0 / R455.

  • NVIDIA container library version from nvidia-container-cli -V
version: 1.3.0
build date: 2020-09-16T12:33+00:00
build revision: 16315ebdf4b9728e899f615e208b50c41d7a5d15
build compiler: x86_64-linux-gnu-gcc-8 8.3.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
brycelelbach added a commit to NVIDIA/cccl that referenced this issue Oct 14, 2020
@dualvtable
Contributor

@klueska - do you have any insight into this issue?

@nvjmayo nvjmayo self-assigned this Oct 30, 2020
@nvjmayo
Contributor

nvjmayo commented Oct 30, 2020

In your example I see you use sudo:
sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
Sudo should not be necessary to run this, and I won't be using it in my examples below.
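
For reference, the usual way to drop sudo is to add your user to the docker group (standard Docker setup, unrelated to nvidia-docker itself); something like:

sudo usermod -aG docker "$USER"   # then log out and back in for it to take effect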

Here's an attempt with an older CUDA 10.2 image that does work on Debian 10 with an older version of the container library (1.2.0):

~$ cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 10 (buster)"
NAME="Debian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
~$ nvidia-container-cli -V
version: 1.2.0
build date: 2020-07-08T19:33+00:00
build revision: d22237acaea94aa5ad5de70aac903534ed598819
build compiler: x86_64-linux-gnu-gcc-8 8.3.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
~$ docker run --rm --gpus all nvidia/cuda:10.2-base nvidia-smi
Unable to find image 'nvidia/cuda:10.2-base' locally
10.2-base: Pulling from nvidia/cuda
171857c49d0f: Pull complete
419640447d26: Pull complete
61e52f862619: Pull complete
c118dad7e37a: Pull complete
29c091e4be16: Pull complete
d85c81a4428d: Pull complete
Digest: sha256:1774efa6e102a9bdfa393521c6b2254ea751dc1af9cff25c51376128bfb08d5e
Status: Downloaded newer image for nvidia/cuda:10.2-base
Fri Oct 30 19:59:35 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.31       Driver Version: 440.31       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GT 710      Off  | 00000000:65:00.0 N/A |                  N/A |
| 50%   40C    P0    N/A /  N/A |      0MiB /  2001MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0                    Not Supported                                       |
+-----------------------------------------------------------------------------+

I'm leaning towards these possibilities:

  • insufficient support for CUDA 11 library paths in the version of nvidia-docker you have.
  • failure of nvidia-docker to detect Ubuntu/Debian versus RHEL/CentOS path conventions (seems unlikely, but the logic can be a little fragile, so worth me looking into).
  • a regression in Debian support in 1.3.0 versus 1.2.0 (see above, and the quick check just below this list).
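
A sketch of that quick check, reusing the nvidia/cuda:11.0-base image from the report: compare what the hook mounted into the container with what the container's ld cache actually knows about.

# If the first listing shows the driver libraries but the second is empty,
# the mount worked and only the ldconfig refresh step is failing.
docker run --rm --gpus all nvidia/cuda:11.0-base bash -c \
  "ls /usr/lib/x86_64-linux-gnu/ | grep -iE 'libnvidia|libcuda'; echo ---; ldconfig -p | grep -iE 'libnvidia|libcuda'"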

@klueska
Contributor

klueska commented Oct 30, 2020

I am also unable to reproduce this on a fresh Debian 10 system with the latest drivers and the latest nvidia-docker2.

My entire history of commands from the time I brought the system online until I executed docker:

    1  sudo apt-get install build-essential
    2  sudo apt-get install linux-headers-$(uname -r)
    3  wget https://us.download.nvidia.com/tesla/450.80.02/NVIDIA-Linux-x86_64-450.80.02.run
    4  chmod a+x NVIDIA-Linux-x86_64-450.80.02.run
    5  sudo ./NVIDIA-Linux-x86_64-450.80.02.run
    6  curl https://get.docker.com | sh   && sudo systemctl start docker   && sudo systemctl enable docker
    7  distribution=$(. /etc/os-release;echo $ID$VERSION_ID)    && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -    && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
    8  sudo apt-get update
    9  sudo apt-get install -y nvidia-docker2
   10  sudo systemctl restart docker
   11  sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi

With output:

$ sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
Unable to find image 'nvidia/cuda:11.0-base' locally
11.0-base: Pulling from nvidia/cuda
54ee1f796a1e: Pull complete
f7bfea53ad12: Pull complete
46d371e02073: Pull complete
b66c17bbf772: Pull complete
3642f1a6dfb3: Pull complete
e5ce55b8b4b9: Pull complete
155bc0332b0a: Pull complete
Digest: sha256:774ca3d612de15213102c2dbbba55df44dc5cf9870ca2be6c6e9c627fa63d67a
Status: Downloaded newer image for nvidia/cuda:11.0-base
Fri Oct 30 22:00:59 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  T4 32GB             Off  | 00000000:65:00.0 Off |                    0 |
| N/A   38C    P0    19W / 150W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

System info:

$ sudo lsb_release -a
No LSB modules are available.
Distributor ID:	Debian
Description:	Debian GNU/Linux 10 (buster)
Release:	10

Package info:

$ dpkg -l '*nvidia*'
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                             Version      Architecture Description
+++-================================-============-============-=====================================================
un  libgldispatch0-nvidia            <none>       <none>       (no description available)
ii  libnvidia-container-tools        1.3.0-1      amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64       1.3.0-1      amd64        NVIDIA container runtime library
ii  nvidia-container-runtime         3.4.0-1      amd64        NVIDIA container runtime
un  nvidia-container-runtime-hook    <none>       <none>       (no description available)
ii  nvidia-container-toolkit         1.3.0-1      amd64        NVIDIA container runtime hook
un  nvidia-docker                    <none>       <none>       (no description available)
ii  nvidia-docker2                   2.5.0-1      all          nvidia-docker CLI wrapper
un  nvidia-legacy-304xx-vdpau-driver <none>       <none>       (no description available)
un  nvidia-legacy-340xx-vdpau-driver <none>       <none>       (no description available)
un  nvidia-vdpau-driver              <none>       <none>       (no description available)

Contents of /etc/nvidia-container-runtime/config.toml (the default for Debian 10):

disable-require = false
#swarm-resource = "DOCKER_RESOURCE_GPU"
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
#accept-nvidia-visible-devices-as-volume-mounts = false

[nvidia-container-cli]
#root = "/run/nvidia/driver"
#path = "/usr/bin/nvidia-container-cli"
environment = []
#debug = "/var/log/nvidia-container-toolkit.log"
#ldcache = "/etc/ld.so.cache"
load-kmods = true
#no-cgroups = false
#user = "root:video"
ldconfig = "@/sbin/ldconfig"

[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"

@xkszltl

xkszltl commented Nov 27, 2020

Kind of off topic, but am I seeing a 16GB GPU reported as "T4 32GB"?

@3ronco

3ronco commented Nov 29, 2020

I don't think it's a problem with ldconfig but with the nvidia container runtime prestart hook, which provides the necessary driver libs from the host dynamically and depends on correct configuration of NVIDIA_VISIBLE_DEVICES & NVIDIA_DRIVER_CAPABILITIES. Notice that my config.toml is in its original state (with the @ in ldconfig):

disable-require = false
#swarm-resource = "DOCKER_RESOURCE_GPU"
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
#accept-nvidia-visible-devices-as-volume-mounts = false

[nvidia-container-cli]
#root = "/run/nvidia/driver"
#path = "/usr/bin/nvidia-container-cli"
environment = []
#debug = "/var/log/nvidia-container-toolkit.log"
#ldcache = "/etc/ld.so.cache"
load-kmods = true
#no-cgroups = false
#user = "root:video"
ldconfig = "@/sbin/ldconfig"

[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"

1) Logically, it fails when you omit both NVIDIA_VISIBLE_DEVICES & NVIDIA_DRIVER_CAPABILITIES:

docker run -ti --rm --privileged \
	--runtime=nvidia \
	nvidia/libnvidia-container/debian10-amd64;
me@container:/tmp/libnvidia-container# nvidia-container-cli list
nvidia-container-cli: error while loading shared libraries: libnvidia-container.so.1: cannot open shared object file: No such file or directory

2) With NVIDIA_VISIBLE_DEVICES set to the first device:

docker run -ti --rm --privileged --runtime=nvidia \
		-e NVIDIA_VISIBLE_DEVICES=0 \
		nvidia/libnvidia-container/debian10-amd64;

So by default without NVIDIA_DRIVER_CAPABILITIES it loads the config & management libs:

me@container:/tmp/libnvidia-container# ll /usr/lib/x86_64-linux-gnu/ | grep libnv
lrwxrwxrwx 1 root root   27 Nov 29 16:29 libnvidia-cfg.so.1 -> libnvidia-cfg.so.418.152.00
-rw-r--r-- 1 root root 187K Jun  2 02:42 libnvidia-cfg.so.418.152.00
lrwxrwxrwx 1 root root   26 Nov 29 16:29 libnvidia-ml.so.1 -> libnvidia-ml.so.418.152.00
-rw-r--r-- 1 root root 1.6M Jun  2 02:46 libnvidia-ml.so.418.152.00

By the way, don't confuse the index order with the device order on the host: even when your host has two GPUs and the NVIDIA device is the second one, the index still starts at 0, since NVIDIA_VISIBLE_DEVICES naturally sees only NVIDIA devices.
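
As an aside, if the index mapping ever gets confusing, devices can also be selected by UUID instead of index (a sketch reusing the image name from above and a UUID from the original report; UUIDs can be listed on the host with nvidia-smi -L):

docker run -ti --rm --runtime=nvidia \
		-e NVIDIA_VISIBLE_DEVICES=GPU-2da062d6-3b80-9750-0af9-85d39d0b010b \
		nvidia/libnvidia-container/debian10-amd64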

3) Here's a strange case: you may give an incomplete list for NVIDIA_DRIVER_CAPABILITIES:

docker run -ti --rm --privileged --runtime=nvidia \
		-e NVIDIA_VISIBLE_DEVICES=0 \
		-e NVIDIA_DRIVER_CAPABILITIES=video,compat32,graphics,display \
		nvidia/libnvidia-container/debian10-amd64;
me@container:/tmp/libnvidia-container# nvidia-container-cli list
nvidia-container-cli: initialization error: driver error: failed to process request

Although some of the libs got loaded, ...

me@container:/tmp/libnvidia-container# ll /usr/lib/x86_64-linux-gnu/ | grep libnv
-rw-r--r-- 1 root root  25M Jun  2 03:07 libnvidia-eglcore.so.418.152.00
-rw-r--r-- 1 root root  27M Jun  2 02:51 libnvidia-glcore.so.418.152.00
-rw-r--r-- 1 root root 653K Jun  2 02:49 libnvidia-glsi.so.418.152.00
-rw-r--r-- 1 root root  14M Jun  2 03:03 libnvidia-glvkspirv.so.418.152.00
-rw-r--r-- 1 root root  15K Jun  2 02:30 libnvidia-tls.so.418.152.00

... nvidia-container-cli still fails because I've omitted utility from the list (!)

4) With all capabilities (utility,video,compat32,graphics,display, or simply all):

docker run -ti --rm --privileged --runtime=nvidia \
		-e NVIDIA_VISIBLE_DEVICES=0 \
		-e NVIDIA_DRIVER_CAPABILITIES=utility,video,compat32,graphics,display \
		nvidia/libnvidia-container/debian10-amd64;

me@container:tmp/libnvidia-container# ll /usr/lib/x86_64-linux-gnu/ | grep libnv
lrwxrwxrwx 1 root root   27 Nov 29 16:49 libnvidia-cfg.so.1 -> libnvidia-cfg.so.418.152.00
-rw-r--r-- 1 root root 187K Jun  2 02:42 libnvidia-cfg.so.418.152.00
-rw-r--r-- 1 root root  25M Jun  2 03:07 libnvidia-eglcore.so.418.152.00
-rw-r--r-- 1 root root  27M Jun  2 02:51 libnvidia-glcore.so.418.152.00
-rw-r--r-- 1 root root 653K Jun  2 02:49 libnvidia-glsi.so.418.152.00
-rw-r--r-- 1 root root  14M Jun  2 03:03 libnvidia-glvkspirv.so.418.152.00
lrwxrwxrwx 1 root root   26 Nov 29 16:49 libnvidia-ml.so.1 -> libnvidia-ml.so.418.152.00
-rw-r--r-- 1 root root 1.6M Jun  2 02:46 libnvidia-ml.so.418.152.00
-rw-r--r-- 1 root root  15K Jun  2 02:30 libnvidia-tls.so.418.152.00

me@container:tmp/libnvidia-container# nvidia-container-cli list
/dev/nvidiactl
/dev/nvidia-uvm
/dev/nvidia-uvm-tools
/dev/nvidia-modeset
/dev/nvidia0
/usr/bin/nvidia-smi
/usr/bin/nvidia-debugdump
/usr/bin/nvidia-persistenced
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.418.152.00
/usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.418.152.00
/usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.418.152.00
/usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.418.152.00
/usr/lib/x86_64-linux-gnu/libnvidia-tls.so.418.152.00
/usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.418.152.00
/usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.418.152.00
/usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.418.152.00
/usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.418.152.00
/usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.418.152.00
/usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.418.152.00
/run/nvidia-persistenced/socket

It works as expected. Since nvidia-container-cli depends on the management library libnvidia-ml.so.1, you should consider always providing that library regardless of what is given in the NVIDIA_DRIVER_CAPABILITIES env var.

@klueska klueska mentioned this issue Dec 1, 2020
@klueska
Contributor

klueska commented Dec 1, 2020

@3ronco What you are saying is mostly correct, except that the exact error being reported by this issue is:

could not start /sbin/ldconfig: process execution failed: no such file or directory

and not:

nvidia-container-cli: error while loading shared libraries: libnvidia-container.so.1: cannot open shared object file: No such file or directory
or
nvidia-container-cli: initialization error: driver error: failed to process request

This means they are running a container that definitely triggers the nvidia-container-toolkit to be invoked properly. In fact, they are running the nvidia/cuda:11.0-base image, which definitely has sufficient values for NVIDIA_DRIVER_CAPABILITIES and NVIDIA_VISIBLE_DEVICES set in it.

Also, this issue only seems to appear on Debian 10 systems and not others.
Unfortunately, I am unable to reproduce this issue (even on debian 10), so it is hard for me to root cause it.
If anyone can give me more info on the particulars of their system so I can reproduce it, that would go a long way towards resolving this (somewhat longstanding) issue.

@3ronco

3ronco commented Dec 4, 2020

@klueska OK, I see. I used an image built from source (nvidia/libnvidia-container/debian10-amd64), which may explain the missing env var settings in my setup. But like you, I'm unable to reproduce the issue; the debug output from /var/log/nvidia-container-toolkit.log gives no error.

... nvc_ldcache.c:359] executing /sbin/ldconfig from host at /var/lib/docker/btrfs/subvolumes/c068652...

@Dikkepanda

Dikkepanda commented Dec 17, 2020

I am having similar problems. This is on a Debian system installed with OMV. When I change @/ldconfig to /ldconfig (a suggestion I often encounter), I get the following message in my logs:

/bin/sh: error while loading shared libraries: /lib/x86_64-linux-gnu/libc.so.6: cannot read file data: Operation not permitted

@rjeli

rjeli commented Feb 12, 2021

It's because, when trying to execveat the host ldconfig, security_bprm_check in the kernel returns -ENOENT. Not sure why yet.

@rjeli

rjeli commented Feb 12, 2021

It's not AppArmor; it still happens with the apparmor=0 boot param. Not sure how else to trace which LSM is denying the hook.
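
One cheap check before reaching deeper into tracing (these paths exist on stock Debian kernels with securityfs mounted):

cat /sys/kernel/security/lsm   # which LSMs are actually active
cat /proc/cmdline              # confirm the apparmor=0 parameter took effect
uname -r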

@klueska
Contributor

klueska commented Feb 12, 2021

I think the next logical step to debug this is to do the following:

  1. Move /usr/bin/nvidia-container-cli to /usr/bin/nvidia-container-cli.real
  2. Create a new script at /usr/bin/nvidia-container-cli with the following contents:
#!/usr/bin/env bash
strace -f nvidia-container-cli.real "${@}" > /tmp/nvidia-container-cli.strace 2>&1
  3. Run a container that exhibits the issue you have been seeing
  4. Inspect /tmp/nvidia-container-cli.strace to see where any errors crop up around calls out to ldconfig

This will show all calls made into Linux and if/why they failed.
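
Once the wrapper is in place, something like the following should surface the failing call (a sketch; the log path matches the wrapper above):

# Find the ldconfig execution attempt and any ENOENT around it.
grep -nE 'execveat|ldconfig|ENOENT' /tmp/nvidia-container-cli.strace | tail -n 40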

@rjeli

rjeli commented Feb 12, 2021

It's the fexecve call at https://github.com/NVIDIA/libnvidia-container/blob/bd9fc3f2b642345301cb2e23de07ec5386232317/src/nvc_ldcache.c#L388; it returns -ENOENT according to the nvidia logs. I used bpftrace to figure out where that was coming from inside the execveat syscall, and it's the LSM hook returning -ENOENT (https://elixir.bootlin.com/linux/v5.9/source/security/security.c#L840). It could be something related to AppArmor being enabled by default starting with buster, but the AppArmor audit logs are empty and it still happens with apparmor=0. Should we move this to libnvidia-container?
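
For anyone who wants to reproduce that tracing step, a rough sketch of the kind of bpftrace one-liner involved (an illustration, not the exact command used above; it assumes security_bprm_check is attachable as a kretprobe on your kernel):

# Print non-zero return codes from the LSM bprm check, with the calling process.
sudo bpftrace -e 'kretprobe:security_bprm_check /retval != 0/ { printf("%s (pid %d) -> %d\n", comm, pid, retval); }'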

@rjeli

rjeli commented Feb 12, 2021

I should try and see if it happens with the buster 4.19 kernel, since most reports seem to be from buster-backports 5.9 kernels.

@rjeli

rjeli commented Feb 13, 2021

works on buster w/ 4.19, driver 460.32:

eli@casper:~$ uname -a
Linux casper 4.19.0-13-amd64 #1 SMP Debian 4.19.160-2 (2020-11-28) x86_64 GNU/Linux
eli@casper:~$ dpkg -l | grep 'nvidia-\(docker\|driver\|container\)'
ii  libnvidia-container-tools                                   1.3.3-1                                                         amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64                                  1.3.3-1                                                         amd64        NVIDIA container runtime library
ii  nvidia-container-runtime                                    3.4.2-1                                                         amd64        NVIDIA container runtime
ii  nvidia-container-toolkit                                    1.4.2-1                                                         amd64        NVIDIA container runtime hook
ii  nvidia-docker2                                              2.5.0-1                                                         all          nvidia-docker CLI wrapper
ii  nvidia-driver                                               460.32.03-1~bpo10+1                                             amd64        NVIDIA metapackage
ii  nvidia-driver-bin                                           460.32.03-1~bpo10+1                                             amd64        NVIDIA driver support binaries
ii  nvidia-driver-libs:amd64                                    460.32.03-1~bpo10+1                                             amd64        NVIDIA metapackage (OpenGL/GLX/EGL/GLES libraries)
eli@casper:~$ nvidia-smi
Sat Feb 13 11:30:20 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  On   | 00000000:07:00.0  On |                  N/A |
|  0%   39C    P8    16W / 250W |    322MiB / 11175MiB |     27%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1667      G   /usr/lib/xorg/Xorg                211MiB |
|    0   N/A  N/A      4227      G   ...AAAAAAAAA= --shared-files       43MiB |
+-----------------------------------------------------------------------------+
eli@casper:~$ docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
Sat Feb 13 19:30:24 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  On   | 00000000:07:00.0  On |                  N/A |
|  0%   38C    P8    15W / 250W |    322MiB / 11175MiB |     28%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

@klueska
Contributor

klueska commented Feb 18, 2021

Hi @rjeli ,

Thanks for taking the time to dig into this. So it seems that the reason I was never able to reproduce this issue is that I was using the standard buster distribution (the one we support) and not buster-backports (which we don't).

Regarding your comment on:

should we move this to libnvidia-container?

Whether it's reported here or there doesn't really matter -- the same set of people look at the issues.

@rjeli

rjeli commented Feb 22, 2021

Cool. I'm going to stick with the buster kernel, so I don't have a reason to look into it further, but it might come up again for people using newer kernels over time, especially since backports hosts newer NVIDIA drivers and people might do a full upgrade to backports packages without realizing it. 👍

@brycelelbach
Author

I've filed an internal NVIDIA bug (3264046) to track this, as it's impacting a lot of users, as well as my development teams at NVIDIA.

@cuyax1975

Has anyone successfully gotten this working with Emby in a docker environment? I have a P400 that I just bought and have just about given up after over a week of trying everything I can based on these threads.

@rjeli

rjeli commented Mar 20, 2021

@cuyax1975 No idea about Emby, but I had success uninstalling the backports kernel and uninstalling/reinstalling the backports nvidia-dkms so that it builds against the 4.19 kernel (rough steps sketched below). CUDA works.
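
Roughly, the steps look like this (a sketch; the backports kernel package name below is only an example, check what is actually installed with dpkg -l | grep bpo):

dpkg -l | grep linux-image
sudo apt install linux-image-amd64                # make sure the stock buster kernel is installed
sudo apt remove linux-image-5.9.0-0.bpo.5-amd64   # example name, remove your installed backports kernel
sudo apt install --reinstall nvidia-kernel-dkms   # rebuild the module against the remaining 4.19 kernel
sudo reboot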

@cuyax1975

Thanks. I am a total newb on Linux. What would be good to google to figure out how to step through what you described?

@bingzhangdai

@cuyax1975 Maybe you can try this workaround: #1163 (comment). I do that for my Jellyfin docker server.

@smthpickboy

Any update on this?

@danijar

danijar commented Jun 18, 2021

Any update on this?

@danijar

danijar commented Jun 19, 2021

A workaround that worked in my case was to both replace @/sbin/ldconfig with /sbin/ldconfig and then run the container with the --privileged flag.
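
Concretely (a sketch, using the same image as earlier in this thread), after changing the ldconfig entry in /etc/nvidia-container-runtime/config.toml:

docker run --rm --privileged --gpus all nvidia/cuda:11.0-base nvidia-smi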

@ghost

ghost commented Jul 31, 2021

A workaround that worked in my case was to both replace @/sbin/ldconfig with /sbin/ldconfig and then run the container with the --privileged flag.

Does this work if the container doesn't contain ldconfig locally? If not, can I load "ldconfig" into the container manually to make it work? Running Debian with the backported kernel:
5.10.0-0.bpo.4-amd64

@danijar

danijar commented Jul 31, 2021

I don't know.

@peteflorence

I just want to confirm that adding ldconfig, as suggested by the original poster, does work for me:

sudo docker run --rm --gpus all nvidia/cuda:11.0-base bash -c "ldconfig && nvidia-smi"

Meanwhile this doesn't work:

sudo docker run --rm --gpus all nvidia/cuda:11.0-base bash -c "nvidia-smi"

bwLehrpool-API pushed a commit to bwLehrpool/mltk that referenced this issue Dec 17, 2021
@gh2o

gh2o commented Jan 3, 2022

I ran into the same issue. The -ENOENT appears to be returned by the tomoyo LSM; tomoyo_find_next_domain() is returning -ENOENT for some reason. Disabling tomoyo fixes the issue. I disabled tomoyo by adding lsm=lockdown,capability,yama to my kernel command line.
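
On a stock Debian install, that kernel parameter can be added roughly like this (a sketch; keep whatever other parameters you already have, and match the lsm= list to what your kernel enables, minus tomoyo):

# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet lsm=lockdown,capability,yama"
sudo update-grub
sudo reboot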

@mikafouenski

mikafouenski commented Feb 15, 2022

Hello, 👋

I'm sorry to add yet another "me too" entry, but this problem is starting to become an issue for us.

Maybe I was misled by #1537 (comment), but @klueska seemed to imply that this was working with Debian 11. 🤔

I'm still facing this issue...

Details on my setup
$ sudo apt list --installed | grep container                           
containerd.io/bullseye,now 1.4.12-1 amd64 [installed]
libnvidia-container-tools/buster,now 1.8.1-1 amd64 [installed]
libnvidia-container1/buster,now 1.8.1-1 amd64 [installed]
nvidia-container-toolkit/buster,now 1.8.1-1 amd64 [installed,automatic]
$ cat /etc/os-release 
PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
NAME="Debian GNU/Linux"
VERSION_ID="11"
VERSION="11 (bullseye)"
VERSION_CODENAME=bullseye
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
$ cat /etc/apt/sources.list.d/nvidia-docker.list     
deb https://nvidia.github.io/libnvidia-container/stable/debian10/$(ARCH) /
deb https://nvidia.github.io/libnvidia-container/experimental/debian10/$(ARCH) /
deb https://nvidia.github.io/nvidia-container-runtime/stable/debian10/$(ARCH) /
deb https://nvidia.github.io/nvidia-container-runtime/experimental/debian10/$(ARCH) /
deb https://nvidia.github.io/nvidia-docker/debian10/$(ARCH) /
$ cat /etc/nvidia-container-runtime/config.toml 
disable-require = false
#swarm-resource = "DOCKER_RESOURCE_GPU"
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
#accept-nvidia-visible-devices-as-volume-mounts = false

[nvidia-container-cli]
#root = "/run/nvidia/driver"
#path = "/usr/bin/nvidia-container-cli"
environment = []
#debug = "/var/log/nvidia-container-toolkit.log"
#ldcache = "/etc/ld.so.cache"
load-kmods = true
#no-cgroups = false
#user = "root:video"
ldconfig = "@/sbin/ldconfig"

[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"
$ nvidia-smi 
Tue Feb 15 15:58:55 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GT 720      Off  | 00000000:01:00.0 N/A |                  N/A |
| 30%   30C    P0    N/A /  N/A |      0MiB /   980MiB |     N/A      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
$ docker run --rm --runtime nvidia nvidia/cuda:11.0-base bash -c "nvidia-smi; echo ""; ldconfig && nvidia-smi" 
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.

Tue Feb 15 15:01:06 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GT 720      Off  | 00000000:01:00.0 N/A |                  N/A |
| 30%   30C    P0    N/A /  N/A |      0MiB /   980MiB |     N/A      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
$ nvidia-container-cli -k -d /dev/tty info 2>&1

-- WARNING, the following logs are for debugging purposes only --

I0215 15:09:07.702294 2695077 nvc.c:376] initializing library context (version=1.8.1, build=abd4e14d8cb923e2a70b7dcfee55fbc16bffa353)
I0215 15:09:07.702411 2695077 nvc.c:350] using root /
I0215 15:09:07.702426 2695077 nvc.c:351] using ldcache /etc/ld.so.cache
I0215 15:09:07.702443 2695077 nvc.c:352] using unprivileged user 1000:1000
I0215 15:09:07.702499 2695077 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0215 15:09:07.702808 2695077 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment
W0215 15:09:07.704132 2695079 nvc.c:273] failed to set inheritable capabilities
W0215 15:09:07.704233 2695079 nvc.c:274] skipping kernel modules load due to failure
I0215 15:09:07.704873 2695080 rpc.c:71] starting driver rpc service
I0215 15:09:07.861413 2695086 rpc.c:71] starting nvcgo rpc service
I0215 15:09:07.861949 2695077 nvc_info.c:759] requesting driver information with ''
I0215 15:09:07.862871 2695077 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.460.91.03
I0215 15:09:07.862923 2695077 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ptxjitcompiler.so.460.91.03
I0215 15:09:07.862964 2695077 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so.460.91.03
I0215 15:09:07.863012 2695077 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.460.91.03
I0215 15:09:07.863035 2695077 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.460.91.03
I0215 15:09:07.863055 2695077 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.460.91.03
I0215 15:09:07.863154 2695077 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libcuda.so.460.91.03
I0215 15:09:07.863234 2695077 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libGLX_nvidia.so.460.91.03
I0215 15:09:07.863273 2695077 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libEGL_nvidia.so.460.91.03
W0215 15:09:07.863284 2695077 nvc_info.c:398] missing library libnvidia-cfg.so
W0215 15:09:07.863287 2695077 nvc_info.c:398] missing library libnvidia-nscq.so
W0215 15:09:07.863290 2695077 nvc_info.c:398] missing library libnvidia-opencl.so
W0215 15:09:07.863292 2695077 nvc_info.c:398] missing library libnvidia-fatbinaryloader.so
W0215 15:09:07.863294 2695077 nvc_info.c:398] missing library libnvidia-allocator.so
W0215 15:09:07.863296 2695077 nvc_info.c:398] missing library libnvidia-compiler.so
W0215 15:09:07.863298 2695077 nvc_info.c:398] missing library libnvidia-pkcs11.so
W0215 15:09:07.863300 2695077 nvc_info.c:398] missing library libnvidia-ngx.so
W0215 15:09:07.863302 2695077 nvc_info.c:398] missing library libvdpau_nvidia.so
W0215 15:09:07.863304 2695077 nvc_info.c:398] missing library libnvidia-encode.so
W0215 15:09:07.863306 2695077 nvc_info.c:398] missing library libnvidia-opticalflow.so
W0215 15:09:07.863308 2695077 nvc_info.c:398] missing library libnvcuvid.so
W0215 15:09:07.863311 2695077 nvc_info.c:398] missing library libnvidia-fbc.so
W0215 15:09:07.863313 2695077 nvc_info.c:398] missing library libnvidia-ifr.so
W0215 15:09:07.863315 2695077 nvc_info.c:398] missing library libnvidia-rtcore.so
W0215 15:09:07.863317 2695077 nvc_info.c:398] missing library libnvoptix.so
W0215 15:09:07.863319 2695077 nvc_info.c:398] missing library libGLESv2_nvidia.so
W0215 15:09:07.863321 2695077 nvc_info.c:398] missing library libGLESv1_CM_nvidia.so
W0215 15:09:07.863323 2695077 nvc_info.c:398] missing library libnvidia-glvkspirv.so
W0215 15:09:07.863327 2695077 nvc_info.c:398] missing library libnvidia-cbl.so
W0215 15:09:07.863329 2695077 nvc_info.c:402] missing compat32 library libnvidia-ml.so
W0215 15:09:07.863333 2695077 nvc_info.c:402] missing compat32 library libnvidia-cfg.so
W0215 15:09:07.863336 2695077 nvc_info.c:402] missing compat32 library libnvidia-nscq.so
W0215 15:09:07.863340 2695077 nvc_info.c:402] missing compat32 library libcuda.so
W0215 15:09:07.863343 2695077 nvc_info.c:402] missing compat32 library libnvidia-opencl.so
W0215 15:09:07.863347 2695077 nvc_info.c:402] missing compat32 library libnvidia-ptxjitcompiler.so
W0215 15:09:07.863350 2695077 nvc_info.c:402] missing compat32 library libnvidia-fatbinaryloader.so
W0215 15:09:07.863354 2695077 nvc_info.c:402] missing compat32 library libnvidia-allocator.so
W0215 15:09:07.863357 2695077 nvc_info.c:402] missing compat32 library libnvidia-compiler.so
W0215 15:09:07.863360 2695077 nvc_info.c:402] missing compat32 library libnvidia-pkcs11.so
W0215 15:09:07.863363 2695077 nvc_info.c:402] missing compat32 library libnvidia-ngx.so
W0215 15:09:07.863367 2695077 nvc_info.c:402] missing compat32 library libvdpau_nvidia.so
W0215 15:09:07.863370 2695077 nvc_info.c:402] missing compat32 library libnvidia-encode.so
W0215 15:09:07.863374 2695077 nvc_info.c:402] missing compat32 library libnvidia-opticalflow.so
W0215 15:09:07.863378 2695077 nvc_info.c:402] missing compat32 library libnvcuvid.so
W0215 15:09:07.863381 2695077 nvc_info.c:402] missing compat32 library libnvidia-eglcore.so
W0215 15:09:07.863384 2695077 nvc_info.c:402] missing compat32 library libnvidia-glcore.so
W0215 15:09:07.863388 2695077 nvc_info.c:402] missing compat32 library libnvidia-tls.so
W0215 15:09:07.863392 2695077 nvc_info.c:402] missing compat32 library libnvidia-glsi.so
W0215 15:09:07.863396 2695077 nvc_info.c:402] missing compat32 library libnvidia-fbc.so
W0215 15:09:07.863399 2695077 nvc_info.c:402] missing compat32 library libnvidia-ifr.so
W0215 15:09:07.863402 2695077 nvc_info.c:402] missing compat32 library libnvidia-rtcore.so
W0215 15:09:07.863404 2695077 nvc_info.c:402] missing compat32 library libnvoptix.so
W0215 15:09:07.863408 2695077 nvc_info.c:402] missing compat32 library libGLX_nvidia.so
W0215 15:09:07.863412 2695077 nvc_info.c:402] missing compat32 library libEGL_nvidia.so
W0215 15:09:07.863415 2695077 nvc_info.c:402] missing compat32 library libGLESv2_nvidia.so
W0215 15:09:07.863418 2695077 nvc_info.c:402] missing compat32 library libGLESv1_CM_nvidia.so
W0215 15:09:07.863421 2695077 nvc_info.c:402] missing compat32 library libnvidia-glvkspirv.so
W0215 15:09:07.863425 2695077 nvc_info.c:402] missing compat32 library libnvidia-cbl.so
I0215 15:09:07.863506 2695077 nvc_info.c:298] selecting /usr/lib/nvidia/current/nvidia-smi
I0215 15:09:07.863532 2695077 nvc_info.c:298] selecting /usr/lib/nvidia/current/nvidia-debugdump
W0215 15:09:07.863622 2695077 nvc_info.c:424] missing binary nvidia-persistenced
W0215 15:09:07.863626 2695077 nvc_info.c:424] missing binary nv-fabricmanager
W0215 15:09:07.863629 2695077 nvc_info.c:424] missing binary nvidia-cuda-mps-control
W0215 15:09:07.863632 2695077 nvc_info.c:424] missing binary nvidia-cuda-mps-server
W0215 15:09:07.863647 2695077 nvc_info.c:348] missing firmware path /lib/firmware/nvidia/460.91.03/gsp.bin
I0215 15:09:07.863665 2695077 nvc_info.c:522] listing device /dev/nvidiactl
I0215 15:09:07.863668 2695077 nvc_info.c:522] listing device /dev/nvidia-uvm
I0215 15:09:07.863670 2695077 nvc_info.c:522] listing device /dev/nvidia-uvm-tools
I0215 15:09:07.863672 2695077 nvc_info.c:522] listing device /dev/nvidia-modeset
W0215 15:09:07.863687 2695077 nvc_info.c:348] missing ipc path /var/run/nvidia-persistenced/socket
W0215 15:09:07.863701 2695077 nvc_info.c:348] missing ipc path /var/run/nvidia-fabricmanager/socket
W0215 15:09:07.863712 2695077 nvc_info.c:348] missing ipc path /tmp/nvidia-mps
I0215 15:09:07.863716 2695077 nvc_info.c:815] requesting device information with ''
I0215 15:09:07.869784 2695077 nvc_info.c:706] listing device /dev/nvidia0 (GPU-80fc26fb-9db1-5b79-2372-23dfaf7cc99c at 00000000:01:00.0)
I0215 15:09:07.869800 2695077 nvc.c:430] shutting down library context
I0215 15:09:07.869859 2695086 rpc.c:95] terminating nvcgo rpc service
I0215 15:09:07.870174 2695077 rpc.c:135] nvcgo rpc service terminated successfully
I0215 15:09:07.891932 2695080 rpc.c:95] terminating driver rpc service
I0215 15:09:07.892048 2695077 rpc.c:135] driver rpc service terminated successfully
NVRM version:   460.91.03
CUDA version:   11.2

Device Index:   0
Device Minor:   0
Model:          GeForce GT 720
Brand:          GeForce
GPU UUID:       GPU-80fc26fb-9db1-5b79-2372-23dfaf7cc99c
Bus Location:   00000000:01:00.0
Architecture:   3.5
$ sudo which ldconfig
/usr/sbin/ldconfig
$ l /sbin 
lrwxrwxrwx 1 root root 8 Sep 28 16:08 /sbin -> usr/sbin
$ sudo apt list --installed | grep libc-bin
libc-bin/stable,now 2.31-13+deb11u2 amd64 [installed]
$ dpkg -S /sbin/ldconfig                   
libc-bin: /sbin/ldconfig

The workaround mentioned everywhere of editing /etc/nvidia-container-runtime/config.toml and removing the @ from the ldconfig path works to some extent. But I don't want to have to add ldconfig to all images using the nvidia runtime.

I don't have an ldconfig.real on my system, and I'm not sure where to find it.

Is there a clear status on this issue? Does a fix already exist?
(I can provide additional debug data or a reproduction case if needed.)

Thanks for taking the time, 👌
Mika

Edit, add ref: NVIDIA/nvidia-container-toolkit#299 , #1163 , #1537

@klueska
Contributor

klueska commented Feb 15, 2022

This issue is reported often, but as I mentioned here #1399 (comment), I have never been able to reproduce this bug myself. Until I am able to reproduce it, I will not be able to provide a fix.

Even just now I spun up two fresh VMs on AWS -- one with Debian 10 and one with Debian 11 -- and was not able to reproduce the error. I followed the same procedure outlined in #1399 (comment) (with the only difference being that I pulled down a newer driver).

Is there some hint you can give me on how to get a Debian 10 or Debian 11 system up and running (in a supported state) that exhibits this bug?

@mikafouenski

mikafouenski commented Feb 16, 2022

Hello,

I don't have the ability to spawn a GPU VM; unfortunately I only have the bare-metal Debian 11 machine on which I'm testing...
I've tried updating systemd and the Linux kernel to the latest versions, and even tried rolling back to cgroup v1, but the issue persists in all these cases...

Here is more info from my debugging:
$ sudo execsnoop-bpfcc
docker           34242  21523    0 /usr/bin/docker run -it --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
systemd-sysctl   34260  34252    0 /lib/systemd/systemd-sysctl --prefix=/net/ipv4/conf/vethea26104 --prefix=/net/ipv4/neigh/vethea26104 --prefix=/net/ipv6/conf/vethea26104 --prefix=/net/ipv6/neigh/vethea26104
systemd-sysctl   34261  34253    0 /lib/systemd/systemd-sysctl --prefix=/net/ipv4/conf/vethaf1566b --prefix=/net/ipv4/neigh/vethaf1566b --prefix=/net/ipv6/conf/vethaf1566b --prefix=/net/ipv6/neigh/vethaf1566b
containerd-shim  34264  1102     0 /usr/bin/containerd-shim-runc-v2 -namespace moby -address /run/containerd/containerd.sock -publish-binary /usr/bin/containerd -id ac03866ef7f4d6e96a0a9d29ed400eaf92558c3fe72af82c44aad7fbfe331f98 start
containerd-shim  34272  34264    0 /usr/bin/containerd-shim-runc-v2 -namespace moby -id ac03866ef7f4d6e96a0a9d29ed400eaf92558c3fe72af82c44aad7fbfe331f98 -address /run/containerd/containerd.sock
runc             34281  34272    0 /usr/bin/runc --root /var/run/docker/runtime-runc/moby --log /run/containerd/io.containerd.runtime.v2.task/moby/ac03866ef7f4d6e96a0a9d29ed400eaf92558c3fe72af82c44aad7fbfe331f98/log.json --log-format json --systemd-cgroup create --bundle /run/containerd/io.containerd.runtime.v2.task/moby/ac03866ef7f4d6e96a0a9d29ed400eaf92558c3fe72af82c44aad7fbfe331f98 --pid-file /run/containerd/io.containerd.runtime.v2.task/moby/ac03866ef7f4d6e96a0a9d29ed400eaf92558c3fe72af82c44aad7fbfe331f98/init.pid --console-socket /tmp/pty885547821/pty.sock ac03866ef7f4d6e96a0a9d29ed400eaf92558c3fe72af82c44aad7fbfe331f98
exe              34289  34281    0 /proc/self/exe init
nvidia-containe  34298  34281    0 /usr/bin/nvidia-container-runtime-hook prestart
nvidia-containe  34298  34281    0 /usr/bin/nvidia-container-cli --load-kmods --debug=/var/log/nvidia-container-toolkit.log configure --ldconfig=@/usr/sbin/ldconfig --device=all --compute --utility --require=cuda>=11.0 brand=tesla,driver>=418,driver<419 brand=tesla,driver>=440,driver<441 --pid=34291 /mnt/data/docker/overlay2/4f6f7535fff0b84c73cff3f7ee2cb8f581dcb69c871b23266e78559ccf16a8ae/merged
exe              34317  34281    0 /proc/29870/exe -exec-root=/var/run/docker ac03866ef7f4d6e96a0a9d29ed400eaf92558c3fe72af82c44aad7fbfe331f98 15b15f95d697
exe              34324  29870    0 /proc/self/exe /var/run/docker/netns/b7dfbae12bcc all false
runc             34335  34272    0 /usr/bin/runc --root /var/run/docker/runtime-runc/moby --log /run/containerd/io.containerd.runtime.v2.task/moby/ac03866ef7f4d6e96a0a9d29ed400eaf92558c3fe72af82c44aad7fbfe331f98/log.json --log-format json --systemd-cgroup start ac03866ef7f4d6e96a0a9d29ed400eaf92558c3fe72af82c44aad7fbfe331f98
nvidia-smi       34291  34272    0 /usr/bin/nvidia-smi
runc             34341  34272    0 /usr/bin/runc --root /var/run/docker/runtime-runc/moby --log /run/containerd/io.containerd.runtime.v2.task/moby/ac03866ef7f4d6e96a0a9d29ed400eaf92558c3fe72af82c44aad7fbfe331f98/log.json --log-format json --systemd-cgroup delete ac03866ef7f4d6e96a0a9d29ed400eaf92558c3fe72af82c44aad7fbfe331f98
ifupdown-hotplu  34349  34253    0 /lib/udev/ifupdown-hotplug
ifquery          34351  34350    0 /sbin/ifquery --allow hotplug -l vethea26104
systemd-sysctl   34352  34253    0 /lib/systemd/systemd-sysctl --prefix=/net/ipv4/conf/vethea26104 --prefix=/net/ipv4/neigh/vethea26104 --prefix=/net/ipv6/conf/vethea26104 --prefix=/net/ipv6/neigh/vethea26104
$ sudo opensnoop-bpfcc | grep -i ldconfig
34082  nvidia-containe     8   0 /usr/sbin/ldconfig                                                                                             
34100  nvc:[ldconfig]      9   0 /proc/34075/ns/mnt                                                                                             
34100  nvc:[ldconfig]      9   0 /proc/sys/kernel/cap_last_cap                                                                         
34100  nvc:[ldconfig]      9   0 /                                                                                                     
34100  nvc:[ldconfig]     10   0 /mnt/data/docker/overlay2/808b3ab7d292be666290befdb0622aecd54b74e25f5350ba83c394107e1ef822/merged                                                                                                                                             
34100  nvc:[ldconfig]     11   0 /proc/self/setgroups                                                                                  
$ less /var/log/nvidia-container-toolkit.log
-- WARNING, the following logs are for debugging purposes only --

I0216 16:10:49.804494 34298 nvc.c:376] initializing library context (version=1.8.1, build=abd4e14d8cb923e2a70b7dcfee55fbc16bffa353)
I0216 16:10:49.804640 34298 nvc.c:350] using root /
I0216 16:10:49.804663 34298 nvc.c:351] using ldcache /etc/ld.so.cache
I0216 16:10:49.804681 34298 nvc.c:352] using unprivileged user 65534:65534
I0216 16:10:49.804730 34298 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0216 16:10:49.805036 34298 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment
I0216 16:10:49.806385 34304 nvc.c:278] loading kernel module nvidia
I0216 16:10:49.806920 34304 nvc.c:282] running mknod for /dev/nvidiactl
I0216 16:10:49.807019 34304 nvc.c:286] running mknod for /dev/nvidia0
I0216 16:10:49.807091 34304 nvc.c:290] running mknod for all nvcaps in /dev/nvidia-caps
I0216 16:10:49.813686 34304 nvc.c:218] running mknod for /dev/nvidia-caps/nvidia-cap1 from /proc/driver/nvidia/capabilities/mig/config
I0216 16:10:49.813739 34304 nvc.c:218] running mknod for /dev/nvidia-caps/nvidia-cap2 from /proc/driver/nvidia/capabilities/mig/monitor
I0216 16:10:49.814850 34304 nvc.c:296] loading kernel module nvidia_uvm
I0216 16:10:49.814877 34304 nvc.c:300] running mknod for /dev/nvidia-uvm
I0216 16:10:49.814908 34304 nvc.c:305] loading kernel module nvidia_modeset
I0216 16:10:49.814976 34304 nvc.c:309] running mknod for /dev/nvidia-modeset
I0216 16:10:49.815155 34305 rpc.c:71] starting driver rpc service
I0216 16:10:49.967757 34309 rpc.c:71] starting nvcgo rpc service
I0216 16:10:49.968383 34298 nvc_container.c:240] configuring container with 'compute utility supervised'
I0216 16:10:49.968567 34298 nvc_container.c:88] selecting /mnt/data/docker/overlay2/4f6f7535fff0b84c73cff3f7ee2cb8f581dcb69c871b23266e78559ccf16a8ae/merged/usr/local/cuda-11.0/compat/libcuda.so.450.51.06
I0216 16:10:49.968621 34298 nvc_container.c:88] selecting /mnt/data/docker/overlay2/4f6f7535fff0b84c73cff3f7ee2cb8f581dcb69c871b23266e78559ccf16a8ae/merged/usr/local/cuda-11.0/compat/libnvidia-ptxjitcompiler.so.450.51.06
I0216 16:10:49.969680 34298 nvc_container.c:262] setting pid to 34291
I0216 16:10:49.969694 34298 nvc_container.c:263] setting rootfs to /mnt/data/docker/overlay2/4f6f7535fff0b84c73cff3f7ee2cb8f581dcb69c871b23266e78559ccf16a8ae/merged
I0216 16:10:49.969700 34298 nvc_container.c:264] setting owner to 0:0
I0216 16:10:49.969705 34298 nvc_container.c:265] setting bins directory to /usr/bin
I0216 16:10:49.969710 34298 nvc_container.c:266] setting libs directory to /usr/lib/x86_64-linux-gnu
I0216 16:10:49.969715 34298 nvc_container.c:267] setting libs32 directory to /usr/lib/i386-linux-gnu
I0216 16:10:49.969720 34298 nvc_container.c:268] setting cudart directory to /usr/local/cuda
I0216 16:10:49.969724 34298 nvc_container.c:269] setting ldconfig to @/usr/sbin/ldconfig (host relative)
I0216 16:10:49.969729 34298 nvc_container.c:270] setting mount namespace to /proc/34291/ns/mnt
I0216 16:10:49.969742 34298 nvc_container.c:272] detected cgroupv2
I0216 16:10:49.969747 34298 nvc_container.c:273] setting devices cgroup to /sys/fs/cgroup/system.slice/docker-ac03866ef7f4d6e96a0a9d29ed400eaf92558c3fe72af82c44aad7fbfe331f98.scope
I0216 16:10:49.969755 34298 nvc_info.c:759] requesting driver information with ''
I0216 16:10:49.970765 34298 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.460.91.03
I0216 16:10:49.970835 34298 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ptxjitcompiler.so.460.91.03
I0216 16:10:49.970894 34298 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so.460.91.03
I0216 16:10:49.970962 34298 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.460.91.03
I0216 16:10:49.970993 34298 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.460.91.03
I0216 16:10:49.971025 34298 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.460.91.03
I0216 16:10:49.971155 34298 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libcuda.so.460.91.03
I0216 16:10:49.971268 34298 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libGLX_nvidia.so.460.91.03
I0216 16:10:49.971323 34298 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libEGL_nvidia.so.460.91.03
W0216 16:10:49.971355 34298 nvc_info.c:398] missing library libnvidia-cfg.so
W0216 16:10:49.971361 34298 nvc_info.c:398] missing library libnvidia-nscq.so
W0216 16:10:49.971367 34298 nvc_info.c:398] missing library libnvidia-opencl.so
W0216 16:10:49.971372 34298 nvc_info.c:398] missing library libnvidia-fatbinaryloader.so
W0216 16:10:49.971377 34298 nvc_info.c:398] missing library libnvidia-allocator.so
W0216 16:10:49.971383 34298 nvc_info.c:398] missing library libnvidia-compiler.so
W0216 16:10:49.971388 34298 nvc_info.c:398] missing library libnvidia-pkcs11.so
W0216 16:10:49.971393 34298 nvc_info.c:398] missing library libnvidia-ngx.so
W0216 16:10:49.971398 34298 nvc_info.c:398] missing library libvdpau_nvidia.so
W0216 16:10:49.971404 34298 nvc_info.c:398] missing library libnvidia-encode.so
W0216 16:10:49.971409 34298 nvc_info.c:398] missing library libnvidia-opticalflow.so
W0216 16:10:49.971414 34298 nvc_info.c:398] missing library libnvcuvid.so
W0216 16:10:49.971420 34298 nvc_info.c:398] missing library libnvidia-fbc.so
W0216 16:10:49.971425 34298 nvc_info.c:398] missing library libnvidia-ifr.so
W0216 16:10:49.971430 34298 nvc_info.c:398] missing library libnvidia-rtcore.so
W0216 16:10:49.971436 34298 nvc_info.c:398] missing library libnvoptix.so
W0216 16:10:49.971441 34298 nvc_info.c:398] missing library libGLESv2_nvidia.so
W0216 16:10:49.971446 34298 nvc_info.c:398] missing library libGLESv1_CM_nvidia.so
W0216 16:10:49.971452 34298 nvc_info.c:398] missing library libnvidia-glvkspirv.so
W0216 16:10:49.971457 34298 nvc_info.c:398] missing library libnvidia-cbl.so
W0216 16:10:49.971462 34298 nvc_info.c:402] missing compat32 library libnvidia-ml.so
W0216 16:10:49.971468 34298 nvc_info.c:402] missing compat32 library libnvidia-cfg.so
W0216 16:10:49.971473 34298 nvc_info.c:402] missing compat32 library libnvidia-nscq.so
W0216 16:10:49.971478 34298 nvc_info.c:402] missing compat32 library libcuda.so
W0216 16:10:49.971483 34298 nvc_info.c:402] missing compat32 library libnvidia-opencl.so
W0216 16:10:49.971489 34298 nvc_info.c:402] missing compat32 library libnvidia-ptxjitcompiler.so
W0216 16:10:49.971494 34298 nvc_info.c:402] missing compat32 library libnvidia-fatbinaryloader.so
W0216 16:10:49.971499 34298 nvc_info.c:402] missing compat32 library libnvidia-allocator.so
W0216 16:10:49.971505 34298 nvc_info.c:402] missing compat32 library libnvidia-compiler.so
W0216 16:10:49.971510 34298 nvc_info.c:402] missing compat32 library libnvidia-pkcs11.so
W0216 16:10:49.971515 34298 nvc_info.c:402] missing compat32 library libnvidia-ngx.so
W0216 16:10:49.971521 34298 nvc_info.c:402] missing compat32 library libvdpau_nvidia.so
W0216 16:10:49.971526 34298 nvc_info.c:402] missing compat32 library libnvidia-encode.so
W0216 16:10:49.971536 34298 nvc_info.c:402] missing compat32 library libnvidia-opticalflow.so
W0216 16:10:49.971542 34298 nvc_info.c:402] missing compat32 library libnvcuvid.so
W0216 16:10:49.971547 34298 nvc_info.c:402] missing compat32 library libnvidia-eglcore.so
W0216 16:10:49.971553 34298 nvc_info.c:402] missing compat32 library libnvidia-glcore.so
W0216 16:10:49.971558 34298 nvc_info.c:402] missing compat32 library libnvidia-tls.so
W0216 16:10:49.971563 34298 nvc_info.c:402] missing compat32 library libnvidia-glsi.so
W0216 16:10:49.971569 34298 nvc_info.c:402] missing compat32 library libnvidia-fbc.so
W0216 16:10:49.971574 34298 nvc_info.c:402] missing compat32 library libnvidia-ifr.so
W0216 16:10:49.971579 34298 nvc_info.c:402] missing compat32 library libnvidia-rtcore.so
W0216 16:10:49.971585 34298 nvc_info.c:402] missing compat32 library libnvoptix.so
W0216 16:10:49.971590 34298 nvc_info.c:402] missing compat32 library libGLX_nvidia.so
W0216 16:10:49.971595 34298 nvc_info.c:402] missing compat32 library libEGL_nvidia.so
W0216 16:10:49.971600 34298 nvc_info.c:402] missing compat32 library libGLESv2_nvidia.so
W0216 16:10:49.971606 34298 nvc_info.c:402] missing compat32 library libGLESv1_CM_nvidia.so
W0216 16:10:49.971611 34298 nvc_info.c:402] missing compat32 library libnvidia-glvkspirv.so
W0216 16:10:49.971616 34298 nvc_info.c:402] missing compat32 library libnvidia-cbl.so
I0216 16:10:49.971895 34298 nvc_info.c:298] selecting /usr/lib/nvidia/current/nvidia-smi
I0216 16:10:49.971937 34298 nvc_info.c:298] selecting /usr/lib/nvidia/current/nvidia-debugdump
W0216 16:10:49.972378 34298 nvc_info.c:424] missing binary nvidia-persistenced
W0216 16:10:49.972384 34298 nvc_info.c:424] missing binary nv-fabricmanager
W0216 16:10:49.972390 34298 nvc_info.c:424] missing binary nvidia-cuda-mps-control
W0216 16:10:49.972395 34298 nvc_info.c:424] missing binary nvidia-cuda-mps-server
W0216 16:10:49.972418 34298 nvc_info.c:348] missing firmware path /lib/firmware/nvidia/460.91.03/gsp.bin
I0216 16:10:49.972445 34298 nvc_info.c:522] listing device /dev/nvidiactl
I0216 16:10:49.972452 34298 nvc_info.c:522] listing device /dev/nvidia-uvm
I0216 16:10:49.972457 34298 nvc_info.c:522] listing device /dev/nvidia-uvm-tools
I0216 16:10:49.972462 34298 nvc_info.c:522] listing device /dev/nvidia-modeset
W0216 16:10:49.972486 34298 nvc_info.c:348] missing ipc path /var/run/nvidia-persistenced/socket
W0216 16:10:49.972509 34298 nvc_info.c:348] missing ipc path /var/run/nvidia-fabricmanager/socket
W0216 16:10:49.972525 34298 nvc_info.c:348] missing ipc path /tmp/nvidia-mps
I0216 16:10:49.972531 34298 nvc_info.c:815] requesting device information with ''
I0216 16:10:49.978491 34298 nvc_info.c:706] listing device /dev/nvidia0 (GPU-80fc26fb-9db1-5b79-2372-23dfaf7cc99c at 00000000:01:00.0)
I0216 16:10:49.978566 34298 nvc_mount.c:359] mounting tmpfs at /mnt/data/docker/overlay2/4f6f7535fff0b84c73cff3f7ee2cb8f581dcb69c871b23266e78559ccf16a8ae/merged/proc/driver/nvidia
I0216 16:10:49.978918 34298 nvc_mount.c:127] mounting /usr/lib/nvidia/current/nvidia-smi at /mnt/data/docker/overlay2/4f6f7535fff0b84c73cff3f7ee2cb8f581dcb69c871b23266e78559ccf16a8ae/merged/usr/bin/nvidia-smi
I0216 16:10:49.978980 34298 nvc_mount.c:127] mounting /usr/lib/nvidia/current/nvidia-debugdump at /mnt/data/docker/overlay2/4f6f7535fff0b84c73cff3f7ee2cb8f581dcb69c871b23266e78559ccf16a8ae/merged/usr/bin/nvidia-debugdump
I0216 16:10:49.979148 34298 nvc_mount.c:127] mounting /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so.460.91.03 at /mnt/data/docker/overlay2/4f6f7535fff0b84c73cff3f7ee2cb8f581dcb69c871b23266e78559ccf16a8ae/merged/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.460.91.03
I0216 16:10:49.979208 34298 nvc_mount.c:127] mounting /usr/lib/x86_64-linux-gnu/nvidia/current/libcuda.so.460.91.03 at /mnt/data/docker/overlay2/4f6f7535fff0b84c73cff3f7ee2cb8f581dcb69c871b23266e78559ccf16a8ae/merged/usr/lib/x86_64-linux-gnu/libcuda.so.460.91.03
I0216 16:10:49.979265 34298 nvc_mount.c:127] mounting /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ptxjitcompiler.so.460.91.03 at /mnt/data/docker/overlay2/4f6f7535fff0b84c73cff3f7ee2cb8f581dcb69c871b23266e78559ccf16a8ae/merged/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.460.91.03
I0216 16:10:49.979295 34298 nvc_mount.c:520] creating symlink /mnt/data/docker/overlay2/4f6f7535fff0b84c73cff3f7ee2cb8f581dcb69c871b23266e78559ccf16a8ae/merged/usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1
I0216 16:10:49.979412 34298 nvc_mount.c:127] mounting /mnt/data/docker/overlay2/4f6f7535fff0b84c73cff3f7ee2cb8f581dcb69c871b23266e78559ccf16a8ae/merged/usr/local/cuda-11.0/compat/libcuda.so.450.51.06 at /mnt/data/docker/overlay2/4f6f7535fff0b84c73cff3f7ee2cb8f581dcb69c871b23266e78559ccf16a8ae/merged/usr/lib/x86_64-linux-gnu/libcuda.so.450.51.06
I0216 16:10:49.979472 34298 nvc_mount.c:127] mounting /mnt/data/docker/overlay2/4f6f7535fff0b84c73cff3f7ee2cb8f581dcb69c871b23266e78559ccf16a8ae/merged/usr/local/cuda-11.0/compat/libnvidia-ptxjitcompiler.so.450.51.06 at /mnt/data/docker/overlay2/4f6f7535fff0b84c73cff3f7ee2cb8f581dcb69c871b23266e78559ccf16a8ae/merged/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.450.51.06
I0216 16:10:49.979524 34298 nvc_mount.c:223] mounting /dev/nvidiactl at /mnt/data/docker/overlay2/4f6f7535fff0b84c73cff3f7ee2cb8f581dcb69c871b23266e78559ccf16a8ae/merged/dev/nvidiactl
I0216 16:10:49.980163 34298 nvc_mount.c:223] mounting /dev/nvidia-uvm at /mnt/data/docker/overlay2/4f6f7535fff0b84c73cff3f7ee2cb8f581dcb69c871b23266e78559ccf16a8ae/merged/dev/nvidia-uvm
I0216 16:10:49.980642 34298 nvc_mount.c:223] mounting /dev/nvidia-uvm-tools at /mnt/data/docker/overlay2/4f6f7535fff0b84c73cff3f7ee2cb8f581dcb69c871b23266e78559ccf16a8ae/merged/dev/nvidia-uvm-tools
I0216 16:10:49.981110 34298 nvc_mount.c:223] mounting /dev/nvidia0 at /mnt/data/docker/overlay2/4f6f7535fff0b84c73cff3f7ee2cb8f581dcb69c871b23266e78559ccf16a8ae/merged/dev/nvidia0
I0216 16:10:49.981205 34298 nvc_mount.c:433] mounting /proc/driver/nvidia/gpus/0000:01:00.0 at /mnt/data/docker/overlay2/4f6f7535fff0b84c73cff3f7ee2cb8f581dcb69c871b23266e78559ccf16a8ae/merged/proc/driver/nvidia/gpus/0000:01:00.0
I0216 16:10:49.981773 34298 nvc_ldcache.c:362] executing /usr/sbin/ldconfig from host at /mnt/data/docker/overlay2/4f6f7535fff0b84c73cff3f7ee2cb8f581dcb69c871b23266e78559ccf16a8ae/merged
E0216 16:10:49.982844 1 nvc_ldcache.c:393] could not start /usr/sbin/ldconfig: process execution failed: no such file or directory
I0216 16:10:49.983136 34298 nvc.c:430] shutting down library context
I0216 16:10:49.983235 34309 rpc.c:95] terminating nvcgo rpc service
I0216 16:10:49.983646 34298 rpc.c:135] nvcgo rpc service terminated successfully
I0216 16:10:50.005337 34305 rpc.c:95] terminating driver rpc service
I0216 16:10:50.005550 34298 rpc.c:135] driver rpc service terminated successfully

E0216 16:10:49.982844 1 nvc_ldcache.c:393] could not start /usr/sbin/ldconfig: process execution failed: no such file or directory
https://github.com/NVIDIA/libnvidia-container/blob/master/src/nvc_ldcache.c#L393
Looks like I'm failing at one of the checks in this file.

Probably one of these, given my opensnoop output:

                if (adjust_privileges(&ctx->err, cnt->uid, cnt->gid, drop_groups) < 0)
                        goto fail;
                if (limit_syscalls(&ctx->err) < 0)
                        goto fail;

Maybe you have some more insight?

Mika

@klueska
Contributor

klueska commented Feb 17, 2022

Can you try running the following script as root (or sudo) on your machine and post the output of the TRACE_FILE here?

#!/usr/bin/env bash

TRACE_FILE="/tmp/nvidia-container-cli.strace"

mv /usr/bin/nvidia-container-cli /usr/bin/nvidia-container-cli.real
cat << EOF > /usr/bin/nvidia-container-cli
#!/usr/bin/env bash
strace -f nvidia-container-cli.real "\${@}" > ${TRACE_FILE} 2>&1
EOF
chmod a+x /usr/bin/nvidia-container-cli
docker run --rm --runtime nvidia nvidia/cuda:11.0-base nvidia-smi
mv /usr/bin/nvidia-container-cli.real /usr/bin/nvidia-container-cli

@mikafouenski

@klueska I'm not sure this was the expected result but I'm getting:

$ docker run -it --rm --gpus all nvidia/cuda:11.0-base nvidia-smi       
docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr:: unknown.

nvidia-container-cli.strace.txt

Does this give us more info?

@klueska
Contributor

klueska commented Feb 17, 2022

I would have expected the error to be the same, not something different.
Did you drop those contents in a script and execute it with sudo, or do something different?

@mikafouenski

mikafouenski commented Feb 17, 2022

My bad, I had my setup in a weird state after changing to other kernels...
Here is the real output:
nvidia-container-cli.strace.txt

line 4913:

[pid 10012] execveat(8, "", ["/sbin/ldconfig", "/usr/lib/x86_64-linux-gnu", "/usr/lib/i386-linux-gnu"], 0x7ffd3423e7f0 /* 0 vars */, AT_EMPTY_PATH) = -1 ENOENT (No such file or directory)

@klueska
Contributor

klueska commented Feb 17, 2022

Just to be sure -- you have an /sbin/ldconfig file on your system, correct?
I saw some reference to /usr/sbin/ldconfig in your previous comments. If you change the config to point to that, does that change anything?

@klueska
Contributor

klueska commented Feb 17, 2022

Likewise, on my debian11 system I have an /sbin/ldconfig.real file, which is what the default /etc/nvidia-container-runtime/config.toml should be pointing to (and it works as expected on my system).

@mikafouenski

mikafouenski commented Feb 17, 2022

Yes, I do have ldconfig. On Debian, /sbin is a symbolic link to /usr/sbin.

$ l /sbin 
lrwxrwxrwx 1 root root 8 Sep 28 16:08 /sbin -> usr/sbin
$ l /usr/sbin/ldconfig
-rwxr-xr-x 1 root root 928K Oct  2 14:47 /usr/sbin/ldconfig
$ md5sum /usr/sbin/ldconfig /sbin/ldconfig 
634a4cf316a25d01a21fba9baadcbb8c  /usr/sbin/ldconfig
634a4cf316a25d01a21fba9baadcbb8c  /sbin/ldconfig
$ dpkg -S /sbin/ldconfig                  
libc-bin: /sbin/ldconfig
$ sudo apt list --installed | grep libc-bin 
libc-bin/stable,now 2.31-13+deb11u2 amd64 [installed]

Do you know which package your ldconfig.real comes from? I do not have it...

Edit: Debian doesn't know where this file comes from: https://packages.debian.org/search?suite=bullseye&arch=any&searchon=contents&keywords=%2Fsbin%2Fldconfig.real

@klueska
Contributor

klueska commented Feb 17, 2022

Oops! I've been running my latest set of commands on an Ubuntu 21.10 system (not Debian 11).
Let me start over and get back to you. I'd really like to get to the bottom of this finally after all this time.

@mikafouenski

For reproducibility:

I've installed Debian from the official ISO.

Installed the NVIDIA driver and CUDA from the Debian repo:

apt install --no-install-recommends nvidia-driver nvidia-cuda-toolkit

Installed docker with their repo:
https://docs.docker.com/engine/install/debian/#install-using-the-repository

Installed nvidia-docker from your repo:
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#setting-up-nvidia-container-toolkit
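
Condensed, the setup is roughly (a sketch; the docker-ce and nvidia-docker2 package names are assumed from the linked install guides):

sudo apt install --no-install-recommends nvidia-driver nvidia-cuda-toolkit
sudo apt install docker-ce docker-ce-cli containerd.io   # from the Docker repository
sudo apt install nvidia-docker2                          # from the NVIDIA repository
sudo systemctl restart docker
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi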

@klueska
Contributor

klueska commented Feb 18, 2022

I was finally able to reproduce the issue myself and track down the cause to produce a proper fix:
https://gitlab.com/nvidia/container-toolkit/libnvidia-container/-/merge_requests/141

I will report back here (and to the multitude of other bugs that exist for this issue) once this is reviewed / merged / released.

@kenibrewer

@klueska Is there an estimated timeline for when the changes for this bug will be released?

@klueska
Contributor

klueska commented Mar 8, 2022

We had planned to release it 2 weeks ago, but were (and still are) blocked by some internal processes preventing us from pushing out the release. I will update here once we are unblocked.

@klueska
Contributor

klueska commented Mar 22, 2022

The newest version of nvidia-docker should resolve these issues where ldconfig was not properly setting up the library search path on Debian systems before a container launched.

Specifically this change in libnvidia-container fixes the issue and is included as part of the latest release:
https://gitlab.com/nvidia/container-toolkit/libnvidia-container/-/merge_requests/141

The latest release packages for the full nvidia-docker stack:

libnvidia-container1-1.9.0
libnvidia-container-tools-1.9.0
nvidia-container-toolkit-1.9.0
nvidia-container-runtime-3.9.0
nvidia-docker-2.10.0
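
On Debian, upgrading to these should be something like (a sketch, assuming the NVIDIA apt repository is already configured):

sudo apt update
sudo apt install --only-upgrade \
    libnvidia-container1 libnvidia-container-tools \
    nvidia-container-toolkit nvidia-container-runtime nvidia-docker2
sudo systemctl restart docker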

@mikafouenski

Hello @klueska,

I confirm this is working with the versions you quoted!

Nice work, thanks. 🎉

@elezar
Member

elezar commented May 4, 2022

@brycelelbach would you be able to confirm that the new version of the NVIDIA Container Toolkit (v1.9.0) addresses this issue for you?

@elezar
Member

elezar commented May 10, 2022

@brycelelbach I'm closing this issue as it should be resolved by 1.9.0. Please reopen if this does not address the issue.
