This repository has been archived by the owner on Jan 22, 2024. It is now read-only.

Couldn't find libnvidia-ml.so library in your system #854

Closed

alexanderfrey opened this issue Nov 2, 2018 · 42 comments

@alexanderfrey

alexanderfrey commented Nov 2, 2018


1. Issue or feature description

Missing libnvidia-ml.so and libcublas.9.so libraries in the Docker container.

My system is Ubuntu 18.10, and I tried NVIDIA drivers 390, 396, and 410.

2. Steps to reproduce the issue

docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi

NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.

This also holds for the TensorFlow Docker images. When I run the CUDA image in interactive mode and try to import tensorflow via Python, it says that libcublas.9.so is not found, although I can see it in the /usr/local/cuda/lib64 directory.

Everything works fine on the host machine, though.
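
(A quick diagnostic sketch for this symptom -- the grep pattern is an assumption, not an official check: if find locates the library but the cache count is 0, the stale /etc/ld.so.cache is the culprit rather than a missing driver:)

docker run --runtime=nvidia --rm nvidia/cuda:9.0-base bash -c \
  "find / -name 'libnvidia-ml.so*' 2>/dev/null; ldconfig -p | grep -c nvidia"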

3. Information to attach (optional if deemed irrelevant)

  • Kernel version from uname -a
Linux box 4.18.0-10-generic #11-Ubuntu SMP Thu Oct 11 15:13:55 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
  • Any relevant kernel output lines from dmesg
  • Driver information from nvidia-smi -a
==============NVSMI LOG==============

Timestamp                           : Fri Nov  2 11:09:45 2018
Driver Version                      : 410.73
CUDA Version                        : 10.0

Attached GPUs                       : 1
GPU 00000000:65:00.0
    Product Name                    : GeForce GTX 1080 Ti
    Product Brand                   : GeForce
    Display Mode                    : Enabled
    Display Active                  : Enabled
    Persistence Mode                : Disabled
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 4000
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : N/A
    GPU UUID                        : GPU-14bfddbd-9230-c05e-fa52-d468af601fc4
    Minor Number                    : 0
    VBIOS Version                   : 86.02.39.00.2E
    MultiGPU Board                  : No
    Board ID                        : 0x6500
    GPU Part Number                 : N/A
    Inforom Version
        Image Version               : G001.0000.01.04
        OEM Object                  : 1.1
        ECC Object                  : N/A
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    GPU Virtualization Mode
        Virtualization mode         : None
    IBMNPU
        Relaxed Ordering Mode       : N/A
    PCI
        Bus                         : 0x65
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x1B0610DE
        Bus Id                      : 00000000:65:00.0
        Sub System Id               : 0x147019DA
        GPU Link Info
            PCIe Generation
                Max                 : 3
                Current             : 3
            Link Width
                Max                 : 16x
                Current             : 16x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays since reset         : 0
        Tx Throughput               : 3000 KB/s
        Rx Throughput               : 2000 KB/s
    Fan Speed                       : 0 %
    Performance State               : P0
    Clocks Throttle Reasons
        Idle                        : Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
            HW Thermal Slowdown     : Not Active
            HW Power Brake Slowdown : Not Active
        Sync Boost                  : Not Active
        SW Thermal Slowdown         : Not Active
        Display Clock Setting       : Not Active
    FB Memory Usage
        Total                       : 11177 MiB
        Used                        : 751 MiB
        Free                        : 10426 MiB
    BAR1 Memory Usage
        Total                       : 256 MiB
        Used                        : 6 MiB
        Free                        : 250 MiB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 3 %
        Memory                      : 1 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Encoder Stats
        Active Sessions             : 0
        Average FPS                 : 0
        Average Latency             : 0
    FBC Stats
        Active Sessions             : 0
        Average FPS                 : 0
        Average Latency             : 0
    Ecc Mode
        Current                     : N/A
        Pending                     : N/A
    ECC Errors
        Volatile
            Single Bit            
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : N/A
            Double Bit            
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : N/A
        Aggregate
            Single Bit            
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : N/A
            Double Bit            
                Device Memory       : N/A
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Texture Shared      : N/A
                CBU                 : N/A
                Total               : N/A
    Retired Pages
        Single Bit ECC              : N/A
        Double Bit ECC              : N/A
        Pending                     : N/A
    Temperature
        GPU Current Temp            : 35 C
        GPU Shutdown Temp           : 96 C
        GPU Slowdown Temp           : 93 C
        GPU Max Operating Temp      : N/A
        Memory Current Temp         : N/A
        Memory Max Operating Temp   : N/A
    Power Readings
        Power Management            : Supported
        Power Draw                  : 60.67 W
        Power Limit                 : 250.00 W
        Default Power Limit         : 250.00 W
        Enforced Power Limit        : 250.00 W
        Min Power Limit             : 125.00 W
        Max Power Limit             : 300.00 W
    Clocks
        Graphics                    : 1480 MHz
        SM                          : 1480 MHz
        Memory                      : 5508 MHz
        Video                       : 1265 MHz
    Applications Clocks
        Graphics                    : N/A
        Memory                      : N/A
    Default Applications Clocks
        Graphics                    : N/A
        Memory                      : N/A
    Max Clocks
        Graphics                    : 1911 MHz
        SM                          : 1911 MHz
        Memory                      : 5505 MHz
        Video                       : 1620 MHz
    Max Customer Boost Clocks
        Graphics                    : N/A
    Clock Policy
        Auto Boost                  : N/A
        Auto Boost Default          : N/A
    Processes
        Process ID                  : 1454
            Type                    : G
            Name                    : /usr/lib/xorg/Xorg
            Used GPU Memory         : 40 MiB
        Process ID                  : 1533
            Type                    : G
            Name                    : /usr/bin/gnome-shell
            Used GPU Memory         : 80 MiB
        Process ID                  : 2450
            Type                    : G
            Name                    : /usr/lib/xorg/Xorg
            Used GPU Memory         : 363 MiB
        Process ID                  : 2631
            Type                    : G
            Name                    : /usr/bin/gnome-shell
            Used GPU Memory         : 142 MiB
        Process ID                  : 3068
            Type                    : G
            Name                    : /opt/google/chrome/chrome --type=gpu-process --field-trial-handle=15466691898050642703,2714747135580672923,131072 --enable-crash-reporter=b6227030-26a9-487c-b99f-efddda704fbf, --gpu-preferences=KAAAAAAAAACAAABAAQAAAAAAAAAAAGAAAAAAAAAAAAAIAAAAAAAAAAgAAAAAAAAA --enable-crash-reporter=b6227030-26a9-487c-b99f-efddda704fbf, --service-request-channel-token=405587616121577545
            Used GPU Memory         : 121 MiB
  • Docker version from docker version
Client:
 Version:           18.06.1-ce
 API version:       1.38
 Go version:        go1.10.3
 Git commit:        e68fc7a
 Built:             Tue Aug 21 17:24:51 2018
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          18.06.1-ce
  API version:      1.38 (minimum version 1.12)
  Go version:       go1.10.3
  Git commit:       e68fc7a
  Built:            Tue Aug 21 17:23:15 2018
  OS/Arch:          linux/amd64
  Experimental:     false
  • NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
un  libgldispatch0-nvidia      <none>             <none>             (no description available)
ii  libnvidia-cfg1-410:amd64   410.73-0ubuntu0~gp amd64              NVIDIA binary OpenGL/GLX configuration library
un  libnvidia-cfg1-any         <none>             <none>             (no description available)
un  libnvidia-common           <none>             <none>             (no description available)
ii  libnvidia-common-410       410.73-0ubuntu0~gp all                Shared files used by the NVIDIA libraries
rc  libnvidia-compute-390:amd6 390.87-0ubuntu1    amd64              NVIDIA libcompute package
rc  libnvidia-compute-390:i386 390.87-0ubuntu1    i386               NVIDIA libcompute package
rc  libnvidia-compute-396:amd6 396.54-0ubuntu0~gp amd64              NVIDIA libcompute package
rc  libnvidia-compute-396:i386 396.54-0ubuntu0~gp i386               NVIDIA libcompute package
ii  libnvidia-compute-410:amd6 410.73-0ubuntu0~gp amd64              NVIDIA libcompute package
ii  libnvidia-compute-410:i386 410.73-0ubuntu0~gp i386               NVIDIA libcompute package
ii  libnvidia-container-tools  1.0.0-1            amd64              NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64 1.0.0-1            amd64              NVIDIA container runtime library
un  libnvidia-decode           <none>             <none>             (no description available)
ii  libnvidia-decode-410:amd64 410.73-0ubuntu0~gp amd64              NVIDIA Video Decoding runtime libraries
ii  libnvidia-decode-410:i386  410.73-0ubuntu0~gp i386               NVIDIA Video Decoding runtime libraries
un  libnvidia-encode           <none>             <none>             (no description available)
ii  libnvidia-encode-410:amd64 410.73-0ubuntu0~gp amd64              NVENC Video Encoding runtime library
ii  libnvidia-encode-410:i386  410.73-0ubuntu0~gp i386               NVENC Video Encoding runtime library
un  libnvidia-fbc1             <none>             <none>             (no description available)
ii  libnvidia-fbc1-410:amd64   410.73-0ubuntu0~gp amd64              NVIDIA OpenGL-based Framebuffer Capture runtime library
ii  libnvidia-fbc1-410:i386    410.73-0ubuntu0~gp i386               NVIDIA OpenGL-based Framebuffer Capture runtime library
un  libnvidia-gl               <none>             <none>             (no description available)
ii  libnvidia-gl-410:amd64     410.73-0ubuntu0~gp amd64              NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
ii  libnvidia-gl-410:i386      410.73-0ubuntu0~gp i386               NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
un  libnvidia-ifr1             <none>             <none>             (no description available)
ii  libnvidia-ifr1-410:amd64   410.73-0ubuntu0~gp amd64              NVIDIA OpenGL-based Inband Frame Readback runtime library
ii  libnvidia-ifr1-410:i386    410.73-0ubuntu0~gp i386               NVIDIA OpenGL-based Inband Frame Readback runtime library
un  nvidia-304                 <none>             <none>             (no description available)
un  nvidia-340                 <none>             <none>             (no description available)
un  nvidia-384                 <none>             <none>             (no description available)
un  nvidia-390                 <none>             <none>             (no description available)
un  nvidia-common              <none>             <none>             (no description available)
rc  nvidia-compute-utils-390   390.87-0ubuntu1    amd64              NVIDIA compute utilities
rc  nvidia-compute-utils-396   396.54-0ubuntu0~gp amd64              NVIDIA compute utilities
ii  nvidia-compute-utils-410   410.73-0ubuntu0~gp amd64              NVIDIA compute utilities
ii  nvidia-container-runtime   2.0.0+docker18.06. amd64              NVIDIA container runtime
ii  nvidia-container-runtime-h 1.4.0-1            amd64              NVIDIA container runtime hook
ii  nvidia-cuda-dev            9.1.85-4ubuntu1    amd64              NVIDIA CUDA development files
ii  nvidia-cuda-doc            9.1.85-4ubuntu1    all                NVIDIA CUDA and OpenCL documentation
ii  nvidia-cuda-gdb            9.1.85-4ubuntu1    amd64              NVIDIA CUDA Debugger (GDB)
ii  nvidia-cuda-toolkit        9.1.85-4ubuntu1    amd64              NVIDIA CUDA development toolkit
rc  nvidia-dkms-390            390.87-0ubuntu1    amd64              NVIDIA DKMS package
rc  nvidia-dkms-396            396.54-0ubuntu0~gp amd64              NVIDIA DKMS package
ii  nvidia-dkms-410            410.73-0ubuntu0~gp amd64              NVIDIA DKMS package
un  nvidia-dkms-kernel         <none>             <none>             (no description available)
un  nvidia-docker              <none>             <none>             (no description available)
ii  nvidia-docker2             2.0.3+docker18.06. all                nvidia-docker CLI wrapper
un  nvidia-driver              <none>             <none>             (no description available)
ii  nvidia-driver-410          410.73-0ubuntu0~gp amd64              NVIDIA driver metapackage
un  nvidia-driver-binary       <none>             <none>             (no description available)
un  nvidia-kernel-common       <none>             <none>             (no description available)
rc  nvidia-kernel-common-390   390.87-0ubuntu1    amd64              Shared files used with the kernel module
rc  nvidia-kernel-common-396   396.54-0ubuntu0~gp amd64              Shared files used with the kernel module
ii  nvidia-kernel-common-410   410.73-0ubuntu0~gp amd64              Shared files used with the kernel module
un  nvidia-kernel-source       <none>             <none>             (no description available)
un  nvidia-kernel-source-390   <none>             <none>             (no description available)
un  nvidia-kernel-source-396   <none>             <none>             (no description available)
ii  nvidia-kernel-source-410   410.73-0ubuntu0~gp amd64              NVIDIA kernel source package
un  nvidia-legacy-304xx-vdpau- <none>             <none>             (no description available)
un  nvidia-legacy-340xx-vdpau- <none>             <none>             (no description available)
un  nvidia-libopencl1          <none>             <none>             (no description available)
un  nvidia-libopencl1-dev      <none>             <none>             (no description available)
ii  nvidia-opencl-dev:amd64    9.1.85-4ubuntu1    amd64              NVIDIA OpenCL development files
un  nvidia-opencl-icd          <none>             <none>             (no description available)
ii  nvidia-openjdk-8-jre       9.1.85-4ubuntu1    amd64              NVIDIA provided OpenJDK Java runtime, using Hotspot JIT
un  nvidia-persistenced        <none>             <none>             (no description available)
ii  nvidia-prime               0.8.10             all                Tools to enable NVIDIA's Prime
ii  nvidia-profiler            9.1.85-4ubuntu1    amd64              NVIDIA Profiler for CUDA and OpenCL
ii  nvidia-settings            410.73-0ubuntu0~gp amd64              Tool for configuring the NVIDIA graphics driver
un  nvidia-settings-binary     <none>             <none>             (no description available)
un  nvidia-smi                 <none>             <none>             (no description available)
un  nvidia-utils               <none>             <none>             (no description available)
ii  nvidia-utils-410           410.73-0ubuntu0~gp amd64              NVIDIA driver support binaries
un  nvidia-vdpau-driver        <none>             <none>             (no description available)
ii  nvidia-visual-profiler     9.1.85-4ubuntu1    amd64              NVIDIA Visual Profiler for CUDA and OpenCL
ii  xserver-xorg-video-nvidia- 410.73-0ubuntu0~gp amd64              NVIDIA binary Xorg driver
dpkg-query: no packages found matching *nvidia*rpm
dpkg-query: no packages found matching -qa
  • NVIDIA container library version from nvidia-container-cli -V
version: 1.0.0
build date: 2018-09-20T20:19+00:00
build revision: 881c88e2e5bb682c9bb14e68bd165cfb64563bb1
build compiler: x86_64-linux-gnu-gcc-7 7.3.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
  • NVIDIA container library logs (see troubleshooting)
  • Docker command, image and tag used
docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi
@alexanderfrey
Author

Do I have to install any other libraries apart from the NVIDIA drivers on the host machine?

@JanuszBartosz

Bumping this up. I am having exactly the same problem, also on Ubuntu 18.10, with driver version 390.87.

@symmsaur

symmsaur commented Nov 2, 2018

I have similar symptoms, but I can run nvidia-smi after executing ldconfig inside the container. I'm using driver version 410.73.

>docker run --runtime=nvidia --rm -it nvidia/cuda:9.0-base bash
root@9b2ab11c3ff9:/# nvidia-smi
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.
root@9b2ab11c3ff9:/# ldconfig
root@9b2ab11c3ff9:/# nvidia-smi
Fri Nov  2 15:35:25 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.73       Driver Version: 410.73       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 105...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   49C    P8    N/A /  N/A |    289MiB /  4040MiB |     14%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

@alexanderfrey
Author

@symmsaur Can you import TensorFlow in the Docker container after ldconfig?

@flx42
Member

flx42 commented Nov 2, 2018

Mmm, given the symptoms, you are probably stumbling into the issue that was fixed by this commit:
NVIDIA/libnvidia-container@deccb28
You would need to wait for the next release of the library.

@alexanderfrey
Author

alexanderfrey commented Nov 2, 2018

Mmm, given the symptoms, you are probably stumbling into the issue that was fixed by this commit:
NVIDIA/libnvidia-container@deccb28
You would need to wait for the next release of the library.

Thanks for the information. I will wait for the next release, or compile libnvidia-container myself if I can't wait that long. Running ldconfig manually helped, though!

Many thanks to @symmsaur...

Best

@RenaudWasTaken RenaudWasTaken changed the title NVIDIA-SMI couldn't find libnvidia-ml.so library in your system Couldn't find libnvidia-ml.so library in your system Dec 20, 2018
@ddurnev

ddurnev commented Dec 27, 2018

The same for me.

docker run --runtime=nvidia --rm nvidia/cuda:9.0-base ldconfig && nvidia-smi

works; without ldconfig it fails with the same error.

Running ldconfig inside the container fixes any issue with failing to resolve .so libraries (they actually resolve to the host system's libraries): TensorFlow (image nvcr.io/nvidia/tensorflow:18.09-py3) imports and runs fine after that.

I confirm that the above-mentioned commit fixes the problem: I re-compiled the latest master branch and replaced the library in my system path. The NGC TensorFlow image now works out of the box on Ubuntu 18.10 with NVIDIA driver 415.25.
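
(For reference, a minimal rebuild sketch; the clone URL is the GitHub mirror and the plain make targets are assumptions -- check the repository README for the actual build dependencies:)

git clone https://github.com/NVIDIA/libnvidia-container.git
cd libnvidia-container
make                 # builds the library and nvidia-container-cli
sudo make install    # replaces the installed library in the system path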

@moniquelive

moniquelive commented Dec 27, 2018

The same for me.

docker run --runtime=nvidia --rm nvidia/cuda:9.0-base ldconfig && nvidia-smi

@ddurnev you probably ran nvidia-smi on the host (compare the Processes list inside and outside the container)

@ddurnev

ddurnev commented Dec 28, 2018

@lccro Yes, you're right: this runs only the first part, ldconfig, inside the container. Correct is something like:

docker run --runtime=nvidia --rm nvidia/cuda:9.0-base /bin/bash -c "ldconfig && nvidia-smi"

Still, nvidia-smi works without ldconfig only after the patch for libnvidia-container is applied.

@botalaszlo

I have the same problem on Fedora 29, with the NVIDIA 415 driver and nvidia-docker 2.0.3.

$ sudo docker run --runtime=nvidia --rm nvidia/cuda:10.0-base nvidia-smi
Unable to find image 'nvidia/cuda:10.0-base' locally
10.0-base: Pulling from nvidia/cuda
473ede7ed136: Pull complete 
c46b5fa4d940: Pull complete 
93ae3df89c92: Pull complete 
6b1eed27cade: Pull complete 
cb5511f09cc0: Pull complete 
4173c1e5c714: Pull complete 
Digest: sha256:7ba25f8ec32821f4225a73d6cd3df5ccf70ecc9622724f64c61b123f2bde5b90
Status: Downloaded newer image for nvidia/cuda:10.0-base
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.

But on the host it works well:

$ nvidia-smi 
Thu Jan  3 14:11:28 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 415.25       Driver Version: 415.25       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 00000000:09:00.0  On |                  N/A |
| 28%   30C    P8     8W / 180W |    341MiB /  8116MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1328      G   /usr/libexec/Xorg                             40MiB |
|    0      1510      G   /usr/bin/gnome-shell                          48MiB |
|    0      1806      G   /usr/libexec/Xorg                            126MiB |
|    0      1922      G   /usr/bin/gnome-shell                         122MiB |
+-----------------------------------------------------------------------------+

Additional information about the NVIDIA card:

$ whereis nvidia-smi
nvidia-smi: /usr/bin/nvidia-smi /usr/share/man/man1/nvidia-smi.1.gz
$ nvidia-installer -v |grep version
nvidia-installer:  version 415.25
$ uname -a
Linux localhost.localdomain 4.19.13-300.fc29.x86_64 #1 SMP Sat Dec 29 22:54:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
$ lspci |grep -E "VGA|3D"
09:00.0 VGA compatible controller: NVIDIA Corporation GP104 [GeForce GTX 1080] (rev a1)

More information about nvidia-docker:

$ nvidia-docker version
NVIDIA Docker: 2.0.3

I have followed this guide for the NVIDIA driver installation process.

@andyneff

andyneff commented Jan 4, 2019

@botalaszlo I have the same problem on Fedora 29 after dnf update today.

Running ldconfig in the container does make it work; the .so file is found at /usr/local/cuda-9.2/targets/x86_64-linux/lib/stubs/libnvidia-ml.so

Does this work for you?

docker run --runtime=nvidia --rm nvidia/cuda:10.0-base bash -c "ldconfig; nvidia-smi"

@botalaszlo

@andyneff Perfect! This works fine.

$ docker run --runtime=nvidia --rm nvidia/cuda:10.0-base bash -c "ldconfig; nvidia-smi"
Fri Jan  4 16:43:54 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 415.25       Driver Version: 415.25       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 00000000:09:00.0  On |                  N/A |
| 35%   49C    P8     8W / 180W |    254MiB /  8116MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Maybe the documentation should be updated with your note :)

@andyneff

andyneff commented Jan 8, 2019

@botalaszlo It's not a documentation issue: you shouldn't have to run ldconfig. This is a bug, and manually running ldconfig is just a workaround for the ldconfig cache not being correct in the image.

@andyneff

andyneff commented Jan 9, 2019

I just found out today the hard way that this bug affects more than just nvidia stuff.

docker run --runtime=nvidia --rm nvidia/cuda:10.0-base ldconfig -p
0 libs found in cache `/etc/ld.so.cache'

This breaks anything in Python that uses find_library (which is a lot), if not everything ld-cache related.
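
(A sketch of that breakage, assuming a container that has python3 installed -- the base CUDA image may not. On Linux, find_library consults the ldconfig cache:)

python3 -c 'from ctypes.util import find_library; print(find_library("nvidia-ml"))'
# prints None while the cache is empty; after running ldconfig it should print something like libnvidia-ml.so.1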

@flx42 Any idea when the next release will be?

@RenaudWasTaken
Contributor

This should be fixed with the latest version of the libnvidia-container packages.
Closing, feel free to reopen if the bug persists.

@andyneff

Tested on Fedora 29, updated

  • docker 18.09.1-ce
  • nvidia-docker 2.0.3-1.docker18.09.0 -> 2.0.3-1.docker18.09.0
  • nvidia-container-runtime 2.0.0-1.docker18.09.0 -> 2.0.0-1.docker18.09.1
  • libnvidia-container1 1.0.0-1 -> 1.0.1-1
  • kernel 4.19.13 -> 4.19.15

After the update, confirmed fixed! Thanks @RenaudWasTaken

@edoardogiacomello

I'm still experiencing this bug; running ldconfig makes nvidia-smi work.

  • OS: Ubuntu 18.04 LTS
  • Docker version 18.09.3
  • nvidia docker 2.0.3+docker18.09.3-1
  • nvidia-container-runtime 2.0.0+docker18.09.3-1
  • libnvidia-container 1.0.1-1
  • kernel 4.18.0-16

How can I make it work without running ldconfig first?
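
(One possible stopgap until a fixed package lands, a sketch: bake the ldconfig refresh into the image's entrypoint so every container rebuilds its linker cache at start. The tag and CMD here are just examples:)

FROM nvidia/cuda:10.0-base
# refresh the ld cache (populated from the runtime-mounted driver libs), then run the real command
ENTRYPOINT ["/bin/bash", "-c", "ldconfig && exec \"$@\"", "--"]
CMD ["nvidia-smi"]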

@andyneff

Just supplying more (possibly useless) info. Still working on Fedora:

  • OS: Fedora 29
  • Docker version 18.09.3
  • nvidia-docker 2.0.3-1.docker18.09.3.ce
  • nvidia-container-runtime 2.0.0-1.docker18.09.3
  • libnvidia-container 1.0.1-1
  • kernel 4.20.14-200

Test

docker run --runtime=nvidia --rm nvidia/cuda@sha256:3cba5c5a8f37ba05b2710071907bd8da22ad1dc828025687b2435b1308a138ff nvidia-smi # that's today's digest ID for tag 10.0-base

@jjacobelli
Contributor

jjacobelli commented Mar 25, 2019

@edoardogiacomello What is your current version of ld?

@edoardogiacomello

@edoardogiacomello What is your current version of ld?

On the host I got: GNU ld (GNU Binutils for Ubuntu) 2.30
Inside the Docker container: GNU ld (GNU Binutils for Ubuntu) 2.26.1

@Brainiarc7

Yep, same issue here with the latest version:

  1. Without ldconfig:
docker run --gpus all nvidia/cuda:10.1-base nvidia-smi
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.

  2. And with ldconfig:
docker run --gpus all nvidia/cuda:10.1-base ldconfig && nvidia-smi
Fri Aug 30 14:11:27 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 435.21       Driver Version: 435.21       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 2080    Off  | 00000000:01:00.0  On |                  N/A |
| N/A   48C    P8     5W /  N/A |    223MiB /  7973MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1942      G   /usr/lib/xorg/Xorg                            18MiB |
|    0      2057      G   /usr/bin/gnome-shell                          57MiB |
|    0      2936      G   /usr/lib/xorg/Xorg                            69MiB |
|    0      3073      G   /usr/bin/gnome-shell                          76MiB |
+-----------------------------------------------------------------------------+

So yeah, still broken:

docker --version
Docker version 19.03.1, build 74b1e89

@mash-graz

mash-graz commented Sep 18, 2019

So yeah, still broken:

yes -- I'm also stumbling over this bug on Debian testing:

~$ docker run --gpus all nvidia/cuda nvidia-smi
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.

vs.

~$ docker run --gpus all nvidia/cuda ldconfig && nvidia-smi
Wed Sep 18 17:13:19 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.40       Driver Version: 430.40       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 107...  On   | 00000000:01:00.0 Off |                  N/A |
| 33%   29C    P8     6W / 180W |      1MiB /  8119MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
~$ docker version
Client: Docker Engine - Community
 Version:           19.03.2
 API version:       1.40
 Go version:        go1.12.8
 Git commit:        6a30dfc
 Built:             Thu Aug 29 05:29:29 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.2
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.8
  Git commit:       6a30dfc
  Built:            Thu Aug 29 05:28:05 2019
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.6
  GitCommit:        894b81a4b802e4eb2a91d1ce216b8817763c29fb
 runc:
  Version:          1.0.0-rc8
  GitCommit:        425e105d5a03fabd737a126ad93d62a9eeede87f
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683

libnvidia-container1:amd64/buster 1.0.5-1 uptodate

The ldconfig workaround doesn't look acceptable...
How can we finally fix this long-standing issue?

@glennie

glennie commented Sep 24, 2019

Hi "nvidia",
Can you provide an ETA on this?
It is really painful to run ldconfig on each command (and enable root access/sudo without password in the container).
Many thanks and kind regards.

@nvjmayo
Contributor

nvjmayo commented Oct 3, 2019

Works for me on Ubuntu 18.04 and Debian 10.

Here's a run from scratch on Debian 10 (it looks the same for me on Ubuntu as well), without ever manually running ldconfig. I'm using nvidia-container-toolkit and have removed the old nvidia-docker2:

~/ docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi
Unable to find image 'nvidia/cuda:9.0-base' locally
9.0-base: Pulling from nvidia/cuda
f7277927d38a: Pull complete
8d3eac894db4: Pull complete
edf72af6d627: Pull complete
3e4f86211d23: Pull complete
d6e9603ff777: Pull complete
9454aa7cddfc: Pull complete
a296dc1cdef1: Pull complete
Digest: sha256:1883759ad42016faba1e063d6d86d5875cecf21c420a5c1c20c27c41e46dae44
Status: Downloaded newer image for nvidia/cuda:9.0-base
Thu Oct  3 17:44:45 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GT 710      Off  | 00000000:65:00.0 N/A |                  N/A |
| 50%   39C    P0    N/A /  N/A |      0MiB /  2001MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0                    Not Supported                                       |
+-----------------------------------------------------------------------------+

I recommend uninstalling and re-installing the driver and packages. It's possible your host system is in a strange state and it's impacting something in your setup.

@glennie
Sorry, we don't see this issue on our end. Without a better understanding of what problem you're specifically facing, we can't offer an ETA on a fix.

@andyneff

andyneff commented Oct 7, 2019

@glennie @mash-graz @Brainiarc7 do any of you get the same result if you use nvjmayo's exact same SHA?

docker run --runtime=nvidia --rm nvidia/cuda@sha256:1883759ad42016faba1e063d6d86d5875cecf21c420a5c1c20c27c41e46dae44 nvidia-smi

@mash-graz

@andyneff

your command line produces this error message on my machine:

local@bonsai:~$  docker run --runtime=nvidia --rm nvidia/cuda@sha256:1883759ad42016faba1e063d6d86d5875cecf21c420a5c1c20c27c41e46dae44 nvidia-smi
docker: Error response from daemon: Unknown runtime specified nvidia.
See 'docker run --help'.

using the --gpus option instead produces:

local@bonsai:~$ docker run --gpus all --rm nvidia/cuda@sha256:1883759ad42016faba1e063d6d86d5875cecf21c420a5c1c20c27c41e46dae44 nvidia-smi
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.

and manually adding ldconfig finally works again:

local@bonsai:~$ docker run --gpus all --rm nvidia/cuda@sha256:1883759ad42016faba1e063d6d86d5875cecf21c420a5c1c20c27c41e46dae44 ldconfig && nvidia-smi
Mon Oct  7 19:10:11 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50       Driver Version: 430.50       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 107...  On   | 00000000:01:00.0 Off |                  N/A |
| 33%   26C    P8     6W / 180W |      1MiB /  8119MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

But I should perhaps mention that I do not use the NVIDIA drivers on this machine for the actual video output. I prefer to use the onboard Intel chip for that purpose, because otherwise I'm not able to share the graphics card with qemu-kvm instances via PCIe passthrough, and I mostly need the NVIDIA card only for CUDA-based GPGPU work. The setup could therefore differ slightly from other installations.

@glennie

glennie commented Oct 7, 2019

Hello,

~/ docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi

Maybe I'm missing something here... Why are you (@nvjmayo) using the --runtime option?

I used --gpus all (as I've got docker 19.03.2).

Using the sha256 specified by @andyneff with --gpus all, I still have the same issue:

[glennie@hestia ~]$ docker run --gpus all --rm nvidia/cuda@sha256:1883759ad42016faba1e063d6d86d5875cecf21c420a5c1c20c27c41e46dae44 nvidia-smi
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.

But, it works when I use ldconfig before:

Mon Oct  7 19:12:49 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50       Driver Version: 430.50       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce MX130       Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   53C    P0    N/A /  N/A |      0MiB /  2004MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Kind regards,

@nvjmayo
Contributor

nvjmayo commented Nov 7, 2019

Maybe I'm missing something here... Why are you (@nvjmayo) using the --runtime option?

My mistake, I have multiple runtimes installed for a bunch of different environments (both for docker and podman). I should have pasted the canonical form. Sorry for the confusion.

But, it works when I use ldconfig before:

I'll ask the team to bump up the priority on fixing this. It's a question of at what stage to run the container hooks. Automatically running ldconfig when needed is something we're looking into. When to do it, what mechanism to use, and whether we should stop a running container are all open questions for an implementation.

The best way to work around the issue right now is to run ldconfig in the container whenever you upgrade your host driver. Admittedly inconvenient.
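
(For a long-running container, that refresh is a one-liner, sketched with a hypothetical container name:)

docker exec -u 0 my-gpu-container ldconfig   # -u 0 because ldconfig needs root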

@RenaudWasTaken
Contributor

Hello!

Can you give us a bit more information?

  • uname -a
  • ldconfig --version

Thanks!

@mash-graz

mash-graz commented Nov 8, 2019

Can you give us a bit more information?

uname -a

~$ uname -a
Linux bonsai 5.2.0-3-amd64 #1 SMP Debian 5.2.17-1 (2019-09-26) x86_64 GNU/Linux

ldconfig --version

~$ sudo ldconfig --version
ldconfig (Debian GLIBC 2.29-2) 2.29

I hope that helps!

BTW: I'm using Debian testing as a rolling-release solution, which isn't uncommon for GPGPU/ML work, because the software in the stable Debian branch is usually too outdated for the requirements and the fast progress in this field.

@lyon667

lyon667 commented Jan 5, 2020

Can you try replacing "@/sbin/ldconfig" with "/sbin/ldconfig" in /etc/nvidia-container-runtime/config.toml? Not sure why but this helped in my case.
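
(Concretely, the relevant line lives under the [nvidia-container-cli] section of /etc/nvidia-container-runtime/config.toml:)

[nvidia-container-cli]
ldconfig = "/sbin/ldconfig"   # was: ldconfig = "@/sbin/ldconfig"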

@mash-graz

Is the reason for, or the actual meaning of, the "@" syntax used in https://gitlab.com/nvidia/container-toolkit/toolkit/blob/master/config/config.toml.debian documented or explained anywhere?

@regzon

regzon commented Jan 8, 2020

Can you try replacing "@/sbin/ldconfig" with "/sbin/ldconfig" in /etc/nvidia-container-runtime/config.toml? Not sure why but this helped in my case.

Thank you, works for me. I am using Debian testing (same as @mash-graz).
Is there going to be an official update with a fix from NVIDIA?

@marcinz

marcinz commented Feb 19, 2020

Thank you @lyon667. It worked for me as well, after wasting many hours of my time. Why does this work, @RenaudWasTaken?

@martinmCGG

Is the reason for, or the actual meaning of, the "@" syntax used in https://gitlab.com/nvidia/container-toolkit/toolkit/blob/master/config/config.toml.debian documented or explained anywhere?

I did not find any documentation, but it seems to be processed here, in nvc_ldcache_update in libnvidia-container.

@netbrain

netbrain commented Sep 8, 2020

Can you try replacing "@/sbin/ldconfig" with "/sbin/ldconfig" in /etc/nvidia-container-runtime/config.toml? Not sure why but this helped in my case.

this solved it for me.

@mash-graz

this solved it for me.

Yes! -- this manual removal of the @ sign works for me as well.

I also don't understand why this particular issue still isn't fixed in the released nvidia-docker packages and still affects Debian installations.

1mckenna added a commit to 1mckenna/crackerjack-docker that referenced this issue Dec 13, 2020
@e7d

e7d commented Apr 18, 2021

Can you try replacing "@/sbin/ldconfig" with "/sbin/ldconfig" in /etc/nvidia-container-runtime/config.toml? Not sure why but this helped in my case.

Had the same problem on a fresh Debian 10 install (openmediavault 5.6.2). This solved it for me too. Do we have a patch here, or is this on the Debian side?

@ichsan2895

My error was solved this way.

This is the answer:

This led me to finding another solution by looking into the /etc/nvidia-container-runtime/config.toml file, where ldconfig is by default set to "@/sbin/ldconfig". This for some reason does not work and also produces the error above:

root@banshee:/var/log# docker run --rm --gpus=all nvidia/cuda:11.4-base nvidia-smi
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.

Changing the ldconfig path to "/sbin/ldconfig" (instead of "@/sbin/ldconfig") does indeed fix the problem:

root@banshee:/var/log# docker run --rm --gpus=all nvidia/cuda:11.4-base nvidia-smi
Sun Jan  5 20:39:45 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.64       Driver Version: 430.64       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 970     On   | 00000000:01:00.0  On |                  N/A |
| 32%   39C    P8    16W / 170W |    422MiB /  4038MiB |      3%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

@klueska
Contributor

klueska commented Mar 22, 2022

The newest version of nvidia-docker should resolve these issues with ldconfig not properly setting up the library search path on Debian systems before a container gets launched.

Specifically this change in libnvidia-container fixes the issue and is included as part of the latest release:
https://gitlab.com/nvidia/container-toolkit/libnvidia-container/-/merge_requests/141

The latest release packages for the full nvidia-docker stack:

libnvidia-container1-1.9.0
libnvidia-container-tools-1.9.0
nvidia-container-toolkit-1.9.0
nvidia-container-runtime-3.9.0
nvidia-docker-2.10.0

@wajeehulhassanvii

@botalaszlo I have the same problem on Fedora 29 after dnf update today.

Running ldconfig in the container does make it work; the .so file is found at /usr/local/cuda-9.2/targets/x86_64-linux/lib/stubs/libnvidia-ml.so

Does this work for you?

docker run --runtime=nvidia --rm nvidia/cuda:10.0-base bash -c "ldconfig; nvidia-smi"

If the above works, then follow this article: set no-cgroups = false and ldconfig = "/sbin/ldconfig" in /etc/nvidia-container-runtime/config.toml; hopefully that will solve the problem. Worked for me.
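
(The combined change, both keys under the [nvidia-container-cli] section of /etc/nvidia-container-runtime/config.toml:)

[nvidia-container-cli]
no-cgroups = false
ldconfig = "/sbin/ldconfig"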

@Xosrov

Xosrov commented Nov 4, 2023

After HOURS of wasted time I found out that the problem was Docker being installed from snap... Removing the snap version fixed it for me.
