
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system #1163

Closed
markj24 opened this issue Dec 23, 2019 · 35 comments

Comments


markj24 commented Dec 23, 2019

1. Issue or feature description

I receive the error "NVIDIA-SMI couldn't find libnvidia-ml.so library in your system" when running nvidia-smi within a container. I'm sure the driver is installed correctly, as I get the correct output from nvidia-smi when it is run on the host. Running ldconfig within the container corrects this temporarily, until the container is updated.
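For reference, the temporary workaround mentioned above can be applied to an already-running container from the host. A minimal sketch, assuming a hypothetical container name my-gpu-container:

docker exec my-gpu-container ldconfig

This only rebuilds the linker cache inside that container; it has to be repeated whenever the container is recreated.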

2. Steps to reproduce the issue

docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi

NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.

3. Information to attach (optional if deemed irrelevant)

  • Some nvidia-container information: nvidia-container-cli -k -d /dev/tty info
  • Kernel version from uname -a

Linux openmediavault.local 5.3.0-0.bpo.2-amd64 #1 SMP Debian 5.3.9-2~bpo10+1 (2019-11-13) x86_64 GNU/Linux

  • Any relevant kernel output lines from dmesg
  • Driver information from nvidia-smi -a

==============NVSMI LOG==============

Timestamp : Mon Dec 23 17:11:55 2019
Driver Version : 440.44
CUDA Version : 10.2

Attached GPUs : 1
GPU 00000000:83:00.0
Product Name : Quadro P2000
Product Brand : Quadro
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1422019086300
GPU UUID : GPU-67caad7d-2744-4ec8-7a48-e17278af1025
Minor Number : 0
VBIOS Version : 86.06.74.00.01
MultiGPU Board : No
Board ID : 0x8300
GPU Part Number : 900-5G410-1700-000
Inforom Version
Image Version : G410.0502.00.02
OEM Object : 1.1
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x83
Device : 0x00
Domain : 0x0000
Device Id : 0x1C3010DE
Bus Id : 00000000:83:00.0
Sub System Id : 0x11B310DE
GPU Link Info
PCIe Generation
Max : 3
Current : 3
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : 64 %
Performance State : P0
Clocks Throttle Reasons
Idle : Not Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 5059 MiB
Used : 0 MiB
Free : 5059 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 2 MiB
Free : 254 MiB
Compute Mode : Default
Utilization
Gpu : 2 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Aggregate
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Temperature
GPU Current Temp : 35 C
GPU Shutdown Temp : 104 C
GPU Slowdown Temp : 101 C
GPU Max Operating Temp : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 17.71 W
Power Limit : 75.00 W
Default Power Limit : 75.00 W
Enforced Power Limit : 75.00 W
Min Power Limit : 75.00 W
Max Power Limit : 75.00 W
Clocks
Graphics : 1075 MHz
SM : 1075 MHz
Memory : 3499 MHz
Video : 999 MHz
Applications Clocks
Graphics : 1075 MHz
Memory : 3504 MHz
Default Applications Clocks
Graphics : 1075 MHz
Memory : 3504 MHz
Max Clocks
Graphics : 1721 MHz
SM : 1721 MHz
Memory : 3504 MHz
Video : 1556 MHz
Max Customer Boost Clocks
Graphics : 1721 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes : None

  • Docker version from docker version

Client: Docker Engine - Community
Version: 19.03.5
API version: 1.40
Go version: go1.12.12
Git commit: 633a0ea838
Built: Wed Nov 13 07:25:38 2019
OS/Arch: linux/amd64
Experimental: false

Server: Docker Engine - Community
Engine:
Version: 19.03.5
API version: 1.40 (minimum version 1.12)
Go version: go1.12.12
Git commit: 633a0ea838
Built: Wed Nov 13 07:24:09 2019
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.2.10
GitCommit: b34a5c8af56e510852c35414db4c1f4fa6172339
runc:
Version: 1.0.0-rc8+dev
GitCommit: 3e425f80a8c931f88e6d94a8c831b9d5aa481657
docker-init:
Version: 0.18.0
GitCommit: fec3683

  • NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'

||/ Name Version Architecture Description
+++-=============================-============-============-=====================================================
ii libnvidia-container-tools 1.0.5-1 amd64 NVIDIA container runtime library (command-line tools)
ii libnvidia-container1:amd64 1.0.5-1 amd64 NVIDIA container runtime library
ii nvidia-container-runtime 3.1.4-1 amd64 NVIDIA container runtime
un nvidia-container-runtime-hook (no description available)
ii nvidia-container-toolkit 1.0.5-1 amd64 NVIDIA container runtime hook

  • NVIDIA container library version from nvidia-container-cli -V

version: 1.0.5
build date: 2019-09-06T16:59+00:00
build revision: 13b836390888f7b7c7dca115d16d7e28ab15a836
build compiler: x86_64-linux-gnu-gcc-8 8.3.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

  • NVIDIA container library logs (see troubleshooting)
  • Docker command, image and tag used

docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi


lyon667 commented Jan 5, 2020

I'm hitting it as well on a very similar setup, i.e. Debian 10 Buster with kernel 5.3.9 from backports and identical versions of the nvidia-container* packages, but a different NVIDIA driver version, 430.64. This issue also seems to be a duplicate of #854, which was closed without being resolved.

The error actually seems to stem from a missing ldconfig binary which is odd because it is definitely present in the container /sbin directory:

root@banshee:/var/log# docker run --rm --gpus=all nvidia/cuda:9.2-base ls -la /sbin/ | grep ldconfig
-rwxr-xr-x  1 root root       387 Feb  5  2019 ldconfig
-rwxr-xr-x  1 root root   1000608 Feb  5  2019 ldconfig.real

This error is however logged to nvidia-container-toolkit.log with debugging enabled:

I0105 19:55:43.487585 13429 nvc_ldcache.c:353] executing /sbin/ldconfig from host at /var/lib/docker/devicemapper/mnt/c73813553175c31ea9be80cb4c9ded21edf532a67639988fa8ee78c2a632c777/rootfs
E0105 19:55:43.488469 1 nvc_ldcache.c:384] could not start /sbin/ldconfig: process execution failed: no such file or directory

This led me to another solution: in the /etc/nvidia-container-runtime/config.toml file, ldconfig is set to "@/sbin/ldconfig" by default. For some reason this does not work and produces the error above:

root@banshee:/var/log# docker run --rm --gpus=all nvidia/cuda:9.2-base nvidia-smi
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.

Changing the ldconfig path to "/sbin/ldconfig" does indeed fix the problem:

root@banshee:/var/log# docker run --rm --gpus=all nvidia/cuda:9.2-base nvidia-smi
Sun Jan  5 20:39:45 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.64       Driver Version: 430.64       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 970     On   | 00000000:01:00.0  On |                  N/A |
| 32%   39C    P8    16W / 170W |    422MiB /  4038MiB |      3%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

I am, however, pretty sure that the default worked for me before with NVIDIA driver version 418.74, but I cannot confirm that the driver version is the cause of the problem here.
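For anyone wanting to apply the same change non-interactively, a minimal sketch (assuming the stock config location and the default value shown above; restarting Docker afterwards may not be strictly necessary, but doesn't hurt):

sudo sed -i 's|^ldconfig = "@/sbin/ldconfig"|ldconfig = "/sbin/ldconfig"|' /etc/nvidia-container-runtime/config.toml
sudo systemctl restart docker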

markj24 (Author) commented Jan 5, 2020

That did indeed fix the problem for me.

Thanks for the help.

@markj24 markj24 closed this as completed Jan 5, 2020

lyon667 commented Jan 5, 2020

@markj24 I would leave this bug open until someone figures out why the defaults don't work.

@markj24 markj24 reopened this Jan 5, 2020
@RenaudWasTaken (Contributor)

Hello and sorry for the delay!

Executing processes in the container is a fairly dangerous operation, so by default we use the host ldconfig (@/sbin/ldconfig.real). If it's not present, there is no fallback to the container ldconfig unless you manually specify it (by changing the value to /sbin/ldconfig.real in the /etc/nvidia-container-runtime/config.toml file).
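To make the semantics concrete: a value prefixed with @ is resolved and executed on the host, while an unprefixed value is executed inside the container. A hedged sketch of pointing the config at whichever host ldconfig actually exists (note that later comments in this thread show the host-relative form still failing on some Debian setups until libnvidia-container itself was fixed):

if [ -x /sbin/ldconfig.real ]; then v='@/sbin/ldconfig.real'; else v='@/sbin/ldconfig'; fi
sudo sed -i "s|^ldconfig = .*|ldconfig = \"$v\"|" /etc/nvidia-container-runtime/config.toml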

Hope this helps you!


lyon667 commented Jan 9, 2020

Hi @RenaudWasTaken and thank you for your response. Does that mean the host /sbin/ldconfig.real is the default regardless of the configuration in the config.toml file?

On Debian there's only /sbin/ldconfig present, not the ldconfig.real binary, which is likely Ubuntu-specific. Maybe the Debian-specific build needs to be tweaked in this case? The repository I'm using is https://nvidia.github.io/libnvidia-container/debian10/amd64

Cheers!

@RenaudWasTaken RenaudWasTaken reopened this Jan 9, 2020
@RenaudWasTaken (Contributor)

Yep, that's probably what is happening; I'll take a deeper look later this week.
Reopening for now.

@RenaudWasTaken (Contributor)

Sorry it took me so long to get back to you, but by default we should select the correct ldconfig on Debian: https://github.com/NVIDIA/container-toolkit/blob/master/config/config.toml.debian

You can also see that by extracting the tarball:

➜  tmp.EQlpOAlgVh wget https://github.com/NVIDIA/nvidia-container-runtime/raw/gh-pages/debian10/amd64/nvidia-container-toolkit_1.0.5-1_amd64.deb
--2020-02-10 14:06:28--  https://github.com/NVIDIA/nvidia-container-runtime/raw/gh-pages/debian10/amd64/nvidia-container-toolkit_1.0.5-1_amd64.deb
Resolving github.com (github.com)... 192.30.255.113
Connecting to github.com (github.com)|192.30.255.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/NVIDIA/nvidia-container-runtime/gh-pages/debian10/amd64/nvidia-container-toolkit_1.0.5-1_amd64.deb [following]
--2020-02-10 14:06:28--  https://raw.githubusercontent.com/NVIDIA/nvidia-container-runtime/gh-pages/debian10/amd64/nvidia-container-toolkit_1.0.5-1_amd64.deb
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.40.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.40.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 575548 (562K) [application/octet-stream]
Saving to: ‘nvidia-container-toolkit_1.0.5-1_amd64.deb’

nvidia-container-toolkit_ 100%[=====================================>] 562,06K  --.-KB/s    in 0,02s   

2020-02-10 14:06:28 (22,6 MB/s) - ‘nvidia-container-toolkit_1.0.5-1_amd64.deb’ saved [575548/575548]

➜  tmp.EQlpOAlgVh dk
➜  tmp.EQlpOAlgVh dpkg-deb -x nvidia-container-toolkit_1.0.5-1_amd64.deb .
➜  tmp.EQlpOAlgVh tree
.
├── etc
│   └── nvidia-container-runtime
│       └── config.toml
├── nvidia-container-toolkit_1.0.5-1_amd64.deb
└── usr
    ├── bin
    │   └── nvidia-container-toolkit
    └── share
        ├── doc
        │   └── nvidia-container-toolkit
        │       ├── changelog.Debian.gz
        │       └── copyright
        └── lintian
            └── overrides
                └── nvidia-container-toolkit

9 directories, 6 files
➜  tmp.EQlpOAlgVh cat etc/nvidia-container-runtime/config.toml 
disable-require = false
#swarm-resource = "DOCKER_RESOURCE_GPU"

[nvidia-container-cli]
#root = "/run/nvidia/driver"
#path = "/usr/bin/nvidia-container-cli"
environment = []
#debug = "/var/log/nvidia-container-toolkit.log"
#ldcache = "/etc/ld.so.cache"
load-kmods = true
#no-cgroups = false
#user = "root:video"
ldconfig = "@/sbin/ldconfig"

[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"
➜  tmp.EQlpOAlgVh 

Feel free to reopen if you encounter this on another machine!


ighack commented Jun 30, 2020

@lyon667 thanks, it works for me.

@bingzhangdai

I encountered the same problem. ldconfig = "@/sbin/ldconfig" is not correct for Debian 10; changing it to ldconfig = "/sbin/ldconfig" works.

Sorry it took me so long to get back to you but by default we should select the correct ldconfig on debian: https://github.com/NVIDIA/container-toolkit/blob/master/config/config.toml.debian

Does that mean the ldconfig path will change on Debian in a later release?

klueska (Contributor) commented Oct 5, 2020

@bingzhangdai Note that not using the @ symbol will cause it to use /sbin/ldconfig from inside the container, rather than the host ldconfig. If a container you are launching does not have ldconfig installed at this location, it could still cause problems.

@bingzhangdai

@klueska Oh, thanks for pointing that out. On my host (Debian 10), /sbin/ldconfig exists, but ldconfig = "@/sbin/ldconfig" is not working.

root@omv:~# cat /etc/nvidia-container-runtime/config.toml
disable-require = false
#swarm-resource = "DOCKER_RESOURCE_GPU"

[nvidia-container-cli]
#root = "/run/nvidia/driver"
#path = "/usr/bin/nvidia-container-cli"
environment = []
#debug = "/var/log/nvidia-container-toolkit.log"
#ldcache = "/etc/ld.so.cache"
load-kmods = true
#no-cgroups = false
#user = "root:video"
ldconfig = "@/sbin/ldconfig"
#alpha-merge-visible-devices-envvars = false

[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"
root@omv:~#
root@omv:~# ls -l /sbin/ldconfig
-rwxr-xr-x 1 root root 909096 May  2  2019 /sbin/ldconfig
root@omv:~#
root@omv:~# docker run --rm --gpus all nvidia/cuda:10.0-base bash -c "nvidia-smi"
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.

According to the discussion above, I understand it is not good to execute ldconfig inside the container, but how can I properly use the host's ldconfig?

klueska (Contributor) commented Oct 5, 2020

Can you enable the following setting:

[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"

Then run the container again and post the output of /var/log/nvidia-container-runtime.log.
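If it helps, a sketch of enabling every debug log in one go (this just uncomments the #debug entries in the stock config, then reruns the failing container and dumps both log files; the image tag is whichever one you have been testing with):

sudo sed -i 's|^#debug = |debug = |' /etc/nvidia-container-runtime/config.toml
docker run --rm --gpus all nvidia/cuda:10.0-base nvidia-smi
sudo cat /var/log/nvidia-container-toolkit.log /var/log/nvidia-container-runtime.log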

@bingzhangdai

I enabled all the logs in the config file, but /var/log/nvidia-container-runtime.log does not show up; restarting the docker service makes no difference. Here is /var/log/nvidia-container-toolkit.log:

-- WARNING, the following logs are for debugging purposes only --

I1005 11:43:18.098206 22796 nvc.c:282] initializing library context (version=1.2.0, build=d22237acaea94aa5ad5de70aac903534ed598819)
I1005 11:43:18.098251 22796 nvc.c:256] using root /
I1005 11:43:18.098257 22796 nvc.c:257] using ldcache /etc/ld.so.cache
I1005 11:43:18.098262 22796 nvc.c:258] using unprivileged user 65534:65534
I1005 11:43:18.098274 22796 nvc.c:299] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I1005 11:43:18.098362 22796 nvc.c:301] dxcore initialization failed, continuing assuming a non-WSL environment
I1005 11:43:18.100032 22801 nvc.c:192] loading kernel module nvidia
I1005 11:43:18.100241 22801 nvc.c:204] loading kernel module nvidia_uvm
I1005 11:43:18.100328 22801 nvc.c:212] loading kernel module nvidia_modeset
I1005 11:43:18.100581 22802 driver.c:101] starting driver service
I1005 11:43:18.630851 22796 nvc_container.c:364] configuring container with 'compute utility supervised'
I1005 11:43:18.631067 22796 nvc_container.c:212] selecting /var/lib/docker/overlay2/c80d842980a1891c3414a00dc50c636058948073b88f4367667717cbad4b66ac/merged/usr/local/cuda-10.0/compat/libcuda.so.410.129
I1005 11:43:18.631128 22796 nvc_container.c:212] selecting /var/lib/docker/overlay2/c80d842980a1891c3414a00dc50c636058948073b88f4367667717cbad4b66ac/merged/usr/local/cuda-10.0/compat/libnvidia-fatbinaryloader.so.410.129
I1005 11:43:18.631167 22796 nvc_container.c:212] selecting /var/lib/docker/overlay2/c80d842980a1891c3414a00dc50c636058948073b88f4367667717cbad4b66ac/merged/usr/local/cuda-10.0/compat/libnvidia-ptxjitcompiler.so.410.129
I1005 11:43:18.631362 22796 nvc_container.c:384] setting pid to 22789
I1005 11:43:18.631370 22796 nvc_container.c:385] setting rootfs to /var/lib/docker/overlay2/c80d842980a1891c3414a00dc50c636058948073b88f4367667717cbad4b66ac/merged
I1005 11:43:18.631375 22796 nvc_container.c:386] setting owner to 0:0
I1005 11:43:18.631381 22796 nvc_container.c:387] setting bins directory to /usr/bin
I1005 11:43:18.631386 22796 nvc_container.c:388] setting libs directory to /usr/lib/x86_64-linux-gnu
I1005 11:43:18.631391 22796 nvc_container.c:389] setting libs32 directory to /usr/lib/i386-linux-gnu
I1005 11:43:18.631396 22796 nvc_container.c:390] setting cudart directory to /usr/local/cuda
I1005 11:43:18.631401 22796 nvc_container.c:391] setting ldconfig to @/sbin/ldconfig (host relative)
I1005 11:43:18.631406 22796 nvc_container.c:392] setting mount namespace to /proc/22789/ns/mnt
I1005 11:43:18.631411 22796 nvc_container.c:394] setting devices cgroup to /sys/fs/cgroup/devices/docker/f827f60d96ce61b2e2a889fbc8a7f136b3c2b2d87510647fcd792510a2db0c66
I1005 11:43:18.631432 22796 nvc_info.c:679] requesting driver information with ''
I1005 11:43:18.633089 22796 nvc_info.c:168] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.440.100
I1005 11:43:18.633205 22796 nvc_info.c:168] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ptxjitcompiler.so.440.100
I1005 11:43:18.633266 22796 nvc_info.c:168] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so.440.100
I1005 11:43:18.633301 22796 nvc_info.c:168] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.440.100
I1005 11:43:18.633373 22796 nvc_info.c:168] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.440.100
I1005 11:43:18.633430 22796 nvc_info.c:168] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so.440.100
I1005 11:43:18.633499 22796 nvc_info.c:168] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-encode.so.440.100
I1005 11:43:18.633534 22796 nvc_info.c:168] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.440.100
I1005 11:43:18.633593 22796 nvc_info.c:168] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libnvcuvid.so.440.100
I1005 11:43:18.633780 22796 nvc_info.c:168] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libcuda.so.440.100
I1005 11:43:18.633904 22796 nvc_info.c:168] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libGLX_nvidia.so.440.100
I1005 11:43:18.633976 22796 nvc_info.c:168] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libEGL_nvidia.so.440.100
W1005 11:43:18.634020 22796 nvc_info.c:349] missing library libnvidia-cfg.so
W1005 11:43:18.634027 22796 nvc_info.c:349] missing library libnvidia-opencl.so
W1005 11:43:18.634032 22796 nvc_info.c:349] missing library libnvidia-allocator.so
W1005 11:43:18.634037 22796 nvc_info.c:349] missing library libnvidia-compiler.so
W1005 11:43:18.634042 22796 nvc_info.c:349] missing library libnvidia-ngx.so
W1005 11:43:18.634046 22796 nvc_info.c:349] missing library libvdpau_nvidia.so
W1005 11:43:18.634051 22796 nvc_info.c:349] missing library libnvidia-opticalflow.so
W1005 11:43:18.634055 22796 nvc_info.c:349] missing library libnvidia-fbc.so
W1005 11:43:18.634060 22796 nvc_info.c:349] missing library libnvidia-ifr.so
W1005 11:43:18.634065 22796 nvc_info.c:349] missing library libnvidia-rtcore.so
W1005 11:43:18.634069 22796 nvc_info.c:349] missing library libnvoptix.so
W1005 11:43:18.634074 22796 nvc_info.c:349] missing library libGLESv2_nvidia.so
W1005 11:43:18.634079 22796 nvc_info.c:349] missing library libGLESv1_CM_nvidia.so
W1005 11:43:18.634083 22796 nvc_info.c:349] missing library libnvidia-glvkspirv.so
W1005 11:43:18.634088 22796 nvc_info.c:349] missing library libnvidia-cbl.so
W1005 11:43:18.634092 22796 nvc_info.c:353] missing compat32 library libnvidia-ml.so
W1005 11:43:18.634097 22796 nvc_info.c:353] missing compat32 library libnvidia-cfg.so
W1005 11:43:18.634102 22796 nvc_info.c:353] missing compat32 library libcuda.so
W1005 11:43:18.634106 22796 nvc_info.c:353] missing compat32 library libnvidia-opencl.so
W1005 11:43:18.634111 22796 nvc_info.c:353] missing compat32 library libnvidia-ptxjitcompiler.so
W1005 11:43:18.634116 22796 nvc_info.c:353] missing compat32 library libnvidia-fatbinaryloader.so
W1005 11:43:18.634120 22796 nvc_info.c:353] missing compat32 library libnvidia-allocator.so
W1005 11:43:18.634125 22796 nvc_info.c:353] missing compat32 library libnvidia-compiler.so
W1005 11:43:18.634129 22796 nvc_info.c:353] missing compat32 library libnvidia-ngx.so
W1005 11:43:18.634134 22796 nvc_info.c:353] missing compat32 library libvdpau_nvidia.so
W1005 11:43:18.634139 22796 nvc_info.c:353] missing compat32 library libnvidia-encode.so
W1005 11:43:18.634143 22796 nvc_info.c:353] missing compat32 library libnvidia-opticalflow.so
W1005 11:43:18.634148 22796 nvc_info.c:353] missing compat32 library libnvcuvid.so
W1005 11:43:18.634152 22796 nvc_info.c:353] missing compat32 library libnvidia-eglcore.so
W1005 11:43:18.634157 22796 nvc_info.c:353] missing compat32 library libnvidia-glcore.so
W1005 11:43:18.634161 22796 nvc_info.c:353] missing compat32 library libnvidia-tls.so
W1005 11:43:18.634166 22796 nvc_info.c:353] missing compat32 library libnvidia-glsi.so
W1005 11:43:18.634171 22796 nvc_info.c:353] missing compat32 library libnvidia-fbc.so
W1005 11:43:18.634175 22796 nvc_info.c:353] missing compat32 library libnvidia-ifr.so
W1005 11:43:18.634180 22796 nvc_info.c:353] missing compat32 library libnvidia-rtcore.so
W1005 11:43:18.634184 22796 nvc_info.c:353] missing compat32 library libnvoptix.so
W1005 11:43:18.634189 22796 nvc_info.c:353] missing compat32 library libGLX_nvidia.so
W1005 11:43:18.634194 22796 nvc_info.c:353] missing compat32 library libEGL_nvidia.so
W1005 11:43:18.634198 22796 nvc_info.c:353] missing compat32 library libGLESv2_nvidia.so
W1005 11:43:18.634203 22796 nvc_info.c:353] missing compat32 library libGLESv1_CM_nvidia.so
W1005 11:43:18.634207 22796 nvc_info.c:353] missing compat32 library libnvidia-glvkspirv.so
W1005 11:43:18.634212 22796 nvc_info.c:353] missing compat32 library libnvidia-cbl.so
I1005 11:43:18.634454 22796 nvc_info.c:275] selecting /usr/lib/nvidia/current/nvidia-smi
I1005 11:43:18.634490 22796 nvc_info.c:275] selecting /usr/lib/nvidia/current/nvidia-debugdump
W1005 11:43:18.634766 22796 nvc_info.c:375] missing binary nvidia-persistenced
W1005 11:43:18.634771 22796 nvc_info.c:375] missing binary nvidia-cuda-mps-control
W1005 11:43:18.634776 22796 nvc_info.c:375] missing binary nvidia-cuda-mps-server
I1005 11:43:18.634804 22796 nvc_info.c:437] listing device /dev/nvidiactl
I1005 11:43:18.634810 22796 nvc_info.c:437] listing device /dev/nvidia-uvm
I1005 11:43:18.634814 22796 nvc_info.c:437] listing device /dev/nvidia-uvm-tools
I1005 11:43:18.634819 22796 nvc_info.c:437] listing device /dev/nvidia-modeset
W1005 11:43:18.634840 22796 nvc_info.c:320] missing ipc /var/run/nvidia-persistenced/socket
W1005 11:43:18.634856 22796 nvc_info.c:320] missing ipc /tmp/nvidia-mps
I1005 11:43:18.634862 22796 nvc_info.c:744] requesting device information with ''
I1005 11:43:18.641629 22796 nvc_info.c:627] listing device /dev/nvidia0 (GPU-a9bfc242-ae6c-9044-93be-a2f791e8608a at 00000000:00:10.0)
I1005 11:43:18.641706 22796 nvc_mount.c:309] mounting tmpfs at /var/lib/docker/overlay2/c80d842980a1891c3414a00dc50c636058948073b88f4367667717cbad4b66ac/merged/proc/driver/nvidia
I1005 11:43:18.641994 22796 nvc_mount.c:77] mounting /usr/lib/nvidia/current/nvidia-smi at /var/lib/docker/overlay2/c80d842980a1891c3414a00dc50c636058948073b88f4367667717cbad4b66ac/merged/usr/bin/nvidia-smi
I1005 11:43:18.642041 22796 nvc_mount.c:77] mounting /usr/lib/nvidia/current/nvidia-debugdump at /var/lib/docker/overlay2/c80d842980a1891c3414a00dc50c636058948073b88f4367667717cbad4b66ac/merged/usr/bin/nvidia-debugdump
I1005 11:43:18.642157 22796 nvc_mount.c:77] mounting /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so.440.100 at /var/lib/docker/overlay2/c80d842980a1891c3414a00dc50c636058948073b88f4367667717cbad4b66ac/merged/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.440.100
I1005 11:43:18.642221 22796 nvc_mount.c:77] mounting /usr/lib/x86_64-linux-gnu/nvidia/current/libcuda.so.440.100 at /var/lib/docker/overlay2/c80d842980a1891c3414a00dc50c636058948073b88f4367667717cbad4b66ac/merged/usr/lib/x86_64-linux-gnu/libcuda.so.440.100
I1005 11:43:18.642261 22796 nvc_mount.c:77] mounting /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ptxjitcompiler.so.440.100 at /var/lib/docker/overlay2/c80d842980a1891c3414a00dc50c636058948073b88f4367667717cbad4b66ac/merged/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.440.100
I1005 11:43:18.642305 22796 nvc_mount.c:77] mounting /usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so.440.100 at /var/lib/docker/overlay2/c80d842980a1891c3414a00dc50c636058948073b88f4367667717cbad4b66ac/merged/usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so.440.100
I1005 11:43:18.642321 22796 nvc_mount.c:489] creating symlink /var/lib/docker/overlay2/c80d842980a1891c3414a00dc50c636058948073b88f4367667717cbad4b66ac/merged/usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1
I1005 11:43:18.642402 22796 nvc_mount.c:77] mounting /var/lib/docker/overlay2/c80d842980a1891c3414a00dc50c636058948073b88f4367667717cbad4b66ac/merged/usr/local/cuda-10.0/compat/libcuda.so.410.129 at /var/lib/docker/overlay2/c80d842980a1891c3414a00dc50c636058948073b88f4367667717cbad4b66ac/merged/usr/lib/x86_64-linux-gnu/libcuda.so.410.129
I1005 11:43:18.642461 22796 nvc_mount.c:77] mounting /var/lib/docker/overlay2/c80d842980a1891c3414a00dc50c636058948073b88f4367667717cbad4b66ac/merged/usr/local/cuda-10.0/compat/libnvidia-fatbinaryloader.so.410.129 at /var/lib/docker/overlay2/c80d842980a1891c3414a00dc50c636058948073b88f4367667717cbad4b66ac/merged/usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so.410.129
I1005 11:43:18.642498 22796 nvc_mount.c:77] mounting /var/lib/docker/overlay2/c80d842980a1891c3414a00dc50c636058948073b88f4367667717cbad4b66ac/merged/usr/local/cuda-10.0/compat/libnvidia-ptxjitcompiler.so.410.129 at /var/lib/docker/overlay2/c80d842980a1891c3414a00dc50c636058948073b88f4367667717cbad4b66ac/merged/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.410.129
I1005 11:43:18.642536 22796 nvc_mount.c:173] mounting /dev/nvidiactl at /var/lib/docker/overlay2/c80d842980a1891c3414a00dc50c636058948073b88f4367667717cbad4b66ac/merged/dev/nvidiactl
I1005 11:43:18.642557 22796 nvc_mount.c:464] whitelisting device node 195:255
I1005 11:43:18.642595 22796 nvc_mount.c:173] mounting /dev/nvidia-uvm at /var/lib/docker/overlay2/c80d842980a1891c3414a00dc50c636058948073b88f4367667717cbad4b66ac/merged/dev/nvidia-uvm
I1005 11:43:18.642615 22796 nvc_mount.c:464] whitelisting device node 243:0
I1005 11:43:18.642647 22796 nvc_mount.c:173] mounting /dev/nvidia-uvm-tools at /var/lib/docker/overlay2/c80d842980a1891c3414a00dc50c636058948073b88f4367667717cbad4b66ac/merged/dev/nvidia-uvm-tools
I1005 11:43:18.642660 22796 nvc_mount.c:464] whitelisting device node 243:1
I1005 11:43:18.642702 22796 nvc_mount.c:173] mounting /dev/nvidia0 at /var/lib/docker/overlay2/c80d842980a1891c3414a00dc50c636058948073b88f4367667717cbad4b66ac/merged/dev/nvidia0
I1005 11:43:18.642764 22796 nvc_mount.c:377] mounting /proc/driver/nvidia/gpus/0000:00:10.0 at /var/lib/docker/overlay2/c80d842980a1891c3414a00dc50c636058948073b88f4367667717cbad4b66ac/merged/proc/driver/nvidia/gpus/0000:00:10.0
I1005 11:43:18.642780 22796 nvc_mount.c:464] whitelisting device node 195:0
I1005 11:43:18.642796 22796 nvc_ldcache.c:359] executing /sbin/ldconfig from host at /var/lib/docker/overlay2/c80d842980a1891c3414a00dc50c636058948073b88f4367667717cbad4b66ac/merged
E1005 11:43:18.643802 1 nvc_ldcache.c:390] could not start /sbin/ldconfig: process execution failed: no such file or directory
I1005 11:43:18.660318 22796 nvc.c:337] shutting down library context
I1005 11:43:18.724208 22802 driver.c:156] terminating driver service
I1005 11:43:18.724588 22796 driver.c:196] driver service terminated successfully

klueska (Contributor) commented Oct 5, 2020

Hmmm. These lines seem odd given that you say you have /sbin/ldconfig on your host:

I1005 11:43:18.098251 22796 nvc.c:256] using root /
...
E1005 11:43:18.643802 1 nvc_ldcache.c:390] could not start /sbin/ldconfig: process execution failed: no such file or directory

What is the output of the following on your host:

ls -la /sbin/ldconfig*

@bingzhangdai

I am also wondering why the log says no such file or directory.

root@omv:~# ls -la /sbin/ldconfig*
-rwxr-xr-x 1 root root 909096 May  2  2019 /sbin/ldconfig

I think the comment above (#1163 (comment)) describes a similar situation. /sbin/ldconfig is present, but I do not know whether the issue is related to /sbin/ldconfig.real, which is not available on my host.

@brycelelbach

This is impacting the NVIDIA CUDA C++ Core Libraries team, on Debian unstable, using the debian10 packages. Can you please re-open this issue?

@brycelelbach

With the config file as shipped, i.e. with @/sbin/ldconfig, I get:

[16:31:10]:wash@voyager:/home/wash/development/nvidia/cuda_linux_p4/sw/gpgpu/thrust/ci:0:$ sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.

Changing it from @/sbin/ldconfig to /sbin/ldconfig gives me a different error:

[16:30:09]:wash@voyager:/home/wash/development/nvidia/cuda_linux_p4/sw/gpgpu/thrust/ci:0:$ sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: ldcache error: process /sbin/ldconfig failed with error code: 127\\\\n\\\"\"": unknown.

@brycelelbach

I also do not get a /var/log/nvidia-container-runtime.log file.


lvh commented Oct 18, 2020

(I can confirm I'm getting the same behavior as @brycelelbach.)

@MingyaoLiu

For me, the nvidia docker image with CUDA 11.0 shows the same behaviour @brycelelbach described. However, when I tried the 10.2 base image, it works just fine.


bingzhangdai commented Nov 15, 2020

For me, the nvidia docker with cuda 11.0 has the same behaviour as @brycelelbach . However I tried 10.2 base docker, it works just fine.

@MingyaoLiu It is mainly because changing @/sbin/ldconfig to /sbin/ldconfig results in calling the ldconfig inside the container (#1163 (comment)). Maybe the 11.0 base image does not contain ldconfig, whereas 10.2 does. I think it is still nvidia-docker2's problem.
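That hypothesis is easy to check directly, since no GPU access is needed just to inspect the image filesystem. A minimal sketch (assuming both tags are still pullable):

docker run --rm nvidia/cuda:11.0-base sh -c 'ls -la /sbin/ldconfig* || echo "no ldconfig in image"'
docker run --rm nvidia/cuda:10.2-base sh -c 'ls -la /sbin/ldconfig* || echo "no ldconfig in image"'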

My workaround is putting ldconfig inside the docker image and changing @/sbin/ldconfig to /sbin/ldconfig.

I am wondering if the nvidia-docker team has a plan to solve this problem. I would suggest someone reopen this issue, as so many developers have encountered it.

klueska (Contributor) commented Nov 15, 2020

This is a duplicate of: #1399
(please add your concerns over there).

That said, given that it works with 10.2 but not 11.0, I'm starting to think it may actually be related to:
NVIDIA/libnvidia-container#117 (comment)

I will test this out in the next few days.

@dikkepanda

This is a duplicate of: #1399
(please add your concerns over there).

That said, given that it works with 10.2 but not 11.0, I'm starting to think it may actually be related to:
NVIDIA/libnvidia-container#117 (comment)

I will test this out in the next few days.

@klueska on my Debian it still only works with version 10.2. The ldconfig command gives no errors, and I have also updated libnvidia-container to v1.3.1, but this didn't solve the problem.

@stiv-yakovenko

I have the same problem after installing CUDA 11.2 from the NVIDIA website. Any solutions?


uvr-jra commented Jan 19, 2021

Hi,

I'm facing the same issue on my Debian 10 with nvidia/cuda:11.x-base.

Is there any news concerning this issue?

Thank you for your help


suddenlyfleck commented Apr 22, 2021

Did you try changing "@/sbin/ldconfig" to "/sbin/ldconfig" in /etc/nvidia-container-runtime/config.toml as suggested by lyon667 above?

worked for me on debian bullseye

@bingzhangdai

Did you try changing "@/sbin/ldconfig" to "/sbin/ldconfig" in /etc/nvidia-container-runtime/config.toml as suggested by lyon667 above?

worked for me on debian bullseye

Please refer to this comment: #1163 (comment). We are hoping for a better solution.

@ErenChan

Hello and sorry for the delay!

Executing processes in the container is a pretty dangerous operation and by default we use the host ldconfig (@/sbin/ldconfig.real), if it's not present there isn't any fallback to the container ldconfig unless you manually specify it (by replacing it to /sbin/ldconfig.real in the /etc/nvidia-container-runtime/config.toml file).

Hope this helps you!

Hi, I don't have sudo permission; is there any way to make it work?

@f1yankees

This issue popped up for me after upgrading from Debian 10 to Debian 11 and using the new nvidia-container-toolkit release. Dropping the "@" in the config file on Debian 11, as suggested, fixed it for me.

@idrisswill

Did you try changing "@/sbin/ldconfig" to "/sbin/ldconfig" in /etc/nvidia-container-runtime/config.toml as suggested by lyon667 above?

worked for me on debian bullseye

Worked for me on Debian 9.

@JonsenDong

export LD_LIBRARY_PATH=/usr/local/nvidia/lib64/
export LD_PRELOAD=/usr/local/nvidia/lib64/libnvidia-ml.so

klueska (Contributor) commented Mar 22, 2022

The newest version of nvidia-docker should resolve these issues with ldconfig not properly setting up the library search path on Debian systems before a container gets launched.

Specifically this change in libnvidia-container fixes the issue and is included as part of the latest release:
https://gitlab.com/nvidia/container-toolkit/libnvidia-container/-/merge_requests/141

The latest release packages for the full nvidia-docker stack:

libnvidia-container1-1.9.0
libnvidia-container-tools-1.9.0
nvidia-container-toolkit-1.9.0
nvidia-container-runtime-3.9.0
nvidia-docker-2.10.0
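On an apt-based system with the NVIDIA repository already configured, upgrading to these versions might look roughly like this (a hedged sketch, not an official procedure; the package names are the ones listed above):

sudo apt-get update
sudo apt-get install --only-upgrade libnvidia-container1 libnvidia-container-tools nvidia-container-toolkit nvidia-container-runtime nvidia-docker2
sudo systemctl restart docker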

@wajeehulhassanvii

Hello and sorry for the delay!

Executing processes in the container is a pretty dangerous operation and by default we use the host ldconfig (@/sbin/ldconfig.real), if it's not present there isn't any fallback to the container ldconfig unless you manually specify it (by replacing it to /sbin/ldconfig.real in the /etc/nvidia-container-runtime/config.toml file).

Hope this helps you!

This worked for me.

I am now getting this message.
(screenshot attached to the original comment, not reproduced here)


Forsworns commented May 5, 2023

Revising the runtime config file does not help for me; maybe I am using an old version. Instead, I use the following script to manually recreate the symlinks and replace the missing/wrong ones.

You can look in /usr/lib64 (on CentOS) or /usr/lib/x86_64-linux-gnu (on Debian) to find the correct driver version. For me, it's 465.19.01.

# Recreate the *.so.1 symlinks for the NVIDIA driver libraries
# (adjust the version suffix to match your driver version).
for file in $(find / -type f -name "*.so.465.19.01");
do
    prefix=$(expr match "$file" '\(.*\)\.so\.*')   # path without the ".so.<version>" suffix
    newlink="${prefix}.so.1"
    echo "Creating soft link $newlink ..."
    ln -sf "$file" "$newlink"
done

Or simply running ldconfig inside the container helps.
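If you need to discover the driver version to plug into the find pattern above, a small sketch (run it on the host, where nvidia-smi works):

driver_ver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)
echo "$driver_ver"   # e.g. 465.19.01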

@Davidrjx

Revise the runtime config file does not help for me. Maybe I used the old version. But I use the following script to recreate the linked files to replace the missing/wrong ones manually.

You may visit the /usr/lib64 (for centos), or /usr/lib/x86_64-linux-gnu (for debian), to find the correct version. For me, it's 465.19.01.

for file in `find / -type f -name "*.so.465.19.01"`;
do
    prefix=$(expr match "$file" '\(.*\)\.so\.*')
    newlink=${prefix}.so.1
    echo "Creating soft link $newlink ..."
    ln -sf $file $newlink
done

Or a simple ldconfig in container helps

It seems this is not general: in my case, running ldconfig inside the container does not help on Debian 10 (buster).
