NVIDIA-SMI couldn't find libnvidia-ml.so library in your system #1163
Comments
I'm hitting this as well on a very similar setup, i.e. Debian 10 Buster with kernel 5.3.9 from backports and identical versions of the nvidia-container* packages, but a different NVIDIA driver version, 430.64. This issue also seems to be a clone of #854, which was however closed without being resolved. The error actually seems to stem from a missing library symlink inside the container, and it only shows up in the runtime's debug log. That led me to another solution: looking into /etc/nvidia-container-runtime/config.toml, changing the ldconfig entry from "@/sbin/ldconfig" to "/sbin/ldconfig" (i.e. dropping the leading "@") fixed the problem for me. I am however pretty sure that the default has been working for me before with NVIDIA driver version 418.74, but I cannot confirm the driver version is the cause of the problem here.
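For reference, a minimal sketch of that edit, assuming the stock config location (back up the file first):

# Drop the leading '@' from the ldconfig entry in the runtime config.
sudo sed -i.bak 's|ldconfig = "@/sbin/ldconfig"|ldconfig = "/sbin/ldconfig"|' \
    /etc/nvidia-container-runtime/config.toml
# Verify the change took effect:
grep ldconfig /etc/nvidia-container-runtime/config.toml |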
That did indeed fix the problem for me. Thanks for the help! |
@markj24 I would leave this bug open until someone figures out why the defaults don't work. |
Hello and sorry for the delay! Executing processes in the container is a pretty dangerous operation, so by default we use the host ldconfig (that is what the leading "@" in the config entry means: the path is resolved on the host rather than inside the container). Hope this helps you!
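For illustration, the two forms the entry can take (the exact file contents may differ between package versions):

# In /etc/nvidia-container-runtime/config.toml:
#   ldconfig = "@/sbin/ldconfig"   # leading '@': run the host's ldconfig
#   ldconfig = "/sbin/ldconfig"    # no '@': run ldconfig from inside the container |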
Hi @RenaudWasTaken and thank you for your response. Does it mean that the host ldconfig is executed inside the container? On Debian there's only /sbin/ldconfig. Cheers! |
Yep that's probably what is happening, I'll take a deeper look later this week. |
Sorry it took me so long to get back to you, but by default we should select the correct ldconfig on Debian: https://github.com/NVIDIA/container-toolkit/blob/master/config/config.toml.debian You can also see that by extracting the release tarball.
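For example (the tarball name is illustrative; substitute the actual release artifact):

# Hypothetical release tarball; use the real file name from the release page.
tar -xzf nvidia-container-toolkit.tar.gz
grep -r ldconfig config/    # shows the per-distro defaults, e.g. config.toml.debian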
Feel free to reopen if you encounter this on another machine! |
@lyon667 |
I have encountered the same problem.
Does that mean the ldconfig path will change on Debian in a later release? |
@bingzhangdai Note that not using the "@" prefix means ldconfig is executed from inside the container, which, as noted above, is a dangerous operation. |
@klueska Oh, thanks for pointing that out. On my host (Debian 10), /sbin/ldconfig does exist, yet the default setting still fails for me.
According to the discussion above, I understand it is not good to execute ldconfig from inside the container. |
Can you flip on the setting for debug logging in /etc/nvidia-container-runtime/config.toml, run the container again, and post the output of the log files?
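Something like this should do it (assuming the debug entries ship commented out with the default log path; adjust if your config.toml differs):

# Uncomment the debug log entries in the runtime config.
sudo sed -i 's|^#debug|debug|' /etc/nvidia-container-runtime/config.toml
docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi || true
sudo cat /var/log/nvidia-container-toolkit.log |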
I enabled all the logs in the config file, but the output does not make the cause obvious to me. |
Hmmm. This line seems odd if you say you have an ldconfig on your host.
What is the output of the following on your host?
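For example (a guess at the kind of check that would tell us; adjust as needed):

# Inspect what ldconfig actually is on the host (real binary vs. wrapper script).
ls -la /sbin/ldconfig*
file /sbin/ldconfig |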
I am also wondering why the log says "no such file or directory".
I think the comment above (#1163 (comment)) describes a similar situation. |
This is impacting the NVIDIA CUDA C++ Core Libraries team, on Debian unstable, using the nvidia/cuda images. |
With the config file as shipped, i.e. with ldconfig = "@/sbin/ldconfig", nvidia-smi inside the container cannot find libnvidia-ml.so.
Changing it from "@/sbin/ldconfig" to "/sbin/ldconfig" fixes it. |
I also do not get |
(I can confirm I'm getting the same behavior as @brycelelbach.) |
For me, nvidia-docker with the CUDA 11.0 image shows the same behaviour @brycelelbach describes. However, when I tried the 10.2 base image, it works just fine. |
@MingyaoLiu It is mainly because changing from "@/sbin/ldconfig" to "/sbin/ldconfig" makes the runtime execute ldconfig from inside the container rather than from the host. My workaround is putting ldconfig inside the docker image and changing the config entry as described above. I am wondering if the nvidia-docker team has a plan to solve this problem. I would suggest someone reopen this issue, as so many developers have encountered it.
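One way to approximate that workaround without rebuilding the image is to run ldconfig at container start (the image tag is illustrative):

# Rebuild the loader cache inside the container before the application runs.
docker run --runtime=nvidia --rm nvidia/cuda:11.0-base \
    bash -c 'ldconfig && nvidia-smi' |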
This is a duplicate of #1399. That said, given that it works with 10.2 but not 11.0, I'm starting to think it may actually be related to something else. I will test this out in the next few days. |
@klueska On my Debian it still only works with version 10.2. |
I have the same problem after installing CUDA 11.2 from the NVIDIA website. Any solutions? |
Hi, I'm facing the same issue on my Debian 10 with nvidia/cuda:11.x-base. Is there any news concerning this issue? Thank you for your help! |
Did you try changing ldconfig = "@/sbin/ldconfig" to ldconfig = "/sbin/ldconfig" in /etc/nvidia-container-runtime/config.toml? Worked for me on Debian bullseye. |
Please refer to this comment: #1163 (comment). We are still hoping for a better solution. |
Hi, I don't have sudo permission; is there any way to make it work? |
This issue popped up for me after upgrading from Debian 10 to Debian 11 and using the new nvidia-container-toolkit release. Dropping the "@" in the config file, as suggested above, fixed it for me on Debian 11. |
The following worked for me on Debian 9:
export LD_LIBRARY_PATH=/usr/local/nvidia/lib64/ |
The newest version of the container toolkit stack should address this; specifically, there is a change in how ldconfig is invoked. The latest release packages for the full stack are now available.
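If the fix has landed in your distro's packages, upgrading the stack should suffice (assumes the NVIDIA apt repository is already configured; package names are those from the report below):

sudo apt-get update
sudo apt-get install --only-upgrade \
    libnvidia-container1 libnvidia-container-tools nvidia-container-toolkit
sudo systemctl restart docker |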
This worked for me. |
Revising the runtime config file does not help for me; maybe I used an old version. But I use the following script to recreate the symlinks manually, replacing the missing/wrong ones (the driver version in the pattern, here 465.19.01, must match the one installed on your system):

# Recreate the .so.1 symlink for every driver library found on the system.
for file in $(find / -type f -name "*.so.465.19.01"); do
    prefix=$(expr match "$file" '\(.*\)\.so\.*')
    newlink="${prefix}.so.1"
    echo "Creating soft link $newlink ..."
    ln -sf "$file" "$newlink"
done

Or simply running ldconfig may do the job.
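Either way, a quick check that the loader can now resolve the library:

# The library should show up in the cache once the links are fixed.
ldconfig -p | grep -i libnvidia-ml |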
This does not seem to be general, though: in my case, on Debian 10 (buster), running ldconfig within the container does not help. |
1. Issue or feature description
I receive the error "NVIDIA-SMI couldn't find libnvidia-ml.so library in your system" when running nvidia-smi within a container. I'm sure the driver is installed correctly, as I get the correct output from nvidia-smi when run on the host. Running ldconfig within the container corrects this temporarily, until the container is updated.
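For reference, the temporary fix amounts to something like this (the container name is illustrative):

# Re-run ldconfig inside a running container; the effect is lost when the
# container is recreated.
docker exec my-cuda-container ldconfig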
2. Steps to reproduce the issue
docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.
3. Information to attach (optional if deemed irrelevant)
nvidia-container-cli -k -d /dev/tty info
uname -a
Linux openmediavault.local 5.3.0-0.bpo.2-amd64 #1 SMP Debian 5.3.9-2~bpo10+1 (2019-11-13) x86_64 GNU/Linux
dmesg
nvidia-smi -a
==============NVSMI LOG==============
Timestamp : Mon Dec 23 17:11:55 2019
Driver Version : 440.44
CUDA Version : 10.2
Attached GPUs : 1
GPU 00000000:83:00.0
Product Name : Quadro P2000
Product Brand : Quadro
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1422019086300
GPU UUID : GPU-67caad7d-2744-4ec8-7a48-e17278af1025
Minor Number : 0
VBIOS Version : 86.06.74.00.01
MultiGPU Board : No
Board ID : 0x8300
GPU Part Number : 900-5G410-1700-000
Inforom Version
Image Version : G410.0502.00.02
OEM Object : 1.1
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x83
Device : 0x00
Domain : 0x0000
Device Id : 0x1C3010DE
Bus Id : 00000000:83:00.0
Sub System Id : 0x11B310DE
GPU Link Info
PCIe Generation
Max : 3
Current : 3
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : 64 %
Performance State : P0
Clocks Throttle Reasons
Idle : Not Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 5059 MiB
Used : 0 MiB
Free : 5059 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 2 MiB
Free : 254 MiB
Compute Mode : Default
Utilization
Gpu : 2 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Aggregate
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Temperature
GPU Current Temp : 35 C
GPU Shutdown Temp : 104 C
GPU Slowdown Temp : 101 C
GPU Max Operating Temp : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 17.71 W
Power Limit : 75.00 W
Default Power Limit : 75.00 W
Enforced Power Limit : 75.00 W
Min Power Limit : 75.00 W
Max Power Limit : 75.00 W
Clocks
Graphics : 1075 MHz
SM : 1075 MHz
Memory : 3499 MHz
Video : 999 MHz
Applications Clocks
Graphics : 1075 MHz
Memory : 3504 MHz
Default Applications Clocks
Graphics : 1075 MHz
Memory : 3504 MHz
Max Clocks
Graphics : 1721 MHz
SM : 1721 MHz
Memory : 3504 MHz
Video : 1556 MHz
Max Customer Boost Clocks
Graphics : 1721 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes : None
docker version
Client: Docker Engine - Community
Version: 19.03.5
API version: 1.40
Go version: go1.12.12
Git commit: 633a0ea838
Built: Wed Nov 13 07:25:38 2019
OS/Arch: linux/amd64
Experimental: false
Server: Docker Engine - Community
Engine:
Version: 19.03.5
API version: 1.40 (minimum version 1.12)
Go version: go1.12.12
Git commit: 633a0ea838
Built: Wed Nov 13 07:24:09 2019
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.2.10
GitCommit: b34a5c8af56e510852c35414db4c1f4fa6172339
runc:
Version: 1.0.0-rc8+dev
GitCommit: 3e425f80a8c931f88e6d94a8c831b9d5aa481657
docker-init:
Version: 0.18.0
GitCommit: fec3683
dpkg -l '*nvidia*'
or rpm -qa '*nvidia*'
||/ Name Version Architecture Description
+++-=============================-============-============-=====================================================
ii libnvidia-container-tools 1.0.5-1 amd64 NVIDIA container runtime library (command-line tools)
ii libnvidia-container1:amd64 1.0.5-1 amd64 NVIDIA container runtime library
ii nvidia-container-runtime 3.1.4-1 amd64 NVIDIA container runtime
un nvidia-container-runtime-hook (no description available)
ii nvidia-container-toolkit 1.0.5-1 amd64 NVIDIA container runtime hook
nvidia-container-cli -V
version: 1.0.5
build date: 2019-09-06T16:59+00:00
build revision: 13b836390888f7b7c7dca115d16d7e28ab15a836
build compiler: x86_64-linux-gnu-gcc-8 8.3.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi