Couldn't find libnvidia-ml.so library in your system #854
Do I have to install any other libraries apart from the nvidia drivers on the host machine?
Bumping up. I am having exactly the same problem, also on Ubuntu 18.10 and driver version 390.87.
I have similar symptoms, but I can run:
@symmsaur Can you import TensorFlow in the container after ldconfig?
Mmm, given the symptoms, you are probably stumbling into the issue that was fixed by this commit:
Thanks for the information. I will wait for the next release, or compile libnvidia-container myself if I can't resolve it by then. Running ldconfig manually helped, though! Many thanks to @symmsaur. Best
The same for me: docker run --runtime=nvidia --rm nvidia/cuda:9.0-base ldconfig && nvidia-smi works, but without ldconfig it fails with the same error. Running ldconfig inside the container fixes every failure to resolve .so libraries (which actually resolve fine on the host system), so TensorFlow (image nvcr.io/nvidia/tensorflow:18.09-py3) imports and runs fine after that. I can confirm that the above-mentioned commit fixes the problem: I recompiled the latest master branch and replaced the library in my system path, and now the NGC TensorFlow image works out of the box on Ubuntu 18.10 with nvidia driver 415.25.
@ddurnev you probably ran nvidia-smi on the host (compare the Processes list inside and outside the container).
@lccro Yes, you're right: that runs only the first part (ldconfig) inside the container; the correct form is something like the sketch below.
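A minimal sketch, assuming the same nvidia/cuda:9.0-base image used above; the key point is that both commands must run inside one container shell:

```bash
# A bare `docker run ... ldconfig && nvidia-smi` is split at `&&` by the
# host shell, so only ldconfig runs in the container and nvidia-smi runs
# on the host. Wrapping both in one shell runs them both in the container.
docker run --runtime=nvidia --rm nvidia/cuda:9.0-base \
  bash -c "ldconfig && nvidia-smi"
```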
Still, running nvidia-smi without ldconfig only works after the patch for libnvidia-container is applied.
I have the same problem on Fedora 29, with the nvidia 415 driver and nvidia-docker 2.0.3.
But on the host it works well:
Additional information about the nvidia card:
More information about nvidia-docker:
I have followed this guide for the nvidia driver installation process.
@botalaszlo I have the same problem on Fedora 29 after running the same steps. Does this work for you?
@andyneff Perfect! This works fine.
Maybe the documentation should be updated with your note :)
@botalaszlo It's not a documentation bug; you shouldn't have to run ldconfig manually.
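For reference, a minimal check that the runtime is healthy without the workaround, assuming the nvidia/cuda:9.0-base image used elsewhere in this thread:

```bash
# If the runtime is working correctly, this should print the GPU table
# without any manual ldconfig step inside the container.
docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi
```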
I just found out today the hard way that this bug affects more than just nvidia stuff.
This breaks anything in the container that relies on the linker cache. @flx42 Any idea when the next release will be?
This should be fixed with the latest version of the libnvidia-container packages.
Tested on Fedora 29 with the updated packages.
After the update, confirmed fixed! Thanks @RenaudWasTaken
I'm still experiencing this bug; running ldconfig makes nvidia-smi work.
How can I make it work without running ldconfig first?
Just supplying more (possibly useless) info. Still working on Fedora:
Test:
@edoardogiacomello What is your current version of ld?
On the host I got: GNU ld (GNU Binutils for Ubuntu) 2.30
Yep, same issue here with the latest version:
So yeah, still broken:
Yes -- I'm also stumbling over this bug on Debian testing:
vs.
the
Hi "nvidia", |
Works for me on Ubuntu 18.04 and Debian 10. Here's a run from scratch on Debian 10 (it looks the same for me on Ubuntu as well), without ever manually running ldconfig. I'm using nvidia-container-toolkit and have removed the old nvidia-docker2:
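A sketch of that from-scratch sequence, assuming an apt-based system with the NVIDIA container repositories already configured; the repo setup itself is not shown here:

```bash
# Replace the old nvidia-docker2 wrapper with nvidia-container-toolkit,
# then verify GPU access with docker's native --gpus flag (docker >= 19.03).
sudo apt-get remove -y nvidia-docker2
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
docker run --gpus all --rm nvidia/cuda:9.0-base nvidia-smi
```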
I recommend uninstalling and re-installing the driver and packages. It's possible your host system is in a strange state and that is impacting something in your setup. @glennie
@glennie @mash-graz @Brainiarc7 do any of you get the same result if you use nvjmayo's exact same sha?
Your command line produces this error message on my machine:
Using the
and manually adding
But I should perhaps mention that I do not use the nvidia drivers on this machine for the actual video output. I prefer to utilize the onboard intel chip for this purpose, because otherwise I'm not able to share the graphics card via PCIe passthrough with qemu-kvm instances, and I mostly need the nvidia card only for CUDA-based GPGPU stuff. The setup could therefore differ slightly from other installations.
Hello,
Maybe I'm missing something here... Why are you (@nvjmayo) using the --runtime option? I used --gpus all (as I've got docker 19.03.2). Using the sha256 specified by @andyneff with
But it works when I use ldconfig before:
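For comparison, a sketch of the two invocations under docker 19.03's native GPU support; the image tag is the one used elsewhere in this thread, not the exact sha256 from the earlier comment:

```bash
# Fails on an affected system: the container's linker cache is stale.
docker run --gpus all --rm nvidia/cuda:9.0-base nvidia-smi

# Works: refresh the linker cache inside the container first.
docker run --gpus all --rm nvidia/cuda:9.0-base \
  bash -c "ldconfig && nvidia-smi"
```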
My mistake, I have multiple runtimes installed for a bunch of different environments (both for docker and podman). I should have pasted the canonical form. Sorry for the confusion.
I'll ask the team to bump up the priority on fixing this. It's a question of at what stage to run the container hooks. Automatically running ldconfig when needed is something we're looking into. When to do it, what mechanism to use, and whether we should stop a running container are all open questions for an implementation. The best way to work around the issue right now is to run ldconfig in the container whenever you upgrade your host driver. Admittedly inconvenient.
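A sketch of that workaround for a container that is already running; the container name here is a placeholder:

```bash
# After upgrading the host driver, refresh the linker cache inside the
# still-running container so the newly mounted driver libraries resolve.
docker exec my-gpu-container ldconfig   # "my-gpu-container" is hypothetical
```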
Hello! Can you give us a bit more information?
Thanks!
I hope that helps! BTW, I'm using
Can you try replacing "@/sbin/ldconfig" with "/sbin/ldconfig" in /etc/nvidia-container-runtime/config.toml?
Is the reason for, or the actual meaning of, this "@"-syntax used in https://gitlab.com/nvidia/container-toolkit/toolkit/blob/master/config/config.toml.debian documented or explained anywhere?
Thank you, works for me. I am using Debian Testing (same as @mash-graz).
Thank you @lyon667, it worked for me as well after wasting many hours of my time. Why does this work, @RenaudWasTaken?
I did not find any documentation, but it seems to be processed here in the libnvidia-container sources. As far as I can tell, the leading "@" tells the runtime to resolve the ldconfig path on the host rather than inside the container, so removing it makes the container's own ldconfig run instead.
This solved it for me.
Yes! -- this manual removal of the "@" prefix works for me too. I also don't understand why this particular issue still isn't fixed in the released nvidia-docker packages and still affects Debian installations.
Had the same problem on a fresh Debian 10 install (openmediavault 5.6.2). This solved the problem for me too. Do we have a patch here, or is this on the Debian side?
My error was solved this way. This led me to find another solution by looking into the /etc/nvidia-container-runtime/config.toml file, where ldconfig is set by default to "@/sbin/ldconfig". For some reason this does not work and also produces the error above:
Changing the ldconfig path to "/sbin/ldconfig" (instead of "@/sbin/ldconfig") does indeed fix the problem:
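A one-liner sketch of that edit; the key and path are as quoted above, and the sed expression assumes the default file layout:

```bash
# Drop the "@" host-path prefix so the container's own ldconfig is used.
sudo sed -i 's|ldconfig = "@/sbin/ldconfig"|ldconfig = "/sbin/ldconfig"|' \
  /etc/nvidia-container-runtime/config.toml
```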
The newest version should address this. Specifically, this change is included in the latest release packages for the full stack.
If the above works, then follow this article: set no-cgroups = false and ldconfig = "/sbin/ldconfig" in /etc/nvidia-container-runtime/config.toml; hopefully that will solve the problem. Worked for me.
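A quick way to confirm both settings took effect; the expected output lines are shown as comments:

```bash
# Both keys live in the runtime config discussed above.
grep -E '^(no-cgroups|ldconfig)' /etc/nvidia-container-runtime/config.toml
# no-cgroups = false
# ldconfig = "/sbin/ldconfig"
```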
After HOURS of wasted time I found out the problem was Docker being installed from
1. Issue or feature description
Missing libnvidia-ml.so and libcublas.so.9 libraries in the docker container.
My system is Ubuntu 18.10 and I tried with nvidia drivers 390, 396 and 410.
2. Steps to reproduce the issue
This also holds for the tensorflow docker images. When I run the cuda image in interactive mode and try to import tensorflow via python, it says that libcublas.so.9 is not found, although I can see it in the /usr/local/cuda/lib64 directory.
Everything works fine on the host machine, though.
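A sketch of the repro as described; the original snippet was not preserved, so the exact commands are assumptions based on the images named in this thread:

```bash
# Plain nvidia-smi in the CUDA base image triggers the reported error.
docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi

# Importing TensorFlow in a GPU image fails to resolve libcublas.
docker run --runtime=nvidia --rm -it nvcr.io/nvidia/tensorflow:18.09-py3 \
  python -c "import tensorflow"
```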
3. Information to attach (optional if deemed irrelevant)
- `uname -a`
- `dmesg`
- `nvidia-smi -a`
- `docker version`
- `dpkg -l '*nvidia*'` or `rpm -qa '*nvidia*'`
- `nvidia-container-cli -V`
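A small convenience sketch that gathers the checklist above into one log file; the filename is arbitrary:

```bash
# Collect the requested diagnostics; works on both dpkg- and rpm-based hosts.
{
  uname -a
  dmesg
  nvidia-smi -a
  docker version
  dpkg -l '*nvidia*' 2>/dev/null || rpm -qa '*nvidia*'
  nvidia-container-cli -V
} > nvidia-docker-diag.log 2>&1
```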