Cannot set up cuda-compat driver when there's alternatives symlink #117
Comments
Another strange behavior is that the symlink only works with a relative path.
But
What you are trying to do is not possible. There is no way to "install" the driver into the container because it is a kernel component. The driver from the host must be injected into the container. |
@klueska |
This is an officially supported usage: https://docs.nvidia.com/deploy/cuda-compatibility/index.html#cuda-compatibility-platform
Since this seems to be an issue with something inside the container image / the packages being installed inside the container image (and not actually https://forums.developer.nvidia.com/c/accelerated-computing/cuda/cuda-setup-and-installation/8 |
Image is just a collection of files, and if I understand it correctly, libnvidia-container is the one loading
But only the first one works. |
I think there is some confusion.
The set of driver libraries it bind mounts are all of the form As far as I understand it, you have both CUDA 10.2 and the nvidia 440 driver installed on the host. The CUDA installation on the host is actually irrelevant, as With the container image you are using (i.e.
This happens, because the container image sets the environment variable for If you install something inside the image that "overrides" these bind-mounted libraries from the host, then that is something completely out of From what I understand, you are trying to install In theory, this should be possible, so long as you always run the container on a host with a 440+ driver installed (so you get the prerequisite 440 driver libraries bind-mounted into your container) and you ensure that the compat libraries "override" the ones injected from the 440 driver. However, it is not up to |
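One way to check whether the compat libraries really "override" the injected ones is to look at the order of entries in the dynamic loader cache inside the container. The sketch below parses text in the format printed by `ldconfig -p`; the function name `parse_ldconfig_cache` is hypothetical, and the sample text and both paths in it are illustrative assumptions, not output captured from this issue.

```python
import re

# One cache entry in `ldconfig -p` output looks like:
#   <tab>libfoo.so.1 (libc6,x86-64) => /path/to/libfoo.so.1
_ENTRY = re.compile(r"^\s*(\S+)\s+\(([^)]*)\)\s+=>\s+(\S+)\s*$")


def parse_ldconfig_cache(text):
    """Map each soname to the list of paths, in the order the cache lists them."""
    entries = {}
    for line in text.splitlines():
        match = _ENTRY.match(line)
        if match:
            soname, _flags, path = match.groups()
            entries.setdefault(soname, []).append(path)
    return entries


# Illustrative sample only (assumed paths), not taken from the reporter's container.
SAMPLE = """\
2 libs found in cache `/etc/ld.so.cache'
\tlibcuda.so.1 (libc6,x86-64) => /usr/local/cuda-11.1/compat/libcuda.so.1
\tlibcuda.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcuda.so.1
"""

cache = parse_ldconfig_cache(SAMPLE)
# The first listed path is the candidate to inspect when nvidia-smi
# reports an unexpected CUDA version.
first_libcuda = cache["libcuda.so.1"][0]
print(first_libcuda)
```

In a real container the text would come from running `ldconfig -p` there; if the compat path is not listed ahead of the injected driver path, the override has probably not taken effect.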
Now, if you were to install |
So if I understand correctly, the workflow is:
Then there's probably something wrong in the last step, which causes only a relative-path symlink to work.
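The relative-versus-absolute difference can be reproduced outside of any container tooling. The sketch below is an assumption-laden illustration, not the code path of libnvidia-container: it builds a throwaway directory standing in for a container rootfs, creates one relative and one absolute symlink (the names `cuda-rel` and `cuda-abs` and the `cuda-11.1` layout are made up for the demo), and resolves both from the host side.

```python
import os
import tempfile


def host_side_realpath(rootfs, container_path):
    """Naive resolution: join the container path onto the rootfs and let the
    host follow symlinks. Absolute link targets are then interpreted against
    the host's root, not the container's."""
    return os.path.realpath(os.path.join(rootfs, container_path.lstrip("/")))


# Throwaway directory standing in for a container root filesystem.
rootfs = os.path.realpath(tempfile.mkdtemp(prefix="rootfs-"))
os.makedirs(os.path.join(rootfs, "usr/local/cuda-11.1/compat"))

# Relative target: stays inside the rootfs no matter who resolves it.
os.symlink("cuda-11.1", os.path.join(rootfs, "usr/local/cuda-rel"))
# Absolute target: only meaningful when "/" is the container root.
os.symlink("/usr/local/cuda-11.1", os.path.join(rootfs, "usr/local/cuda-abs"))

rel_result = host_side_realpath(rootfs, "/usr/local/cuda-rel/compat")
abs_result = host_side_realpath(rootfs, "/usr/local/cuda-abs/compat")

print(rel_result.startswith(rootfs + os.sep))  # True: still inside the rootfs
print(abs_result.startswith(rootfs + os.sep))  # False: escaped to the host's /usr/local
```

Any tool that inspects the container filesystem from the host without re-rooting absolute link targets would see the second behavior.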
As far as |
Thanks a lot for the detailed explanation. Besides, could you clarify who's running ldconfig?
Who's the "it"? |
Oh right, yes, And I actually misspoke before. I apologise.
I had forgotten about this detail, and apologise again for the confusion. What this means, however, is that the issue you are reporting here is, in fact a bug in
However, The relevant code is here: This ultimately results in You can see this in your (broken) example with:
vs. the working example with:
Also, manually running
I don't have a good fix for this off of the top of my head, but at least we've gotten to the root of the problem, and identified that it is in fact an issue with Thanks for pushing me to explain this in detail, or I never would have gotten to the root cause. We will work on a fix for this in the next release. |
Great, that matches my observations. |
No, there appears to be a safeguard on |
Oh nice, you already have a resolver! libnvidia-container/src/utils.c Line 830 in 16315eb
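For readers unfamiliar with what such a resolver does, here is a minimal sketch of rootfs-aware path resolution. It is not the C implementation in `src/utils.c`; the function name `resolve_in_root`, the `max_links` limit, and the demo paths are assumptions made for illustration. The key step is that an absolute symlink target restarts resolution at the container root instead of the host root.

```python
import os
import tempfile


def resolve_in_root(rootfs, path, max_links=40):
    """Resolve `path` the way a process rooted at `rootfs` would see it and
    return the resulting container-side absolute path."""
    pending = [p for p in path.split("/") if p]
    resolved = []
    links_followed = 0
    while pending:
        part = pending.pop(0)
        if part == ".":
            continue
        if part == "..":
            if resolved:
                resolved.pop()  # ".." at the root stays at the root
            continue
        candidate = os.path.join(rootfs, *resolved, part)
        if os.path.islink(candidate):
            links_followed += 1
            if links_followed > max_links:
                raise OSError("too many levels of symbolic links")
            target = os.readlink(candidate)
            if target.startswith("/"):
                resolved = []  # absolute target: restart at the container root
            pending = [p for p in target.split("/") if p] + pending
        else:
            resolved.append(part)
    return "/" + "/".join(resolved)


# Demo rootfs with an absolute symlink (assumed layout, for illustration only).
demo_root = tempfile.mkdtemp(prefix="rootfs-")
os.makedirs(os.path.join(demo_root, "usr/local/cuda-11.1/compat"))
os.symlink("/usr/local/cuda-11.1", os.path.join(demo_root, "usr/local/cuda"))

resolved_path = resolve_in_root(demo_root, "/usr/local/cuda/compat")
print(resolved_path)  # /usr/local/cuda-11.1/compat
```

The returned path is relative to the container root, so the caller would prepend the rootfs again before touching the file from the host.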
The solution is actually this:
This has now been fixed and will be included in the 1.3.1 release of |
Hi @klueska, this still seems unresolved.
And here is the nvidia-toolkit.log output from the host:
@Davidrjx could you check whether the If the |
I searched the container run from the image built on WSL and found a few files with a .so.1 suffix that look odd, such as:
The host has CUDA 10.2 and the 440 driver installed, and I'm trying to use CUDA 11.1 with the 455 driver in Docker.
Surprisingly, I found that when cuda-toolkit-11-1 is installed, the cuda-compat-11-1 driver cannot be loaded and nvidia-smi shows 10.2 instead of 11.1.
Even with all dependencies of cuda-toolkit-11-1 installed, nvidia-smi still shows 11.1, and I found that the bug is caused by update-alternatives in the post-install script of cuda-toolkit-11-1.
Here's a Dockerfile repro that mimics the update-alternatives behavior with ln:
With or without the 3rd line, the results are different.
Without the 3rd line:
With the 3rd line: