Cannot set up cuda-compat driver when there's alternatives symlink #117

Closed
xkszltl opened this issue Nov 4, 2020 · 22 comments

xkszltl commented Nov 4, 2020

Host has CUDA 10.2 + the 440 driver installed, and I'm trying to use CUDA 11.1 + the 455 driver in Docker.
Surprisingly, I found that when cuda-toolkit-11-1 is installed, the cuda-compat-11-1 driver cannot be loaded and nvidia-smi shows 10.2 instead of 11.1.
Even with all dependencies of cuda-toolkit-11-1 installed (but not the package itself), nvidia-smi still shows 11.1, so the bug is caused by the update-alternatives calls in the post-install scripts of cuda-toolkit-11-1.

% rpm --scripts -qlp ~/Downloads/cuda-toolkit-11-1-11.1.1-1.x86_64.rpm
warning: /Users/xkszltl/Downloads/cuda-toolkit-11-1-11.1.1-1.x86_64.rpm: Header V3 RSA/SHA512 Signature, key ID 7fa2af80: NOKEY
postuninstall scriptlet (using /bin/sh):
update-alternatives --remove cuda /usr/local/cuda-11.1
update-alternatives --remove cuda-11 /usr/local/cuda-11.1
posttrans scriptlet (using /bin/sh):
update-alternatives --install /usr/local/cuda cuda /usr/local/cuda-11.1 11
update-alternatives --install /usr/local/cuda-11 cuda-11 /usr/local/cuda-11.1 11
(contains no files)
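For reference, here's a minimal sketch (not from the original report; paths as in the scriptlets above) of the two-hop symlink chain that update-alternatives leaves behind, which is what the ln-based repro below mimics:

# Sketch only: update-alternatives links the generic name through
# /etc/alternatives instead of pointing it straight at the versioned directory.
update-alternatives --install /usr/local/cuda cuda /usr/local/cuda-11.1 11
readlink /usr/local/cuda          # /etc/alternatives/cuda (absolute)
readlink /etc/alternatives/cuda   # /usr/local/cuda-11.1   (absolute)
readlink -f /usr/local/cuda       # /usr/local/cuda-11.1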

Here's a Dockerfile repro that mimics the update-alternatives behavior with ln:

FROM nvidia/cuda:11.1-base-centos7
RUN ln -sfT /usr/local/cuda-11.1 /etc/alternatives/cuda
RUN ln -sfT /etc/alternatives/cuda /usr/local/cuda

The results differ with and without the 3rd line.

Without the 3rd line:

# cat Dockerfile && sudo docker build --pull --no-cache -t cuda_jump . && sudo docker run --rm -it --gpus all cuda_jump nvidia-smi
FROM nvidia/cuda:11.1-base-centos7
RUN ln -sfT /usr/local/cuda-11.1 /etc/alternatives/cuda
# RUN ln -sfT /etc/alternatives/cuda /usr/local/cuda
Sending build context to Docker daemon  6.868MB
Step 1/2 : FROM nvidia/cuda:11.1-base-centos7
11.1-base-centos7: Pulling from nvidia/cuda
Digest: sha256:759a04c1d9e59cc894889b4edae4684b07ac2f7d20214edf7cf7a43a80f35d22
Status: Image is up to date for nvidia/cuda:11.1-base-centos7
 ---> 165de1193617
Step 2/2 : RUN ln -sfT /usr/local/cuda-11.1 /etc/alternatives/cuda
 ---> Running in ca0b189bce0f
Removing intermediate container ca0b189bce0f
 ---> fcc95533128a
Successfully built fcc95533128a
Successfully tagged cuda_jump:latest
Wed Nov  4 19:04:23 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00001446:00:00.0 Off |                    0 |
| N/A   41C    P0    41W / 300W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00002232:00:00.0 Off |                    0 |
| N/A   42C    P0    44W / 300W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 0000317B:00:00.0 Off |                    0 |
| N/A   40C    P0    43W / 300W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00004A18:00:00.0 Off |                    0 |
| N/A   42C    P0    44W / 300W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  On   | 000059F4:00:00.0 Off |                    0 |
| N/A   40C    P0    44W / 300W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00007C99:00:00.0 Off |                    0 |
| N/A   40C    P0    44W / 300W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 0000B32A:00:00.0 Off |                    0 |
| N/A   39C    P0    43W / 300W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 0000F250:00:00.0 Off |                    0 |
| N/A   40C    P0    46W / 300W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

With the 3rd line:

# cat Dockerfile && sudo docker build --pull --no-cache -t cuda_jump . && sudo docker run --rm -it --gpus all cuda_jump nvidia-smi
FROM nvidia/cuda:11.1-base-centos7
RUN ln -sfT /usr/local/cuda-11.1 /etc/alternatives/cuda
RUN ln -sfT /etc/alternatives/cuda /usr/local/cuda
Sending build context to Docker daemon  6.868MB
Step 1/3 : FROM nvidia/cuda:11.1-base-centos7
11.1-base-centos7: Pulling from nvidia/cuda
Digest: sha256:759a04c1d9e59cc894889b4edae4684b07ac2f7d20214edf7cf7a43a80f35d22
Status: Image is up to date for nvidia/cuda:11.1-base-centos7
 ---> 165de1193617
Step 2/3 : RUN ln -sfT /usr/local/cuda-11.1 /etc/alternatives/cuda
 ---> Running in 7894f6ff3d85
Removing intermediate container 7894f6ff3d85
 ---> feb25628ec57
Step 3/3 : RUN ln -sfT /etc/alternatives/cuda /usr/local/cuda
 ---> Running in e5792cbfc1b1
Removing intermediate container e5792cbfc1b1
 ---> 9de15a3f151c
Successfully built 9de15a3f151c
Successfully tagged cuda_jump:latest
Wed Nov  4 19:03:02 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00001446:00:00.0 Off |                    0 |
| N/A   40C    P0    41W / 300W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00002232:00:00.0 Off |                    0 |
| N/A   42C    P0    44W / 300W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 0000317B:00:00.0 Off |                    0 |
| N/A   40C    P0    43W / 300W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00004A18:00:00.0 Off |                    0 |
| N/A   42C    P0    44W / 300W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  On   | 000059F4:00:00.0 Off |                    0 |
| N/A   40C    P0    44W / 300W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00007C99:00:00.0 Off |                    0 |
| N/A   40C    P0    44W / 300W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 0000B32A:00:00.0 Off |                    0 |
| N/A   39C    P0    43W / 300W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 0000F250:00:00.0 Off |                    0 |
| N/A   40C    P0    46W / 300W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

xkszltl commented Nov 6, 2020

Another strange behavior: the symlink only works with a relative path.
If I run ln -sfT /usr/local/cuda-11.1 /usr/local/cuda to overwrite the existing symlink, it breaks the cuda-compat driver loading.
Here's a repro with a Dockerfile:

FROM nvidia/cuda:11.1-base-centos7
RUN ln -sfT /usr/local/cuda-11.1 /usr/local/cuda && ls /usr/local/cuda
# sudo docker build -t cuda_symlink_repro . && sudo docker run --rm -it --gpus all cuda_symlink_repro nvidia-smi
Sending build context to Docker daemon  4.272MB
Step 1/2 : FROM nvidia/cuda:11.1-base-centos7
 ---> 165de1193617
Step 2/2 : RUN ln -sfT /usr/local/cuda-11.1 /usr/local/cuda && ls /usr/local/cuda
 ---> Running in 2a44714ad6b1
compat
lib64
targets
Removing intermediate container 2a44714ad6b1
 ---> ff4ee1fe4249
Successfully built ff4ee1fe4249
Successfully tagged cuda_symlink_repro:latest
Fri Nov  6 11:58:56 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
...

But ln -sfT cuda-11.1 /usr/local/cuda, as below, doesn't break anything:

FROM nvidia/cuda:11.1-base-centos7
RUN ln -sfT cuda-11.1 /usr/local/cuda && ls /usr/local/cuda
# sudo docker build -t cuda_symlink_repro . && sudo docker run --rm -it --gpus all cuda_symlink_repro nvidia-smi
Sending build context to Docker daemon  4.272MB
Step 1/2 : FROM nvidia/cuda:11.1-base-centos7
 ---> 165de1193617
Step 2/2 : RUN ln -sfT cuda-11.1 /usr/local/cuda && ls /usr/local/cuda
 ---> Running in f6db1253ab35
compat
lib64
targets
Removing intermediate container f6db1253ab35
 ---> 621853331d81
Successfully built 621853331d81
Successfully tagged cuda_symlink_repro:latest
Fri Nov  6 12:02:18 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
...


klueska commented Nov 6, 2020

Host has CUDA 10.2 + the 440 driver installed, and I'm trying to use CUDA 11.1 + the 455 driver in Docker.

What you are trying to do is not possible. There is no way to "install" the driver into the container because it is a kernel component. The driver from the host must be injected into the container.


xkszltl commented Nov 6, 2020

@klueska
I'm not trying to install the kernel driver; instead I'm trying to get a CUDA 11.1 container to work on a CUDA 10.2 host, and the "driver" I talked about is the one in the cuda-compat package (maybe "driver" is not the exact term for it; let me know what I should call it).


xkszltl commented Nov 6, 2020

This is an officially supported usage: https://docs.nvidia.com/deploy/cuda-compatibility/index.html#cuda-compatibility-platform
The issue I mentioned is that nvidia-docker fails to work when /usr/local/cuda and /usr/local/cuda-11.1 are

  • not directly symlinked
  • or symlinked by absolute path


klueska commented Nov 9, 2020

Since this seems to be an issue with something inside the container image / the packages being installed inside the container image (and not actually libnvidia-container itself), I think you'll have more luck reporting it here:

https://forums.developer.nvidia.com/c/accelerated-computing/cuda/cuda-setup-and-installation/8


xkszltl commented Nov 9, 2020

Since this seems to be an issue with something inside the container image / the packages being installed inside the container image (and not actually libnvidia-container itself), I think you'll have more luck reporting it here:

https://forums.developer.nvidia.com/c/accelerated-computing/cuda/cuda-setup-and-installation/8

An image is just a collection of files, and if I understand it correctly, libnvidia-container is the one loading libcuda.so and libnvidia-ptxjitcompiler.so from the image.
From the filesystem's point of view, the symlink chains below are equivalent:

  • /usr/local/cuda -> cuda-11.1
  • /usr/local/cuda -> /usr/local/cuda-11.1
  • /usr/local/cuda -> /etc/alternatives/cuda -> /usr/local/cuda-11.1

But only the first one works.
Why is this not a bug in libnvidia-container?


klueska commented Nov 10, 2020

I think there is some confusion.

libnvidia-container doesn't look at the container image at all.
It takes the nvidia driver libraries available on the host and bind mounts them into the container.

The set of driver libraries it bind mounts are all of the form libnvidia-*.so.<driver_version>. Once bind-mounted, it runs ldconfig over these libraries to autogenerate the libnvidia-*.so.1 symlinks to them inside the container.
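Roughly, that step amounts to the following inside the container (a sketch only; Debian/Ubuntu library path assumed, a CentOS image would use /usr/lib64):

# The versioned driver library is bind-mounted from the host, then ldconfig
# regenerates the SONAME symlink next to it.
ls /usr/lib/x86_64-linux-gnu/libcuda.so.440.33.01   # bind-mounted by libnvidia-container
ldconfig                                            # rebuilds the cache and SONAME links
readlink /usr/lib/x86_64-linux-gnu/libcuda.so.1     # -> libcuda.so.440.33.01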

As far as I understand it, you have both CUDA 10.2 and the nvidia 440 driver installed on the host. The CUDA installation on the host is actually irrelevant, as libnvidia-container will only inject libraries from the 440 driver. The full set of (possible) libraries can be seen here (https://github.com/NVIDIA/libnvidia-container/blob/master/src/nvc_info.c#L75).

With the container image you are using (i.e. nvidia/cuda:11.1-base-centos7), the libraries you will get injected are the utility_libs and the compute_libs from this list, i.e.:

static const char * const utility_libs[] = {
        "libnvidia-ml.so",                  /* Management library */
        "libnvidia-cfg.so",                 /* GPU configuration */
};

static const char * const compute_libs[] = {
        "libcuda.so",                       /* CUDA driver library */
        "libnvidia-opencl.so",              /* NVIDIA OpenCL ICD */
        "libnvidia-ptxjitcompiler.so",      /* PTX-SASS JIT compiler (used by libcuda) */
        "libnvidia-fatbinaryloader.so",     /* fatbin loader (used by libcuda) */
        "libnvidia-allocator.so",           /* NVIDIA allocator runtime library */
        "libnvidia-compiler.so",            /* NVVM-PTX compiler for OpenCL (used by libnvidia-opencl) */
};

This happens because the container image sets the environment variable NVIDIA_DRIVER_CAPABILITIES=compute,utility, as described here: https://github.com/NVIDIA/nvidia-container-runtime#nvidia_driver_capabilities
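If needed, the capability set can also be overridden at run time rather than relying on what the image bakes in (a sketch; the variable is per the linked nvidia-container-runtime docs):

# Request only the compute and utility driver libraries for this run.
docker run --rm -it --gpus all \
    -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
    nvidia/cuda:11.1-base-centos7 nvidia-smi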

If you install something inside the image that "overrides" these bind-mounted libraries from the host, then that is something completely out of libnvidia-container's control. It only takes what it sees on the host and injects it into the container -- it does nothing with what is installed inside the container image.

From what I understand, you are trying to install cuda-toolkit-11-1 and cuda-compat-11-1 inside the container (the latter of which will attempt to override the bind-mounted libraries for libcuda and libnvidia-ptxjitcompiler).

In theory, this should be possible, so long as you always run the container on a host with a 440+ driver installed (so you get the prerequisite 440 driver libraries bind-mounted into your container) and you ensure that the compat libraries "override" the ones injected from the 440 driver.

However, it is not up to libnvidia-container to ensure that this "override" is done properly inside the container. libnvidia-container simply injects what it sees on the host, and nothing more.
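One quick way to check whether that "override" actually took effect inside a running container is to ask the dynamic loader directly (a sketch; the compat path is the one shipped by the cuda-compat package):

# Which libcuda will actually be resolved, and what the compat package shipped:
ldconfig -p | grep libcuda.so.1
ls -la /usr/local/cuda/compat/
# If the override worked, libcuda.so.1 should point at the 455-series compat
# library rather than the 440 library injected from the host.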


klueska commented Nov 10, 2020

Now, if you were to install cuda-compat-11-1 on the host (and not in the container) and still had problems, then I would consider that an issue with libnvidia-container because it should pick up the "overridden" libraries on the host and inject those into the container instead of the ones from the 440 driver.
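For completeness, a sketch of that host-side alternative (package name as in the CUDA repos; the exact package-manager command depends on the host distribution):

# Install the forward-compat libraries on the HOST instead of in the image,
# e.g. on a RHEL-family host:
sudo yum install -y cuda-compat-11-1
# libnvidia-container should then pick up the 455 compat libcuda/ptxjitcompiler
# from the host and inject those instead of the plain 440 driver libraries.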


xkszltl commented Nov 10, 2020

So if I understand correctly, the workflow is:

  1. libnvidia-container injects these from the host into the container:
static const char * const compute_libs[] = {
        "libcuda.so",                       /* CUDA driver library */
        "libnvidia-opencl.so",              /* NVIDIA OpenCL ICD */
        "libnvidia-ptxjitcompiler.so",      /* PTX-SASS JIT compiler (used by libcuda) */
        "libnvidia-fatbinaryloader.so",     /* fatbin loader (used by libcuda) */
        "libnvidia-allocator.so",           /* NVIDIA allocator runtime library */
        "libnvidia-compiler.so",            /* NVVM-PTX compiler for OpenCL (used by libnvidia-opencl) */
};
  2. cuda-compat libs preinstalled in the container somehow have higher priority than those from step 1, maybe via some ld configs
  3. libnvidia-container uses ldconfig to pick up the preferred one inside the container

Then probably there's something wrong in the last step, causing only the relative-path symlink to work.
Maybe the root seen by libnvidia-container is not the root inside the container?


klueska commented Nov 10, 2020

As far as libnvidia-container is concerned, only step (1) happens.
After it's done injecting these libraries, it disappears and docker runs exactly as it would if you had injected these libraries yourself with -v flags on the docker run call.
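Purely as an illustration of that analogy (not a working command: the device nodes and host library paths shown are assumptions, and most of the mounted files are omitted), the injection amounts to something like:

# Illustrative only -- the real tool also mounts nvidia-smi, libnvidia-ml,
# the remaining compute/utility libraries, and then runs ldconfig inside
# the container.
docker run --rm -it \
    --device /dev/nvidiactl --device /dev/nvidia0 --device /dev/nvidia-uvm \
    -v /usr/lib64/libcuda.so.440.33.01:/usr/lib64/libcuda.so.440.33.01:ro \
    -v /usr/lib64/libnvidia-ptxjitcompiler.so.440.33.01:/usr/lib64/libnvidia-ptxjitcompiler.so.440.33.01:ro \
    nvidia/cuda:11.1-base-centos7 bash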


xkszltl commented Nov 11, 2020

Thanks a lot for the detailed explanation.
I'll file a ticket on the CUDA forum as well.

Besides, could you clarify who's running ldconfig?

Once bind-mounted, it runs ldconfig over these libraries to autogenerate the libnvidia-*.so.1 symlinks to them inside the container.

Who is the "it"?
I thought "it" was libnvidia-container, which is why I was talking about step 3. Maybe I misunderstood what you said.


klueska commented Nov 11, 2020

Oh right, yes, libnvidia-container is the one who runs ldconfig from within the container.

And I actually misspoke before. I apologise.

libnvidia-container does, in fact, inspect the container image for the compat libraries. It doesn't look at the image for any other libraries, but it special-cases the compat libraries so that it can pull them up into standard library paths and make sure they are loaded when it makes its call to ldconfig.

I had forgotten about this detail, and apologise again for the confusion.

What this means, however, is that the issue you are reporting here is, in fact, a bug in libnvidia-container.

libnvidia-container is hard-coded to look under /usr/local/cuda/compat/ for the compatibility libraries. So long as this path only points through relative symlinks, libnvidia-container is able to resolve the libs contained underneath it. This is true, for example, when the symlink setup is /usr/local/cuda --> ./cuda-11.1.

However, libnvidia-container is not able to resolve absolute symlinks inside the container (e.g. /usr/local/cuda --> /etc/alternatives/cuda) because it doesn't actually inspect the symlink and prepend the rootfs of wherever docker has unpacked the container image. This results in libnvidia-container thinking that /usr/local/cuda --> /etc/alternatives/cuda is a broken symlink, and stopping the search for the compat libraries down that path.
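You can see the effect from the host side with a sketch like this (the rootfs path is hypothetical; substitute wherever your runtime unpacked the image):

# From the HOST, an absolute symlink inside the rootfs resolves against the
# host's /, not against the container rootfs, so the glob for compat libs
# dead-ends.
ROOTFS=/var/lib/docker/overlay2/<id>/merged   # hypothetical path
readlink "$ROOTFS/usr/local/cuda"             # /etc/alternatives/cuda
ls "$ROOTFS/usr/local/cuda/compat/"           # resolves via the host's
                                              # /etc/alternatives/cuda -- wrong root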

The relevant code is here:
https://github.com/NVIDIA/libnvidia-container/blob/master/src/nvc_container.c#L195

This ultimately results in libnvidia-container creating a blank file for the compat lib, rather than copying it over.

You can see this in your (broken) example with:

# ls -la /usr/lib/x86_64-linux-gnu/libcuda.so*
lrwxrwxrwx 1 root root       12 Nov 11 10:12 /usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1
lrwxrwxrwx 1 root root       18 Nov 11 10:31 /usr/lib/x86_64-linux-gnu/libcuda.so.1 -> libcuda.so.440.33.01
-rw-r--r-- 1 root root 15672664 Feb  6  2019 /usr/lib/x86_64-linux-gnu/libcuda.so.440.33.01
-rw-r--r-- 1 root root        0 Nov 11 10:12 /usr/lib/x86_64-linux-gnu/libcuda.so.455.32.00

vs. the working example with:

# ls -la /usr/lib/x86_64-linux-gnu/libcuda.so*
lrwxrwxrwx 1 root root       12 Nov 11 10:33 /usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1
lrwxrwxrwx 1 root root       20 Nov 11 10:33 /usr/lib/x86_64-linux-gnu/libcuda.so.1 -> libcuda.so.455.32.00
-rw-r--r-- 1 root root 15672664 Feb  6  2019 /usr/lib/x86_64-linux-gnu/libcuda.so.440.33.01
-rw-r--r-- 1 root root 21074296 Oct 14 22:58 /usr/lib/x86_64-linux-gnu/libcuda.so.455.32.00

Also, manually running ldconfig in the broken example will result in:

# ldconfig
/sbin/ldconfig.real: File /lib/x86_64-linux-gnu/libcuda.so.455.32.00 is empty, not checked.
/sbin/ldconfig.real: File /lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.455.32.00 is empty, not checked.

I don't have a good fix for this off the top of my head, but at least we've gotten to the root of the problem and identified that it is in fact an issue with libnvidia-container.

Thanks for pushing me to explain this in detail, or I never would have gotten to the root cause.

We will work on a fix for this in the next release.


xkszltl commented Nov 11, 2020

Great, that matches my observations.
Without chroot, can the image use /usr/local/cuda -> ../../../../../../../<path to host file> to steal things from the host root?
Sounds like a security issue to me.


klueska commented Nov 11, 2020

No, there appears to be a safeguard on .. paths terminating at the container's root.
See: https://github.com/NVIDIA/libnvidia-container/blob/master/src/utils.c#L806


xkszltl commented Nov 11, 2020

Oh nice, you already have a resolver!
Then it should be very straightforward: just prepend the rootfs (memcpy into a new buffer) around here:


klueska commented Nov 11, 2020

The solution is actually this:

diff --git a/src/nvc_container.c b/src/nvc_container.c
index 825d9d3..21e3d00 100644
--- a/src/nvc_container.c
+++ b/src/nvc_container.c
@@ -184,6 +184,7 @@ find_namespace_path(struct error *err, const struct nvc_container *cnt, const ch
 static int
 find_library_paths(struct error *err, struct nvc_container *cnt)
 {
+        char path0[PATH_MAX];
         char path[PATH_MAX];
         glob_t gl;
         int rv = -1;
@@ -192,7 +193,9 @@ find_library_paths(struct error *err, struct nvc_container *cnt)
         if (!(cnt->flags & OPT_COMPUTE_LIBS))
                 return (0);

-        if (path_join(err, path, cnt->cfg.rootfs, cnt->cfg.cudart_dir) < 0)
+        if (path_resolve(err, path0, cnt->cfg.rootfs, cnt->cfg.cudart_dir) < 0)
+                return (-1);
+        if (path_join(err, path, cnt->cfg.rootfs, path0) < 0)
                 return (-1);
         if (path_append(err, path, "compat/lib*.so.*") < 0)
                 return (-1);


klueska commented Dec 14, 2020

This has now been fixed and will be included in the 1.3.1 release of libnvidia-container coming out later this week:
3883db3
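A quick way to confirm the fix once 1.3.1 is installed (a sketch reusing the cuda_jump image from the original repro):

# With libnvidia-container >= 1.3.1 on the host, the alternatives-style chain
# should resolve and the compat driver should load.
sudo docker run --rm --gpus all cuda_jump nvidia-smi | grep "CUDA Version"
# Expected: CUDA Version: 11.1 (instead of the host's 10.2)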

klueska closed this as completed Dec 14, 2020

klueska commented Dec 15, 2020

libnvidia-container v1.3.1 has now been released. Please try it out and confirm that it resolves your problem.


Davidrjx commented Jul 13, 2023

libnvidia-container v1.3.1 has now been released. Please try it out and confirm that it resolves your problem.

Hi @klueska, this seems to be unresolved still.
In my case, libnvidia-container1:amd64 1.12.0-1 is installed on Debian 10 (buster), and ldconfig from within the container outputs:

(base) root@task000001-rjx:/output/workspace# ldconfig
/sbin/ldconfig.real: File /lib/x86_64-linux-gnu/libnvidia-ml.so.1 is empty, not checked.
/sbin/ldconfig.real: File /lib/x86_64-linux-gnu/libdxcore.so is empty, not checked.
/sbin/ldconfig.real: File /lib/x86_64-linux-gnu/libcuda.so.1 is empty, not checked.
/sbin/ldconfig.real: File /lib/x86_64-linux-gnu/libcuda.so is empty, not checked.
/sbin/ldconfig.real: /lib/x86_64-linux-gnu/libcuda.so.1 is not a symbolic link

/sbin/ldconfig.real: /lib/x86_64-linux-gnu/libnvidia-ml.so.1 is not a symbolic link

Davidrjx commented Jul 13, 2023

and also the nvidia-toolkit.log output from the host:

-- WARNING, the following logs are for debugging purposes only --

I0712 15:58:44.534929 184795 nvc.c:376] initializing library context (version=1.12.0, build=7678e1af094d865441d0bc1b97c3e72d15fcab50)
I0712 15:58:44.535038 184795 nvc.c:350] using root /
I0712 15:58:44.535058 184795 nvc.c:351] using ldcache /etc/ld.so.cache
I0712 15:58:44.535076 184795 nvc.c:352] using unprivileged user 65534:65534
I0712 15:58:44.535116 184795 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0712 15:58:44.535432 184795 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment
I0712 15:58:44.542777 184802 nvc.c:278] loading kernel module nvidia
I0712 15:58:44.543098 184802 nvc.c:282] running mknod for /dev/nvidiactl
I0712 15:58:44.543196 184802 nvc.c:286] running mknod for /dev/nvidia0
I0712 15:58:44.543266 184802 nvc.c:286] running mknod for /dev/nvidia1
I0712 15:58:44.543335 184802 nvc.c:286] running mknod for /dev/nvidia2
I0712 15:58:44.543400 184802 nvc.c:286] running mknod for /dev/nvidia3
I0712 15:58:44.543465 184802 nvc.c:286] running mknod for /dev/nvidia4
I0712 15:58:44.543535 184802 nvc.c:286] running mknod for /dev/nvidia5
I0712 15:58:44.543607 184802 nvc.c:286] running mknod for /dev/nvidia6
I0712 15:58:44.543674 184802 nvc.c:286] running mknod for /dev/nvidia7
I0712 15:58:44.543737 184802 nvc.c:290] running mknod for all nvcaps in /dev/nvidia-caps
I0712 15:58:44.555573 184802 nvc.c:218] running mknod for /dev/nvidia-caps/nvidia-cap1 from /proc/driver/nvidia/capabilities/mig/config
I0712 15:58:44.555812 184802 nvc.c:218] running mknod for /dev/nvidia-caps/nvidia-cap2 from /proc/driver/nvidia/capabilities/mig/monitor
I0712 15:58:44.564196 184802 nvc.c:296] loading kernel module nvidia_uvm
I0712 15:58:44.564265 184802 nvc.c:300] running mknod for /dev/nvidia-uvm
I0712 15:58:44.564406 184802 nvc.c:305] loading kernel module nvidia_modeset
I0712 15:58:44.564524 184802 nvc.c:309] running mknod for /dev/nvidia-modeset
I0712 15:58:44.565117 184803 rpc.c:71] starting driver rpc service
I0712 15:58:44.571962 184804 rpc.c:71] starting nvcgo rpc service
I0712 15:58:44.573356 184795 nvc_container.c:240] configuring container with 'compute utility supervised'
I0712 15:58:44.575906 184795 nvc_container.c:262] setting pid to 184789
I0712 15:58:44.575939 184795 nvc_container.c:263] setting rootfs to /run/containerd/io.containerd.runtime.v2.task/k8s.io/6ce2aa648f86f11d1fcd3281bdc4a60eb56d8c267b1536b9e010f0fe9ff75f79/rootfs
I0712 15:58:44.575959 184795 nvc_container.c:264] setting owner to 0:0
I0712 15:58:44.575977 184795 nvc_container.c:265] setting bins directory to /usr/bin
I0712 15:58:44.575995 184795 nvc_container.c:266] setting libs directory to /usr/lib/x86_64-linux-gnu
I0712 15:58:44.576013 184795 nvc_container.c:267] setting libs32 directory to /usr/lib/i386-linux-gnu
I0712 15:58:44.576030 184795 nvc_container.c:268] setting cudart directory to /usr/local/cuda
I0712 15:58:44.576048 184795 nvc_container.c:269] setting ldconfig to @/sbin/ldconfig (host relative)
I0712 15:58:44.576066 184795 nvc_container.c:270] setting mount namespace to /proc/184789/ns/mnt
I0712 15:58:44.576083 184795 nvc_container.c:272] detected cgroupv1
I0712 15:58:44.576101 184795 nvc_container.c:273] setting devices cgroup to /sys/fs/cgroup/devices/kubepods.slice/kubepods-pod2fbd3029_c5d3_4822_84f9_ea5470988e46.slice/cri-containerd-
6ce2aa648f86f11d1fcd3281bdc4a60eb56d8c267b1536b9e010f0fe9ff75f79.scope
I0712 15:58:44.576127 184795 nvc_info.c:767] requesting driver information with ''
I0712 15:58:44.577930 184795 nvc_info.c:174] selecting /usr/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia.so.525.78.01
I0712 15:58:44.578135 184795 nvc_info.c:174] selecting /usr/lib/x86_64-linux-gnu/libnvoptix.so.525.78.01
I0712 15:58:44.578285 184795 nvc_info.c:174] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.525.78.01
I0712 15:58:44.578380 184795 nvc_info.c:174] selecting /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.525.78.01
I0712 15:58:44.578483 184795 nvc_info.c:174] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.525.78.01
I0712 15:58:44.578651 184795 nvc_info.c:174] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.525.78.01
I0712 15:58:44.578800 184795 nvc_info.c:174] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.525.78.01
I0712 15:58:44.578899 184795 nvc_info.c:174] selecting /usr/lib/x86_64-linux-gnu/libnvidia-nvvm.so.525.78.01
I0712 15:58:44.579046 184795 nvc_info.c:174] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.525.78.01
I0712 15:58:44.579145 184795 nvc_info.c:174] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.525.78.01
I0712 15:58:44.579294 184795 nvc_info.c:174] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.525.78.01
I0712 15:58:44.579390 184795 nvc_info.c:174] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.525.78.01
I0712 15:58:44.579484 184795 nvc_info.c:174] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.525.78.01
I0712 15:58:44.579582 184795 nvc_info.c:174] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.525.78.01
I0712 15:58:44.579730 184795 nvc_info.c:174] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.525.78.01
I0712 15:58:44.579875 184795 nvc_info.c:174] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.525.78.01
I0712 15:58:44.579975 184795 nvc_info.c:174] selecting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.525.78.01
I0712 15:58:44.580071 184795 nvc_info.c:174] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.525.78.01
I0712 15:58:44.580221 184795 nvc_info.c:174] selecting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.525.78.01
I0712 15:58:44.580368 184795 nvc_info.c:174] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.525.78.01
I0712 15:58:44.580656 184795 nvc_info.c:174] selecting /usr/lib/x86_64-linux-gnu/libcudadebugger.so.525.78.01
I0712 15:58:44.580769 184795 nvc_info.c:174] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.525.78.01
I0712 15:58:44.581008 184795 nvc_info.c:174] selecting /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.525.78.01
I0712 15:58:44.581108 184795 nvc_info.c:174] selecting /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.525.78.01
I0712 15:58:44.581203 184795 nvc_info.c:174] selecting /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.525.78.01
I0712 15:58:44.581306 184795 nvc_info.c:174] selecting /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.525.78.01
I0712 15:58:44.581409 184795 nvc_info.c:174] selecting /usr/lib32/vdpau/libvdpau_nvidia.so.525.78.01
I0712 15:58:44.581516 184795 nvc_info.c:174] selecting /usr/lib32/libnvidia-tls.so.525.78.01
I0712 15:58:44.581603 184795 nvc_info.c:174] selecting /usr/lib32/libnvidia-ptxjitcompiler.so.525.78.01
I0712 15:58:44.581738 184795 nvc_info.c:174] selecting /usr/lib32/libnvidia-opticalflow.so.525.78.01
I0712 15:58:44.581870 184795 nvc_info.c:174] selecting /usr/lib32/libnvidia-opencl.so.525.78.01
I0712 15:58:44.581958 184795 nvc_info.c:174] selecting /usr/lib32/libnvidia-nvvm.so.525.78.01
I0712 15:58:44.582089 184795 nvc_info.c:174] selecting /usr/lib32/libnvidia-ml.so.525.78.01
I0712 15:58:44.582218 184795 nvc_info.c:174] selecting /usr/lib32/libnvidia-glvkspirv.so.525.78.01
I0712 15:58:44.582304 184795 nvc_info.c:174] selecting /usr/lib32/libnvidia-glsi.so.525.78.01
I0712 15:58:44.582390 184795 nvc_info.c:174] selecting /usr/lib32/libnvidia-glcore.so.525.78.01
I0712 15:58:44.582479 184795 nvc_info.c:174] selecting /usr/lib32/libnvidia-fbc.so.525.78.01
I0712 15:58:44.582607 184795 nvc_info.c:174] selecting /usr/lib32/libnvidia-encode.so.525.78.01
I0712 15:58:44.582736 184795 nvc_info.c:174] selecting /usr/lib32/libnvidia-eglcore.so.525.78.01
I0712 15:58:44.582822 184795 nvc_info.c:174] selecting /usr/lib32/libnvidia-compiler.so.525.78.01
I0712 15:58:44.582912 184795 nvc_info.c:174] selecting /usr/lib32/libnvidia-allocator.so.525.78.01
I0712 15:58:44.583042 184795 nvc_info.c:174] selecting /usr/lib32/libnvcuvid.so.525.78.01
I0712 15:58:44.583190 184795 nvc_info.c:174] selecting /usr/lib32/libcuda.so.525.78.01
I0712 15:58:44.583331 184795 nvc_info.c:174] selecting /usr/lib32/libGLX_nvidia.so.525.78.01
I0712 15:58:44.583420 184795 nvc_info.c:174] selecting /usr/lib32/libGLESv2_nvidia.so.525.78.01
I0712 15:58:44.583515 184795 nvc_info.c:174] selecting /usr/lib32/libGLESv1_CM_nvidia.so.525.78.01
I0712 15:58:44.583601 184795 nvc_info.c:174] selecting /usr/lib32/libEGL_nvidia.so.525.78.01
W0712 15:58:44.583641 184795 nvc_info.c:400] missing library libnvidia-nscq.so
W0712 15:58:44.583660 184795 nvc_info.c:400] missing library libnvidia-fatbinaryloader.so
W0712 15:58:44.583678 184795 nvc_info.c:400] missing library libnvidia-pkcs11.so
W0712 15:58:44.583696 184795 nvc_info.c:400] missing library libnvidia-ifr.so
W0712 15:58:44.583714 184795 nvc_info.c:400] missing library libnvidia-cbl.so
W0712 15:58:44.583732 184795 nvc_info.c:404] missing compat32 library libnvidia-cfg.so
W0712 15:58:44.583750 184795 nvc_info.c:404] missing compat32 library libnvidia-nscq.so
W0712 15:58:44.583768 184795 nvc_info.c:404] missing compat32 library libcudadebugger.so
W0712 15:58:44.583786 184795 nvc_info.c:404] missing compat32 library libnvidia-fatbinaryloader.so
W0712 15:58:44.583804 184795 nvc_info.c:404] missing compat32 library libnvidia-pkcs11.so
W0712 15:58:44.583822 184795 nvc_info.c:404] missing compat32 library libnvidia-ngx.so
W0712 15:58:44.583840 184795 nvc_info.c:404] missing compat32 library libnvidia-ifr.so
W0712 15:58:44.583857 184795 nvc_info.c:404] missing compat32 library libnvidia-rtcore.so
W0712 15:58:44.583875 184795 nvc_info.c:404] missing compat32 library libnvoptix.so
W0712 15:58:44.583894 184795 nvc_info.c:404] missing compat32 library libnvidia-cbl.so
I0712 15:58:44.585063 184795 nvc_info.c:300] selecting /usr/bin/nvidia-smi
I0712 15:58:44.585133 184795 nvc_info.c:300] selecting /usr/bin/nvidia-debugdump
I0712 15:58:44.585198 184795 nvc_info.c:300] selecting /usr/bin/nvidia-persistenced
I0712 15:58:44.585303 184795 nvc_info.c:300] selecting /usr/bin/nvidia-cuda-mps-control
I0712 15:58:44.585368 184795 nvc_info.c:300] selecting /usr/bin/nvidia-cuda-mps-server
W0712 15:58:44.585499 184795 nvc_info.c:426] missing binary nv-fabricmanager
W0712 15:58:44.585590 184795 nvc_info.c:350] missing firmware path /lib/firmware/nvidia/525.78.01/gsp.bin
I0712 15:58:44.585672 184795 nvc_info.c:530] listing device /dev/nvidiactl
I0712 15:58:44.585691 184795 nvc_info.c:530] listing device /dev/nvidia-uvm
I0712 15:58:44.585709 184795 nvc_info.c:530] listing device /dev/nvidia-uvm-tools
I0712 15:58:44.585727 184795 nvc_info.c:530] listing device /dev/nvidia-modeset
W0712 15:58:44.585802 184795 nvc_info.c:350] missing ipc path /var/run/nvidia-persistenced/socket
W0712 15:58:44.585870 184795 nvc_info.c:350] missing ipc path /var/run/nvidia-fabricmanager/socket
W0712 15:58:44.585921 184795 nvc_info.c:350] missing ipc path /tmp/nvidia-mps
I0712 15:58:44.585939 184795 nvc_info.c:823] requesting device information with ''
I0712 15:58:44.594395 184795 nvc_info.c:714] listing device /dev/nvidia2 (GPU-0aa01954-d0c5-fbf3-c29c-a05b4e811f07 at 00000000:01:00.0)
I0712 15:58:44.601933 184795 nvc_info.c:714] listing device /dev/nvidia3 (GPU-bc63c41b-20e3-859b-ffe7-b8c9e41ec1fb at 00000000:25:00.0)
I0712 15:58:44.609552 184795 nvc_info.c:714] listing device /dev/nvidia1 (GPU-295ccbe3-848d-e8b7-4f60-7c52b80178a2 at 00000000:41:00.0)
I0712 15:58:44.616663 184795 nvc_info.c:714] listing device /dev/nvidia0 (GPU-233d9618-8654-bfca-1396-f8be8400d533 at 00000000:61:00.0)
I0712 15:58:44.624369 184795 nvc_info.c:714] listing device /dev/nvidia6 (GPU-d3ea9d3f-ce6c-33ff-31a2-be503a31725f at 00000000:81:00.0)
I0712 15:58:44.632248 184795 nvc_info.c:714] listing device /dev/nvidia7 (GPU-e2c51946-8ac3-931b-6483-0c02fd41ce7c at 00000000:a1:00.0)
I0712 15:58:44.640023 184795 nvc_info.c:714] listing device /dev/nvidia5 (GPU-8e05d48c-d3fa-88c2-c4b1-d56719ca3903 at 00000000:c1:00.0)
I0712 15:58:44.648859 184795 nvc_info.c:714] listing device /dev/nvidia4 (GPU-0206ee55-ba83-a95c-3cab-b112b6af2a37 at 00000000:e1:00.0)
I0712 15:58:44.649017 184795 nvc_mount.c:366] mounting tmpfs at /run/containerd/io.containerd.runtime.v2.task/k8s.io/6ce2aa648f86f11d1fcd3281bdc4a60eb56d8c267b1536b9e010f0fe9ff75f79/ro
otfs/proc/driver/nvidia
E0712 15:58:44.649570 184795 utils.c:529] The path /run/containerd/io.containerd.runtime.v2.task/k8s.io/6ce2aa648f86f11d1fcd3281bdc4a60eb56d8c267b1536b9e010f0fe9ff75f79/rootfs/usr/bin
alreay exists with the required mode; skipping create
I0712 15:58:44.649648 184795 nvc_mount.c:134] mounting /usr/bin/nvidia-smi at /run/containerd/io.containerd.runtime.v2.task/k8s.io/6ce2aa648f86f11d1fcd3281bdc4a60eb56d8c267b1536b9e010f
0fe9ff75f79/rootfs/usr/bin/nvidia-smi
I0712 15:58:44.649967 184795 nvc_mount.c:134] mounting /usr/bin/nvidia-debugdump at /run/containerd/io.containerd.runtime.v2.task/k8s.io/6ce2aa648f86f11d1fcd3281bdc4a60eb56d8c267b1536b
9e010f0fe9ff75f79/rootfs/usr/bin/nvidia-debugdump
I0712 15:58:44.650109 184795 nvc_mount.c:134] mounting /usr/bin/nvidia-persistenced at /run/containerd/io.containerd.runtime.v2.task/k8s.io/6ce2aa648f86f11d1fcd3281bdc4a60eb56d8c267b15
36b9e010f0fe9ff75f79/rootfs/usr/bin/nvidia-persistenced
I0712 15:58:44.650247 184795 nvc_mount.c:134] mounting /usr/bin/nvidia-cuda-mps-control at /run/containerd/io.containerd.runtime.v2.task/k8s.io/6ce2aa648f86f11d1fcd3281bdc4a60eb56d8c26
7b1536b9e010f0fe9ff75f79/rootfs/usr/bin/nvidia-cuda-mps-control
I0712 15:58:44.650377 184795 nvc_mount.c:134] mounting /usr/bin/nvidia-cuda-mps-server at /run/containerd/io.containerd.runtime.v2.task/k8s.io/6ce2aa648f86f11d1fcd3281bdc4a60eb56d8c267
b1536b9e010f0fe9ff75f79/rootfs/usr/bin/nvidia-cuda-mps-server
E0712 15:58:44.650541 184795 utils.c:529] The path /run/containerd/io.containerd.runtime.v2.task/k8s.io/6ce2aa648f86f11d1fcd3281bdc4a60eb56d8c267b1536b9e010f0fe9ff75f79/rootfs/usr/lib/
x86_64-linux-gnu alreay exists with the required mode; skipping create
I0712 15:58:44.650760 184795 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.525.78.01 at /run/containerd/io.containerd.runtime.v2.task/k8s.io/6ce2aa648f86f11d1fcd3
281bdc4a60eb56d8c267b1536b9e010f0fe9ff75f79/rootfs/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.525.78.01
I0712 15:58:44.650904 184795 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.525.78.01 at /run/containerd/io.containerd.runtime.v2.task/k8s.io/6ce2aa648f86f11d1fcd
3281bdc4a60eb56d8c267b1536b9e010f0fe9ff75f79/rootfs/usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.525.78.01
I0712 15:58:44.651036 184795 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libcuda.so.525.78.01 at /run/containerd/io.containerd.runtime.v2.task/k8s.io/6ce2aa648f86f11d1fcd3281bd
c4a60eb56d8c267b1536b9e010f0fe9ff75f79/rootfs/usr/lib/x86_64-linux-gnu/libcuda.so.525.78.01
I0712 15:58:44.651168 184795 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libcudadebugger.so.525.78.01 at /run/containerd/io.containerd.runtime.v2.task/k8s.io/6ce2aa648f86f11d1f
cd3281bdc4a60eb56d8c267b1536b9e010f0fe9ff75f79/rootfs/usr/lib/x86_64-linux-gnu/libcudadebugger.so.525.78.01
I0712 15:58:44.651301 184795 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.525.78.01 at /run/containerd/io.containerd.runtime.v2.task/k8s.io/6ce2aa648f86f11d1
fcd3281bdc4a60eb56d8c267b1536b9e010f0fe9ff75f79/rootfs/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.525.78.01
I0712 15:58:44.651443 184795 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.525.78.01 at /run/containerd/io.containerd.runtime.v2.task/k8s.io/6ce2aa648
f86f11d1fcd3281bdc4a60eb56d8c267b1536b9e010f0fe9ff75f79/rootfs/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.525.78.01
I0712 15:58:44.651586 184795 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.525.78.01 at /run/containerd/io.containerd.runtime.v2.task/k8s.io/6ce2aa648f86f1
1d1fcd3281bdc4a60eb56d8c267b1536b9e010f0fe9ff75f79/rootfs/usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.525.78.01
I0712 15:58:44.651719 184795 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.525.78.01 at /run/containerd/io.containerd.runtime.v2.task/k8s.io/6ce2aa648f86f11
d1fcd3281bdc4a60eb56d8c267b1536b9e010f0fe9ff75f79/rootfs/usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.525.78.01
I0712 15:58:44.651850 184795 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-nvvm.so.525.78.01 at /run/containerd/io.containerd.runtime.v2.task/k8s.io/6ce2aa648f86f11d1fc
d3281bdc4a60eb56d8c267b1536b9e010f0fe9ff75f79/rootfs/usr/lib/x86_64-linux-gnu/libnvidia-nvvm.so.525.78.01
I0712 15:58:44.651929 184795 nvc_mount.c:527] creating symlink /run/containerd/io.containerd.runtime.v2.task/k8s.io/6ce2aa648f86f11d1fcd3281bdc4a60eb56d8c267b1536b9e010f0fe9ff75f79/roo
tfs/usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1
E0712 15:58:44.652035 184795 utils.c:529] The path /run/containerd/io.containerd.runtime.v2.task/k8s.io/6ce2aa648f86f11d1fcd3281bdc4a60eb56d8c267b1536b9e010f0fe9ff75f79/rootfs/dev/nvid
iactl alreay exists with the required mode; skipping create
I0712 15:58:44.652054 184795 nvc_mount.c:230] mounting /dev/nvidiactl at /run/containerd/io.containerd.runtime.v2.task/k8s.io/6ce2aa648f86f11d1fcd3281bdc4a60eb56d8c267b1536b9e010f0fe9f
f75f79/rootfs/dev/nvidiactl
I0712 15:58:44.652500 184795 nvc_mount.c:230] mounting /dev/nvidia-uvm at /run/containerd/io.containerd.runtime.v2.task/k8s.io/6ce2aa648f86f11d1fcd3281bdc4a60eb56d8c267b1536b9e010f0fe9
ff75f79/rootfs/dev/nvidia-uvm
I0712 15:58:44.652794 184795 nvc_mount.c:230] mounting /dev/nvidia-uvm-tools at /run/containerd/io.containerd.runtime.v2.task/k8s.io/6ce2aa648f86f11d1fcd3281bdc4a60eb56d8c267b1536b9e01
0f0fe9ff75f79/rootfs/dev/nvidia-uvm-tools
E0712 15:58:44.653125 184795 utils.c:529] The path /run/containerd/io.containerd.runtime.v2.task/k8s.io/6ce2aa648f86f11d1fcd3281bdc4a60eb56d8c267b1536b9e010f0fe9ff75f79/rootfs/dev/nvid
ia4 alreay exists with the required mode; skipping create
I0712 15:58:44.653149 184795 nvc_mount.c:230] mounting /dev/nvidia4 at /run/containerd/io.containerd.runtime.v2.task/k8s.io/6ce2aa648f86f11d1fcd3281bdc4a60eb56d8c267b1536b9e010f0fe9ff7
5f79/rootfs/dev/nvidia4
I0712 15:58:44.653364 184795 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:e1:00.0 at /run/containerd/io.containerd.runtime.v2.task/k8s.io/6ce2aa648f86f11d1fcd3281bdc4a60eb56
d8c267b1536b9e010f0fe9ff75f79/rootfs/proc/driver/nvidia/gpus/0000:e1:00.0
E0712 15:58:44.653698 184795 utils.c:529] The path /run/containerd/io.containerd.runtime.v2.task/k8s.io/6ce2aa648f86f11d1fcd3281bdc4a60eb56d8c267b1536b9e010f0fe9ff75f79/rootfs/dev/nvid
ia2 alreay exists with the required mode; skipping create
I0712 15:58:44.653720 184795 nvc_mount.c:230] mounting /dev/nvidia2 at /run/containerd/io.containerd.runtime.v2.task/k8s.io/6ce2aa648f86f11d1fcd3281bdc4a60eb56d8c267b1536b9e010f0fe9ff7
5f79/rootfs/dev/nvidia2
I0712 15:58:44.653926 184795 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:01:00.0 at /run/containerd/io.containerd.runtime.v2.task/k8s.io/6ce2aa648f86f11d1fcd3281bdc4a60eb56
d8c267b1536b9e010f0fe9ff75f79/rootfs/proc/driver/nvidia/gpus/0000:01:00.0
I0712 15:58:44.654204 184795 nvc_ldcache.c:380] executing /sbin/ldconfig from host at /run/containerd/io.containerd.runtime.v2.task/k8s.io/6ce2aa648f86f11d1fcd3281bdc4a60eb56d8c267b153
6b9e010f0fe9ff75f79/rootfs
W0712 15:58:44.669243 184795 utils.c:121] /sbin/ldconfig: File /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 is empty, not checked.
W0712 15:58:44.669856 184795 utils.c:121] /sbin/ldconfig: File /usr/lib/x86_64-linux-gnu/libdxcore.so is empty, not checked.
W0712 15:58:44.670092 184795 utils.c:121] /sbin/ldconfig: File /usr/lib/x86_64-linux-gnu/libcuda.so.1 is empty, not checked.
W0712 15:58:44.670264 184795 utils.c:121] /sbin/ldconfig: File /usr/lib/x86_64-linux-gnu/libcuda.so is empty, not checked.
W0712 15:58:44.686872 184795 utils.c:121] /sbin/ldconfig: /usr/lib/x86_64-linux-gnu/libcuda.so.1 is not a symbolic link
W0712 15:58:44.686910 184795 utils.c:121]
W0712 15:58:44.687044 184795 utils.c:121] /sbin/ldconfig: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 is not a symbolic link
W0712 15:58:44.687079 184795 utils.c:121]
I0712 15:58:44.733141 184795 nvc.c:434] shutting down library context
I0712 15:58:44.733330 184804 rpc.c:95] terminating nvcgo rpc service
I0712 15:58:44.734373 184795 rpc.c:135] nvcgo rpc service terminated successfully
I0712 15:58:44.738402 184803 rpc.c:95] terminating driver rpc service
I0712 15:58:44.738727 184795 rpc.c:135] driver rpc service terminated successfully


elezar commented Jul 13, 2023

@Davidrjx could you check whether the .so.1 files that are being mentioned are in the docker image that you're trying to run?

If the nvidia-container-runtime is set as the default runtime in docker and used to build docker images, we have seen zero-size files present in the resulting images. Deleting these in the image before running it should address the issues.
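A sketch of that cleanup (library directory assumed, Debian/Ubuntu layout), e.g. as an extra build step or before committing the image:

# Find and remove the zero-byte driver stubs that were baked into the image by
# a build that ran under the NVIDIA runtime; the runtime will re-inject the
# real host libraries at run time.
find /usr/lib/x86_64-linux-gnu -maxdepth 1 -name 'lib*.so*' -size 0 -print -delete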

@Davidrjx
Copy link

Davidrjx commented Jul 13, 2023

@Davidrjx could you check whether the .so.1 files that are being mentioned are in the docker image that you're trying to run?

If the nvidia-container-runtime is set as the default runtime in docker and used to build docker images, we have seen zero-size files present in the resulting images. Deleting these in the image before running it should address the issues.

Searching the container run from an image built on WSL, I found a few files with a .so.1 suffix that seem weird, like:

/usr/lib/wsl/drivers/nvddig.inf_amd64_75655a2ea4e639cb/libnvidia-ml.so.1
/usr/lib/wsl/drivers/nvddig.inf_amd64_75655a2ea4e639cb/libnvidia-ptxjitcompiler.so.1
/usr/lib/wsl/drivers/nvddig.inf_amd64_75655a2ea4e639cb/libcuda.so.1.1
