
buildkit loses the binding of the nvidia driver libraries available on the host #2117

Open
rafraph opened this issue May 20, 2021 · 6 comments

Comments

@rafraph

rafraph commented May 20, 2021

Issue

When building a container that uses CUDA, the nvidia driver libraries available on the host should be bound into the container.
This works perfectly with the legacy build, but with buildkit the binding is lost in some cases.

Explanation

The file that loses its binding is "libcuda.so", which is the CUDA driver library.
libcuda.so is a symlink to libcuda.so.1, which is a symlink to libcuda.so.<version> (in my case libcuda.so.455.32.00).
This issue causes the linking to fail with this error:

 /usr/lib/x86_64-linux-gnu/libcuda.so: file not recognized: File truncated
 collect2: error: ld returned 1 exit status
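To make the symlink chain concrete, here is a small sketch (using a hypothetical scratch directory, not the real driver files) that rebuilds the chain and shows how a zero-byte target still resolves through the symlinks but is exactly what the linker trips over:

```shell
# Sketch: reproduce the libcuda.so symlink chain in a scratch directory.
dir=$(mktemp -d)
cd "$dir"

# The driver ships the real library under a versioned name...
printf 'fake driver payload' > libcuda.so.455.32.00
# ...and two symlinks point at it, exactly as on the host:
ln -s libcuda.so.455.32.00 libcuda.so.1
ln -s libcuda.so.1 libcuda.so

# Resolving the chain lands on the versioned file:
readlink -f libcuda.so

# If the bind mount is lost and the path is recreated as an empty file,
# the chain still resolves, but the target has size 0 -- which is what
# ld rejects as "File truncated":
: > libcuda.so.455.32.00
ls -lL libcuda.so
```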

When printing the file size during the build with the following command:
RUN ls -l /usr/lib/x86_64-linux-gnu/libcuda.so.455.32.00
the legacy build gives:
-rw-r--r-- 1 root root 21074296 Oct 14 2020 /usr/lib/x86_64-linux-gnu/libcuda.so.455.32.00
but buildkit gives a size of zero:
-rw-r--r-- 1 root root 0 Mar 1 15:18 /usr/lib/x86_64-linux-gnu/libcuda.so.455.32.00

Example

Dockerfile

FROM nvidia/cuda:11.1-cudnn8-devel-ubuntu18.04
RUN ls -l /usr/lib/x86_64-linux-gnu/libcuda.so.455.32.00
CMD ["/bin/bash"]

Command:
docker build --progress=plain -t test .

Result without buildkit:

Sending build context to Docker daemon  4.968GB
Step 1/3 : FROM nvidia/cuda:11.1-cudnn8-devel-ubuntu18.04
11.1-cudnn8-devel-ubuntu18.04: Pulling from nvidia/cuda
f22ccc0b8772: Already exists
3cf8fb62ba5f: Already exists
e80c964ece6a: Already exists
8a451ac89a87: Already exists
c563160b1f64: Already exists
596a46902202: Already exists
aa0805983180: Already exists
5718c3da35a0: Already exists
003637b0851a: Already exists
Digest: sha256:eaf9028c8becaaee2f0ad926fcd5edb80c7f937d58d6c5d069731f8f9afc2152
Status: Downloaded newer image for nvidia/cuda:11.1-cudnn8-devel-ubuntu18.04
 ---> a12c244542fe
Step 2/3 : RUN ls -l /usr/lib/x86_64-linux-gnu/libcuda.so.455.32.00
 ---> Running in fb3bff4aebb4
-rw-r--r-- 1 root root 21074296 Oct 14  2020 /usr/lib/x86_64-linux-gnu/libcuda.so.455.32.00
Removing intermediate container fb3bff4aebb4
 ---> 5613878680ba
Step 3/3 : CMD ["/bin/bash"]
 ---> Running in 06022742780f
Removing intermediate container 06022742780f
 ---> 180b8e423aa1
Successfully built 180b8e423aa1
Successfully tagged test:latest

Result with buildkit:

#1 [internal] load build definition from Dockerfile.test
#1 transferring dockerfile: 170B done
#1 DONE 0.1s

#2 [internal] load .dockerignore
#2 transferring context: 2B done
#2 DONE 0.1s

#3 [internal] load metadata for docker.io/nvidia/cuda:11.1-cudnn8-devel-ubu...
#3 DONE 0.7s

#4 [1/2] FROM docker.io/nvidia/cuda:11.1-cudnn8-devel-ubuntu18.04@sha256:ea...
#4 CACHED

#5 [2/2] RUN ls -l /usr/lib/x86_64-linux-gnu/libcuda.so.455.32.00
#5 0.273 ls: cannot access '/usr/lib/x86_64-linux-gnu/libcuda.so.455.32.00': No such file or directory
#5 ERROR: executor failed running [/bin/sh -c ls -l /usr/lib/x86_64-linux-gnu/libcuda.so.455.32.00]: runc did not terminate sucessfully
------
 > [2/2] RUN ls -l /usr/lib/x86_64-linux-gnu/libcuda.so.455.32.00:
------
failed to solve with frontend dockerfile.v0: failed to build LLB: executor failed running [/bin/sh -c ls -l /usr/lib/x86_64-linux-gnu/libcuda.so.455.32.00]: runc did not terminate sucessfully
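
As a side note, a diagnostic variant of the Dockerfile above (a sketch; the versioned filename is host-specific, so this avoids hard-coding it) would dump the whole symlink chain instead of one version:

```dockerfile
FROM nvidia/cuda:11.1-cudnn8-devel-ubuntu18.04
# List every libcuda artifact and resolve the symlink chain, so the build
# log shows whether the versioned target exists at all and what size it has.
# "|| true" keeps the build going so the diagnostics always appear in the log.
RUN ls -l /usr/lib/x86_64-linux-gnu/libcuda.so* || true
RUN readlink -f /usr/lib/x86_64-linux-gnu/libcuda.so || true
CMD ["/bin/bash"]
```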
@tonistiigi
Member

I'm not sure if you are reporting that you want to use nvidia runtime with buildkit or something else. Docker/Buildkit do not mount libraries from the host into the containers.

@rafraph
Author

rafraph commented May 20, 2021

I have been using the nvidia/cuda image as a base image for a long time. There were no problems with the legacy build. Now I want to move to buildkit, but the above issue happens.
From what I understand, as illustrated here and explained here, libnvidia-container takes the nvidia driver libraries available on the host and bind-mounts them into the container.
Do you think this is a problem of the nvidia image?

@rafraph
Author

rafraph commented May 30, 2021

@tonistiigi Do you think this is a problem of the nvidia image?

@KernelA

KernelA commented Dec 2, 2021

I have the same issue. It probably relates to the NVIDIA Container Toolkit.

At the moment it is possible to disable buildkit and build the image, but buildkit and the NVIDIA Container Toolkit do not support each other at the build stage.
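
For reference, the per-invocation workaround looks like this (a sketch; DOCKER_BUILDKIT=0 is the documented switch for the classic builder, which is itself deprecated, so treat this as a stopgap rather than a fix):

```shell
# Force builds in this shell through the legacy (non-buildkit) builder.
export DOCKER_BUILDKIT=0
echo "DOCKER_BUILDKIT=$DOCKER_BUILDKIT"

# The actual build then runs exactly as before:
# docker build --progress=plain -t test .
```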

@fnobis

fnobis commented Feb 23, 2023

The same problem still seems to exist with the newest docker/buildkit version.

My Dockerfile needs GPU support during docker build. The build process works when not using buildkit (DOCKER_BUILDKIT=0). With buildkit, an error message pops up saying the nvidia libraries are not found. Is there any fix for this, given that I also get the message that non-buildkit builds will be deprecated?

I found this interesting post NVIDIA/nvidia-docker#1268 (comment) on GitHub about the different nvidia docker integrations.
Could it be that this is related to the usage of nvidia-docker2? Is the Kubernetes support working fine now with the newest version of nvidia-container-toolkit?

@crazy-max
Member

> The same problem still seems to exist with the newest docker/buildkit version.
>
> My Dockerfile needs GPU support during docker build. The build process works when not using buildkit (DOCKER_BUILDKIT=0). With buildkit, an error message pops up saying the nvidia libraries are not found. Is there any fix for this, given that I also get the message that non-buildkit builds will be deprecated?
>
> I found this interesting post NVIDIA/nvidia-docker#1268 (comment) on GitHub about the different nvidia docker integrations. Could it be that this is related to the usage of nvidia-docker2? Is the Kubernetes support working fine now with the newest version of nvidia-container-toolkit?

Might be related to #1436


6 participants