fix: glibc search paths for nvidia #421

Merged: 1 commit, Jun 24, 2024

@frezbo frezbo commented Jun 24, 2024

Set glibc/lib as the first rpath for nvidia-container-cli. Also install nvidia libraries to /usr/local/glibc/lib so that any musl libraries live separately.

nvidia-container-cli explicitly sets an RPATH of $ORIGIN/../$LIB (see https://gitlab.com/nvidia/container-toolkit/libnvidia-container/-/blob/v1.14.6/Makefile?ref_type=tags#L183 ). This means /usr/local/lib would be searched first. Since zfs and nvidia each ship their own libtirpc, nvidia-container-cli first tries to use the libtirpc shipped with zfs at /usr/local/lib instead of the one at /usr/local/glibc/lib. Fix this by setting an additional RPATH of $ORIGIN/../glibc/$LIB, so that libraries in /usr/local/glibc/lib take precedence.

❯ scanelf -r _out/rootfs/rootfs/usr/local/bin/nvidia-container-cli
 TYPE   RPATH FILE
ET_DYN $ORIGIN/../glibc/$LIB:$ORIGIN/../$LIB _out/rootfs/rootfs/usr/local/bin/nvidia-container-cli
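
The lookup order implied by that RPATH can be sketched with a small shell helper. `expand_rpath` is a hypothetical name used only for illustration; it is not part of scanelf or glibc. It substitutes `$ORIGIN` and `$LIB` textually and assumes `$LIB` resolves to `lib`, as the `/usr/local/glibc/lib` paths in this PR suggest. The dynamic loader performs the real expansion, so this only mimics the ordering.

```shell
#!/bin/sh
# Hypothetical helper: print the directories named by an RPATH, in lookup order.
#   $1: RPATH string as reported by scanelf
#   $2: directory containing the binary (stands in for $ORIGIN)
#   $3: value assumed for $LIB ("lib" is an assumption based on the PR paths)
expand_rpath() {
    printf '%s\n' "$1" | tr ':' '\n' |
        sed -e 's|[$]ORIGIN|'"$2"'|g' -e 's|[$]LIB|'"$3"'|g'
}

# RPATH of nvidia-container-cli after this change.
expand_rpath '$ORIGIN/../glibc/$LIB:$ORIGIN/../$LIB' /usr/local/bin lib
```

The first line printed is the glibc directory (`/usr/local/bin/../glibc/lib`), followed by `/usr/local/bin/../lib`, which matches the intent that libraries under `/usr/local/glibc/lib` are tried before the ones under `/usr/local/lib`.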

Properly fixes: #380

Fixes from #401 and #410 were not complete.

Manually tested by spinning up an NVIDIA worker in AWS.

@frezbo frezbo force-pushed the fix/nvidia-add-new-search-path branch from cafe267 to 24c76e0 Compare June 24, 2024 08:36
@@ -1,6 +1 @@
# libc default configuration

This makes sure glibc doesn't try to load non-glibc libs.

@@ -51,7 +51,7 @@ steps:
cd libnvidia-container

# LDLIBS=-L/usr/local/glibc/lib is set so that libnvidia-container-cli libs which are hardcoded as -llibname and not using pkg-config
-CPPFLAGS="-I/usr/local/glibc/include/tirpc" LDLIBS="-L/usr/local/glibc/lib -ltirpc -lelf -lseccomp" make
+CPPFLAGS="-I/usr/local/glibc/include/tirpc" LDLIBS="-L/usr/local/glibc/lib -ltirpc -lelf -lseccomp" LDFLAGS='-Wl,--rpath=\$$ORIGIN/../glibc/\$$LIB' make

This is the actual fix needed

cd NVIDIA-Linux-*

./nvidia-installer --silent \
--opengl-prefix=/rootfs/usr/local \
--utility-prefix=/rootfs/usr/local \
--utility-libdir=glibc/lib \

For some reason nvidia-container-cli can't find libs stored under /usr/local/lib, so we explicitly keep them under the custom location at /usr/local/glibc/lib.
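
Since the underlying problem was two copies of the same library (`libtirpc`) living in different directories, a quick check for such overlaps may be useful. The sketch below is only an illustration: `shadowed_libs` is a hypothetical helper, not something this repository ships, and it compares file names only, without inspecting which libc each library is linked against.

```shell
#!/bin/sh
# Hypothetical helper: list library file names present in both directories.
# For any name printed, RPATH order decides which copy the loader picks up.
#   $1: first directory (e.g. the glibc-specific lib dir)
#   $2: second directory (e.g. the shared /usr/local/lib)
shadowed_libs() {
    for f in "$1"/*.so*; do
        [ -e "$f" ] || continue   # glob did not match anything
        name=${f##*/}             # strip the directory part
        [ -e "$2/$name" ] && printf '%s\n' "$name"
    done
    return 0
}

# Example call with the two directories discussed in this PR;
# prints nothing if the directories do not exist or share no names.
shadowed_libs /usr/local/glibc/lib /usr/local/lib
```

Run against the built rootfs, an empty result would suggest no same-named libraries remain in both locations; any name listed is a candidate for the kind of mix-up described above.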

Set `glibc/lib` as the first `rpath` for `nvidia-container-cli`. Also
install nvidia libraries to `/usr/local/glibc/lib` so that any musl
libraries live separately.

`nvidia-container-cli` explicitly sets an `RPATH` of `$ORIGIN/../$LIB` here:
https://gitlab.com/nvidia/container-toolkit/libnvidia-container/-/blob/v1.14.6/Makefile?ref_type=tags#L183
This means `/usr/local/lib` would be searched first. Since `zfs` and
nvidia each ship their own `libtirpc`, `nvidia-container-cli` first tries
to use the `libtirpc` shipped with `zfs` at `/usr/local/lib` instead of
the one at `/usr/local/glibc/lib`. Fix this by setting an additional
`RPATH` of `$ORIGIN/../glibc/$LIB`, so that libraries in
`/usr/local/glibc/lib` take precedence.

```bash
❯ scanelf -r _out/rootfs/rootfs/usr/local/bin/nvidia-container-cli
 TYPE   RPATH FILE
ET_DYN $ORIGIN/../glibc/$LIB:$ORIGIN/../$LIB _out/rootfs/rootfs/usr/local/bin/nvidia-container-cli
```

Properly fixes: siderolabs#380

Fixes from siderolabs#401 and siderolabs#410 were not complete.

Manually tested by spinning up an NVIDIA worker in AWS.

Signed-off-by: Noel Georgi <git@frezbo.dev>
@frezbo frezbo force-pushed the fix/nvidia-add-new-search-path branch from 24c76e0 to 5334e89 Compare June 24, 2024 08:45
@utkuozdemir utkuozdemir left a comment

🆒

frezbo commented Jun 24, 2024

/m

@talos-bot talos-bot merged commit 5334e89 into siderolabs:main Jun 24, 2024
14 checks passed
@frezbo frezbo deleted the fix/nvidia-add-new-search-path branch June 24, 2024 09:39
Successfully merging this pull request may close these issues.

zfs breaks nvidia-container-toolkit