zfs breaks nvidia-container-toolkit #380
Comments
This is interesting though: zfs should be linked against the musl libc, so I guess the zfs extension ships some extra libc, which it shouldn't.
This is due to both nvidia and zfs needing libtirpc: the zfs libtirpc is linked against musl, while the nvidia one is linked against glibc. These two extensions are mutually incompatible with each other for now.
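To make the musl versus glibc distinction above concrete, here is a minimal sketch of how one might check which libc a given `libtirpc` build depends on. It only parses `readelf -d`-style text; the function name `classify_libc` is hypothetical, and the soname patterns are assumptions (`libc.so.6` for glibc, `libc.so` or `libc.musl-*` for musl), not something confirmed in this thread.

```bash
#!/usr/bin/env bash
# Sketch only: classify a library by the libc soname in its NEEDED entries.
# classify_libc is a hypothetical helper; it reads `readelf -d` output on stdin.
# Assumed sonames: glibc -> libc.so.6, musl -> libc.so or libc.musl-<arch>.so.1
classify_libc() {
  local line result=unknown
  while IFS= read -r line; do
    case $line in
      *'(NEEDED)'*'[libc.so.6]'*) result=glibc ;;
      *'(NEEDED)'*'[libc.so]'*|*'(NEEDED)'*'[libc.musl-'*) result=musl ;;
    esac
  done
  printf '%s\n' "$result"
}

# Demo with canned lines in the format `readelf -d` prints.
glibc_sample=' 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]'
musl_sample=' 0x0000000000000001 (NEEDED)             Shared library: [libc.so]'
printf '%s\n' "$glibc_sample" | classify_libc
printf '%s\n' "$musl_sample" | classify_libc
```

Against a real file this would be used as `readelf -d <path to libtirpc> | classify_libc`, assuming `readelf` is available on the machine inspecting the extension image.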
- Disable stargz-snapshotter: siderolabs/extensions#245
- Disable zfs: siderolabs/extensions#380
As heavy zfs users and soon-to-be nvidia extension users as well, we would be glad to help solve this one.
@frezbo what do you think would be the best way to fix this? Build and use glibc in the zfs extension build as well (based on https://github.com/siderolabs/extensions/blob/main/nvidia-gpu/nvidia-container-toolkit/glibc/pkg.yaml for instance) instead of musl? Or somehow trick the build process of https://github.com/siderolabs/extensions/blob/main/nvidia-gpu/nvidia-container-toolkit/nvidia-container-cli/libtirpc/pkg.yaml into using musl instead of glibc?
From the libtirpc source package, the autoconf snippet handling libc detection is located in
We do not want to do this; it makes the build steps crazy hard to follow.
Nvidia has a strict requirement for glibc, and musl will not work at all. What we were discussing internally is to set up some
Looks like a lot of hassle Indeed...
Yeah... i'm not so surprised about that.
Like an alternate $LD_LIBRARY_PATH? What about using https://github.com/NixOS/patchelf? See https://stackoverflow.com/a/44710599 or https://www.baeldung.com/linux/multiple-glibc
I want to avoid patchelf. At some point the whole nvidia thing was using a lot of patchelf to fix stuff, and it just makes it harder for others and for us internally to keep things up to date. I'll be looking at this issue next week to see what would work well for us and be easy to maintain; I don't want to make it more complicated.
Fair enough :-)
Use a custom path for libtirpc shipped with zfs-tools so that it doesn't conflict with libtirpc built for nvidia-container-toolkit (as it's linked against glibc).
Fixes: siderolabs#380
Signed-off-by: Noel Georgi <git@frezbo.dev>
@frezbo Will this fix be possible on 1.7 or will we have to wait for 1.8?
Sorry, probably only for 1.8. I found other issues and am fixing them, and I don't want to backport such major fixes.
Set `glibc/lib` as first `rpath` for `nvidia-container-cli`. Also install nvidia libraries to `/usr/local/glibc/lib` so any musl libraries live separately.
Properly fixes: siderolabs#380
Fixes from siderolabs#401 and siderolabs#410 were not complete.
Manually tested by spinning up a NVIDIA worker in AWS.
Signed-off-by: Noel Georgi <git@frezbo.dev>
Set `glibc/lib` as first `rpath` for `nvidia-container-cli`. Also install nvidia libraries to `/usr/local/glibc/lib` so any musl libraries live separately.

`nvidia-container-cli` explicitly sets an `RPATH` as `$ORIGIN/../$LIB` here: https://gitlab.com/nvidia/container-toolkit/libnvidia-container/-/blob/v1.14.6/Makefile?ref_type=tags#L183. This means `/usr/local/lib` would be searched first. Since `zfs` and nvidia each ship their own `libtirpc`, `nvidia-container-cli` first tries to use the `libtirpc` shipped with `zfs` at `/usr/local/lib` instead of the one at `/usr/local/glibc/lib`.

Fix this by setting an additional `RPATH` as `$ORIGIN/../glibc/$LIB`, so that libraries in `/usr/local/glibc/lib` have higher preference.

```bash
❯ scanelf -r _out/rootfs/rootfs/usr/local/bin/nvidia-container-cli
 TYPE   RPATH FILE
ET_DYN $ORIGIN/../glibc/$LIB:$ORIGIN/../$LIB _out/rootfs/rootfs/usr/local/bin/nvidia-container-cli
```

Properly fixes: siderolabs#380
Fixes from siderolabs#401 and siderolabs#410 were not complete.
Manually tested by spinning up a NVIDIA worker in AWS.
Signed-off-by: Noel Georgi <git@frezbo.dev>
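The search-order argument in this commit message can be illustrated with a small simulation. This is a sketch, not the dynamic linker: the hypothetical helper `resolve_lib` walks a colon-separated `RPATH`, expands `$ORIGIN` to the binary's directory, assumes `$LIB` expands to `lib`, and returns the first entry that contains the named file. The file name `libtirpc.so.3` and the temporary directory layout are illustrative assumptions.

```bash
#!/usr/bin/env bash
# Sketch only: mimic first-match lookup over an RPATH list.
# resolve_lib RPATH ORIGIN LIBNAME  (hypothetical helper, not a real tool)
resolve_lib() {
  local rpath=$1 origin=$2 name=$3 entry dir
  local IFS=:
  for entry in $rpath; do
    dir=${entry//\$ORIGIN/$origin}   # $ORIGIN -> directory of the binary
    dir=${dir//\$LIB/lib}            # assumption: $LIB expands to "lib"
    if [ -e "$dir/$name" ]; then
      printf '%s\n' "$dir/$name"
      return 0
    fi
  done
  return 1
}

# Fake rootfs with two copies of the library, as in the conflict described above.
root=$(mktemp -d)
mkdir -p "$root/usr/local/bin" "$root/usr/local/lib" "$root/usr/local/glibc/lib"
touch "$root/usr/local/lib/libtirpc.so.3" "$root/usr/local/glibc/lib/libtirpc.so.3"
origin="$root/usr/local/bin"

# Original RPATH: only $ORIGIN/../$LIB, so the copy in usr/local/lib wins.
old=$(resolve_lib '$ORIGIN/../$LIB' "$origin" libtirpc.so.3)
# Patched RPATH: the glibc directory is listed first and takes precedence.
new=$(resolve_lib '$ORIGIN/../glibc/$LIB:$ORIGIN/../$LIB' "$origin" libtirpc.so.3)
echo "before: $old"
echo "after:  $new"
```

With the single-entry `RPATH` the lookup lands on the copy under `usr/local/lib`; with the glibc entry prepended it lands on the copy under `usr/local/glibc/lib`, which is the ordering the `scanelf` output reports for the patched binary.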
Both extensions use the shared library `libc.so`. If both are present on the target host, nvidia-device-plugin crashes with an error:
Environment
/fyi @kvaps