
nvidia-container-toolkit: resolve nvidia-ctk static linking workaround #1525

Closed

Conversation

dchvs
Contributor

@dchvs dchvs commented Apr 10, 2024

Include CUDA libraries (tegra-libraries-cuda) and Go runtime dependencies in the NVIDIA Container Toolkit to resolve the static linking workaround that addresses the panic on nvidia-ctk startup. For further information, please refer to commit 971f014.
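For illustration, a minimal sketch of how such runtime dependencies could be expressed in a recipe append — the fragment below, including the go-runtime package name, is hypothetical and not the literal change in this PR:

    # nvidia-container-toolkit_%.bbappend (hypothetical sketch)
    # Pull in the Tegra CUDA libraries and the shared Go runtime so a
    # dynamically linked nvidia-ctk can resolve libcuda.so.1 and libstd.so.
    RDEPENDS:${PN} += "tegra-libraries-cuda go-runtime"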

Details about nvidia-ctk and its library linking after these changes:

root@jetson-agx-orin-devkit:~# nvidia-ctk --help 
NAME:
   NVIDIA Container Toolkit CLI - Tools to configure the NVIDIA Container Toolkit

USAGE:
   nvidia-ctk [global options] command [command options] [arguments...]

VERSION:
   1.11.0
commit: 670+d9de4a0

COMMANDS:
   hook     A collection of hooks that may be injected into an OCI spec
   runtime  A collection of runtime-related utilities for the NVIDIA Container Toolkit
   help, h  Shows a list of commands or help for one command

GLOBAL OPTIONS:
   --debug, -d    Enable debug-level logging (default: false) [$NVIDIA_CTK_DEBUG]
   --help, -h     show help (default: false)
   --version, -v  print the version (default: false)
root@jetson-agx-orin-devkit:~# 
root@jetson-agx-orin-devkit:~# ldd -d /usr/bin/nvidia-container-cli 
	linux-vdso.so.1 (0x0000ffffb4671000)
	libnvidia-container.so.1 => /usr/bin/../lib/libnvidia-container.so.1 (0x0000ffffb45d0000)
	libcap.so.2 => /usr/bin/../lib/libcap.so.2 (0x0000ffffb45a0000)
	libc.so.6 => /usr/bin/../lib/libc.so.6 (0x0000ffffb43f0000)
	/usr/lib/ld-linux-aarch64.so.1 (0x0000ffffb4634000)
	libtirpc.so.3 => /usr/lib/libtirpc.so.3 (0x0000ffffb43a0000)
	libelf.so.1 => /usr/lib/libelf.so.1 (0x0000ffffb4360000)
	libseccomp.so.2 => /usr/lib/libseccomp.so.2 (0x0000ffffb4320000)
	libpthread.so.0 => /usr/lib/libpthread.so.0 (0x0000ffffb42f0000)
	libz.so.1 => /usr/lib/libz.so.1 (0x0000ffffb42b0000)
root@jetson-agx-orin-devkit:~# 
root@jetson-agx-orin-devkit:~# ldd -d /usr/lib/libnvidia-container.so.0 
	linux-vdso.so.1 (0x0000ffff9956c000)
	libseccomp.so.2 => /usr/lib/libseccomp.so.2 (0x0000ffff994b0000)
	libcap.so.2 => /usr/lib/libcap.so.2 (0x0000ffff99480000)
	libelf.so.1 => /usr/lib/libelf.so.1 (0x0000ffff99440000)
	libtirpc.so.3 => /usr/lib/libtirpc.so.3 (0x0000ffff993f0000)
	libc.so.6 => /usr/lib/libc.so.6 (0x0000ffff99240000)
	/lib64/ld-linux-aarch64.so.1 => /usr/lib/ld-linux-aarch64.so.1 (0x0000ffff9952f000)
	libz.so.1 => /usr/lib/libz.so.1 (0x0000ffff99200000)
	libpthread.so.0 => /usr/lib/libpthread.so.0 (0x0000ffff991d0000)
root@jetson-agx-orin-devkit:~# 
root@jetson-agx-orin-devkit:~# ldd -d /usr/lib/libnvidia-container.so.1 
	linux-vdso.so.1 (0x0000ffffa6ad7000)
	libtirpc.so.3 => /usr/lib/libtirpc.so.3 (0x0000ffffa6a10000)
	libcap.so.2 => /usr/lib/libcap.so.2 (0x0000ffffa69e0000)
	libelf.so.1 => /usr/lib/libelf.so.1 (0x0000ffffa69a0000)
	libseccomp.so.2 => /usr/lib/libseccomp.so.2 (0x0000ffffa6960000)
	libc.so.6 => /usr/lib/libc.so.6 (0x0000ffffa67b0000)
	/lib64/ld-linux-aarch64.so.1 => /usr/lib/ld-linux-aarch64.so.1 (0x0000ffffa6a9a000)
	libpthread.so.0 => /usr/lib/libpthread.so.0 (0x0000ffffa6780000)
	libz.so.1 => /usr/lib/libz.so.1 (0x0000ffffa6740000)
root@jetson-agx-orin-devkit:~# 
root@jetson-agx-orin-devkit:~# ldd -d /usr/bin/nvidia-container-runtime 
	linux-vdso.so.1 (0x0000ffffac644000)
	libstd.so => /usr/lib/go/pkg/linux_arm64_dynlink/libstd.so (0x0000ffffa9730000)
	libcuda.so.1 => /usr/lib/libcuda.so.1 (0x0000ffffa80d0000)
	libc.so.6 => /usr/lib/libc.so.6 (0x0000ffffa7f20000)
	libnvrm_gpu.so => /usr/lib/libnvrm_gpu.so (0x0000ffffa7eb0000)
	libnvrm_mem.so => /usr/lib/libnvrm_mem.so (0x0000ffffa7e90000)
	libm.so.6 => /usr/lib/libm.so.6 (0x0000ffffa7de0000)
	libdl.so.2 => /usr/lib/libdl.so.2 (0x0000ffffa7db0000)
	librt.so.1 => /usr/lib/librt.so.1 (0x0000ffffa7d80000)
	libpthread.so.0 => /usr/lib/libpthread.so.0 (0x0000ffffa7d50000)
	libnvrm_sync.so => /usr/lib/libnvrm_sync.so (0x0000ffffa7d30000)
	libnvrm_host1x.so => /usr/lib/libnvrm_host1x.so (0x0000ffffa7d00000)
	/usr/lib/ld-linux-aarch64.so.1 (0x0000ffffac607000)
	libnvos.so => /usr/lib/libnvos.so (0x0000ffffa7cd0000)
	libnvsocsys.so => /usr/lib/libnvsocsys.so (0x0000ffffa7cb0000)
	libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x0000ffffa7a50000)
	libnvsciipc.so => /usr/lib/libnvsciipc.so (0x0000ffffa7a20000)
	libnvrm_chip.so => /usr/lib/libnvrm_chip.so (0x0000ffffa7a00000)
	libgcc_s.so.1 => /usr/lib/libgcc_s.so.1 (0x0000ffffa79c0000)
root@jetson-agx-orin-devkit:~# 

dchvs added 5 commits April 10, 2024 21:31
…seccomp and libtirpc

* Switch to libtirpc from tirpc126
To match libnvidia-container's dependency on libtirpc

* Leverage pkg-config for automated build-flag retrieval in 0002-OE-cross-build-fixups.patch
Use pkg-config in 0002-OE-cross-build-fixups.patch to streamline the retrieval of build flags within the
package build system, improving automation and maintainability (a sketch of this pattern follows the commit list)

* Refresh patches index

Signed-off-by: Daniel Chaves <dchvs11@gmail.com>
…mp and libtirpc

* Leverage pkg-config for automated build-flag retrieval
    Use pkg-config in the 0001-OE-cross-build-fixups.patch to
streamline the retrieval of build flags within the package build system, improving
automation and maintainability

* Refresh patches

Signed-off-by: Daniel Chaves <dchvs11@gmail.com>
… uses Poky's libtirpc version

Signed-off-by: Daniel Chaves <dchvs11@gmail.com>
… configuration

Since libnvidia-container is compiled with the flag WITH_NVCGO=no

Signed-off-by: Daniel Chaves <dchvs11@gmail.com>
…dependencies

* Add CUDA libraries (tegra-libraries-cuda) and Go runtime dependencies to fix
the static linking workaround that causes a panic on nvidia-ctk startup.
For more details, refer to commit OE4T@971f014

Signed-off-by: Daniel Chaves <dchvs11@gmail.com>
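The pkg-config pattern referenced in the commits above, shown as an illustrative Makefile fragment — a sketch of the general technique, not the literal patch contents:

    # Query compile and link flags from pkg-config instead of hard-coding
    # sysroot paths in the Makefile.
    CFLAGS += $(shell pkg-config --cflags libtirpc)
    LDLIBS += $(shell pkg-config --libs libtirpc)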
@madisongh
Member

There appears to be more to this than just trying to switch away from static linking.

For the static linking workaround, upstream OE-Core has disabled dynamic linking for all Go packages on all target architectures now anyway, so I'm not sure it's worth trying to resolve that particular problem.
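For context, that OE-Core change boils down to clearing GO_DYNLINK; a minimal sketch of forcing the same behavior from a distro or local configuration, assuming the GO_DYNLINK variable from OE-Core's goarch.bbclass:

    # local.conf sketch: link Go binaries statically, mirroring what the
    # OE-Core change did globally.
    GO_DYNLINK = ""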

Disabling cgroupsv2 support doesn't sound like a good idea, based on what I see in the issue that you linked to, so I'm not sure why we'd want to do this. Can you provide more info on why you made that particular change?

There was a reason why we used the older libtirpc for libnvidia-container-jetson - there were changes in libtirpc that the old NVIDIA code wasn't compatible with, causing exceptions. If that's been resolved somehow, great, but really what I'd rather see is an adaptation of the patches 01ba56b for the 1.11 version of the toolkit, so we can do away with using "legacy" mode and drop the libnvidia-container-jetson recipe completely.

@ichergui
Member

Hi @dchvs
Any update on this PR?
Or should we close it?

@dchvs
Contributor Author

dchvs commented Apr 26, 2024

Hi, guys

Disabling cgroupsv2 was necessary due to an error encountered when starting the container, similar if not identical to the one described in NVIDIA/nvidia-docker#1660 and NVIDIA/nvidia-docker#1660 (comment):

docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: container error: cgroup subsystem devices not found: unknown.

This issue might have another solution or explanation besides disabling cgroupsv2; I intend to investigate exactly what's missing.
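As a quick diagnostic — the devices cgroup subsystem exists only in the v1 hierarchy, so checking which hierarchy the target is running narrows this down:

    # Prints "cgroup2fs" on the unified (v2) hierarchy, "tmpfs" on legacy v1
    stat -fc %T /sys/fs/cgroup/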

Regarding the static linking workaround, I came across this "goarch: disable dynamic linking globally" patch: OpenEmbedded-Core patch link. However, it appears that this change was reverted last week: Revert patch link. Is this the same issue within OE-Core that you mentioned?

Regarding the statement that the older libtirpc was used for libnvidia-container-jetson due to compatibility issues: do you know of a way to test whether those exceptions are still present?

@madisongh
Member

On the cgroupsv2 problem, I saw those nvidia-docker issues, but it wasn't clear to me how they related. If you have more detail on how to reproduce the error message you are seeing, I'd like to see it, so we can get to the root cause. I've never run into that particular error myself.

Yep, I see that the shared-runtime changes got put back in OE-Core, heaven help us. It has never played well with cgo, though, in my experience.

As for the libtirpc, it was #760 that triggered the changes for using the older version. It sounds, though, like the underlying issue was a bug in the NVIDIA code that subsequently got fixed, so maybe we'll be OK with dropping them.

@madisongh
Member

OK, I can reproduce the cgroups problem. It looks like the libnvidia-container library (maybe both the main one and the legacy Jetson-specific one, I'm not sure) expects to see the legacy cgroupsv1 setup in sysfs, which systemd deprecated and turned off by default as of version 252. Adding systemd.unified_cgroup_hierarchy=off to your kernel command line still re-enables it, though, even with the version of systemd currently in scarthgap and master. That's probably a better solution than turning off cgroups support for containers.
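As an example — a sketch assuming an extlinux-based Jetson boot flow (adjust for your bootloader):

    # Append the option to the kernel command line, then reboot.
    sed -i '/^[[:space:]]*APPEND/ s/$/ systemd.unified_cgroup_hierarchy=off/' /boot/extlinux/extlinux.conf
    # After reboot, confirm it took effect:
    cat /proc/cmdline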

@madisongh
Member

@dchvs Could you take a look at #1541 , which eliminates the libnvidia-container-jetson library completely and allows containers to run with cgroupsv2, so you don't have to disable cgroup support (or the unified cgroup hierarchy in systemd)? The old version of libtirpc is also dropped there, since it was used only for the libnvidia-container-jetson stuff.

The PR still links the go binaries statically, but otherwise should solve the problems you were seeing.

@dchvs
Contributor Author

dchvs commented Apr 29, 2024

Sure, will do!

@madisongh
Member

See #1541

@madisongh madisongh closed this May 14, 2024