
Container support broken on master #760

Closed
madisongh opened this issue Jul 23, 2021 · 2 comments · Fixed by #763

@madisongh (Member)

Describe the bug
The nvidia-container-toolkit program is crashing with a segmentation fault when trying to start a container.

The segfault is happening during teardown of the RPC communication it uses, which appears to be due to the newer libtirpc version (1.3.2) in OE-Core master. Replacing the use of that version with a statically-linked copy of the libtirpc pulled from OE-Core dunfell eliminates the segfault, but setup still fails with:

docker: Error response from daemon: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver client creation failed: RPC: Remote system error - Cannot allocate memory: unknown.

To Reproduce
Steps to reproduce the behavior:

  1. Use tegra-demo-distro, branch master
  2. Build demo-image-full
  3. Load onto target (tested with Xavier NX devkit)
  4. Try docker run --net=host --runtime nvidia --rm --ipc=host --cap-add SYS_PTRACE -e DISPLAY=$DISPLAY -it nvcr.io/nvidia/l4t-base:r32.5.0
@madisongh (Member, Author) commented Jul 25, 2021

This is due to libtirpc trying to allocate arrays sized from the fd table limit, which has grown from thousands to billions of entries. It also appears that libtirpc isn't properly handling memory allocation failures in some of its code paths, leading to the segmentation faults.

You can work around the problem by explicitly passing --ulimit nofile=1024:4096 (or some other more reasonable limits) on the docker run command line. #763 instead patches the version of the RPC library statically linked into the container tools to cap the array sizes at 1K, working around the problem without requiring the extra flag.

(EDIT: The workaround mentioned above worked for me with the original upstream patch to libtirpc applied. You might be able to make it work without patching libtirpc at all by also setting your own process's ulimit -H 4096, but I haven't actually tested this.)
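To make the mechanism concrete, here is a minimal Python sketch (an illustration only, not libtirpc's actual C code; the function name fd_array_slots is made up here, and the 1024 cap mirrors the 1K limit #763 applies):

```python
# Illustrative sketch of the fd-table sizing problem described above.
# Assumption: libtirpc sizes per-fd arrays from the soft RLIMIT_NOFILE;
# this Python stand-in shows why capping that value avoids huge allocations.
import resource

FD_ARRAY_CAP = 1024  # mirrors the ~1K cap applied by the #763 patch


def fd_array_slots(cap=FD_ARRAY_CAP):
    """Return a capped number of per-fd slots to allocate.

    Sizing arrays directly from the soft limit is what goes wrong: on newer
    systems the limit can be in the billions, so the allocation fails with
    ENOMEM -- or, where the failure is unchecked, leads to a segfault.
    """
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return cap
    return min(soft, cap)


print(fd_array_slots())
```

This is also why the docker run --ulimit workaround helps: it lowers the soft nofile limit seen by the process inside the container, so the arrays sized from it stay small even without the patch.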

@ichergui
Member

Hey @madisongh

I tried Docker and got the same issue as you.

  • Here are the logs:
root@jetson-tx2-devkit:~# docker run -it --rm --runtime nvidia --network host nvcr.io/nvidia/l4t-ml:r32.5.0-py3
docker: Error response from daemon: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver client creation failed: RPC: Remote system error - Cannot allocate memory: unknown.
root@jetson-tx2-devkit:~# 

To Reproduce
Steps to reproduce the behavior:

  1. Use tegra-demo-distro, branch master
  2. Build demo-image-full
  3. Load onto target (tested with Jetson TX2 devkit)
  4. Try the following commands
# docker pull nvcr.io/nvidia/l4t-ml:r32.5.0-py3
# docker run -it --rm --runtime nvidia --network host nvcr.io/nvidia/l4t-ml:r32.5.0-py3
