
Container support broken on master #760

Closed
madisongh opened this issue Jul 23, 2021 · 2 comments · Fixed by #763

@madisongh (Member)

Describe the bug
The nvidia-container-toolkit program is crashing with a segmentation fault when trying to start a container.

The segfault is happening during teardown of the RPC communication it uses, which appears to be due to the newer libtirpc version (1.3.2) in OE-Core master. Replacing the use of that version with a statically-linked copy of the libtirpc pulled from OE-Core dunfell eliminates the segfault, but setup still fails with:

docker: Error response from daemon: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver client creation failed: RPC: Remote system error - Cannot allocate memory: unknown.

To Reproduce
Steps to reproduce the behavior:

  1. Use tegra-demo-distro, branch master
  2. Build demo-image-full
  3. Load onto target (tested with Xavier NX devkit)
  4. Try docker run --net=host --runtime nvidia --rm --ipc=host --cap-add SYS_PTRACE -e DISPLAY=$DISPLAY -it nvcr.io/nvidia/l4t-base:r32.5.0
@madisongh (Member, Author) commented Jul 25, 2021

This is due to libtirpc trying to allocate arrays sized from the fd table limit, which has grown from thousands to billions of entries. It also appears that libtirpc isn't properly handling memory allocation failures in some of its code paths, leading to the segmentation faults.

You can work around the problem by explicitly passing --ulimit nofile=1024:4096 (or some other more reasonable limits) on the docker run command line. #763 instead patches the version of the RPC library statically linked into the container tools to cap the array sizes at 1K, working around the problem without requiring the extra flag.

(EDIT: The workaround mentioned above worked for me with the original upstream patch to libtirpc applied. You might be able to make it work without patching libtirpc at all by also setting your own process's ulimit -H 4096, but I haven't actually tested this.)
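To make the mechanism concrete, here is a minimal Python sketch (an illustration only, not libtirpc's actual C code; the function name fd_array_slots is made up here, and the 1024 cap mirrors the 1K limit #763 applies):

```python
# Illustrative sketch of the fd-table sizing problem described above.
# Assumption: libtirpc sizes per-fd arrays from the soft RLIMIT_NOFILE;
# this Python stand-in shows why capping that value avoids huge allocations.
import resource

FD_ARRAY_CAP = 1024  # mirrors the ~1K cap applied by the #763 patch


def fd_array_slots(cap=FD_ARRAY_CAP):
    """Return a capped number of per-fd slots to allocate.

    Sizing arrays directly from the soft limit is what goes wrong: on newer
    systems the limit can be in the billions, so the allocation fails with
    ENOMEM -- or, where the failure is unchecked, leads to a segfault.
    """
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return cap
    return min(soft, cap)


print(fd_array_slots())
```

This is also why the docker run --ulimit workaround helps: it lowers the soft nofile limit seen by the process inside the container, so the arrays sized from it stay small even without the patch.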

@ichergui
Member

Hey @madisongh

I tried Docker and got the same issue as you.

  • Here are the logs:
root@jetson-tx2-devkit:~# docker run -it --rm --runtime nvidia --network host nvcr.io/nvidia/l4t-ml:r32.5.0-py3
docker: Error response from daemon: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver client creation failed: RPC: Remote system error - Cannot allocate memory: unknown.
root@jetson-tx2-devkit:~# 

To Reproduce
Steps to reproduce the behavior:

  1. Use tegra-demo-distro, branch master
  2. Build demo-image-full
  3. Load onto target (tested with Jetson TX2 devkit)
  4. Try the following commands
# docker pull nvcr.io/nvidia/l4t-ml:r32.5.0-py3
# docker run -it --rm --runtime nvidia --network host nvcr.io/nvidia/l4t-ml:r32.5.0-py3
