CgroupV2 support #111

Closed
luodw opened this issue Oct 14, 2020 · 13 comments

Comments

luodw commented Oct 14, 2020

Recently, I tested containerd + nvidia-container-runtime on kernel 5.4 with cgroup v2, but I found that nvidia-container-cli cannot run successfully because of the errors shown below:
[screenshot of the nvidia-container-cli error output]

Is it planned to support cgroupv2?

lissyx commented Jan 14, 2021

As highlighted in NVIDIA/nvidia-docker#1447, this now breaks a default Debian sid/testing install, since at least /sys/fs/cgroup/devices no longer exists there.
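
For anyone checking whether their own machine is affected, a common way to see which cgroup hierarchy is mounted (standard tooling, not something from this thread) is:

stat -fc %T /sys/fs/cgroup/    # "cgroup2fs" = unified v2 hierarchy, "tmpfs" = v1/hybrid
ls -d /sys/fs/cgroup/devices   # present on v1/hybrid; missing on pure v2, which is what libnvidia-container trips over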

klueska (Contributor) commented Jan 14, 2021

@lissyx Thank you for pointing out the crux of the issue.
We are in the process of rearchitecting the nvidia container stack in such a way that issues such as this should not exist in the future (because we will rely on runc (or whatever the configured container runtime is) to do all cgroup setup instead of doing it ourselves).

That said, this rearchitecting effort will take at least another 9 months to complete. I'm curious what the impact is (and how difficult it would be to add cgroupsv2 support to libnvidia-container in the meantime to prevent issues like this until the rearchitecting is complete).

lissyx commented Jan 14, 2021

> @lissyx Thank you for pointing out the crux of the issue.
> We are in the process of rearchitecting the nvidia container stack in such a way that issues such as this should not exist in the future (because we will rely on runc (or whatever the configured container runtime is) to do all cgroup setup instead of doing it ourselves).
>
> That said, this rearchitecting effort will take at least another 9 months to complete. I'm curious what the impact is (and how difficult it would be to add cgroupsv2 support to libnvidia-container in the meantime to prevent issues like this until the rearchitecting is complete).

I have no idea; my knowledge of cgroups is really limited, and I only investigated because I ran into the issue and it was blocking me. For the moment I'm relying on the systemd parameter to switch back to the hybrid hierarchy, as documented in the other issue, but I have no idea how solid that will be over time. I guess that since the officially supported versions are only the stable releases (Debian 10, Ubuntu 18.04, etc.), it's only fair to wait, and proper support on sid/testing is a nice-to-have rather than something that can be expected.
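
For reference, the systemd switch mentioned above is the systemd.unified_cgroup_hierarchy kernel parameter. A minimal sketch of that workaround, assuming a GRUB-based system (file names and the exact edit may differ per distro):

# /etc/default/grub -- append the parameter to the existing kernel command line, e.g.:
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet splash systemd.unified_cgroup_hierarchy=0"
sudo update-grub    # or: sudo grub2-mkconfig -o /boot/grub2/grub.cfg on RPM-based distros
sudo reboot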

eengstrom commented

So, I've seen two very different approaches to work around the current lack of cgroup v2 support:

1. Boot the kernel back onto the hybrid/legacy cgroup hierarchy via the systemd.unified_cgroup_hierarchy kernel parameter.
2. Disable cgroup handling in libnvidia-container (no-cgroups = true) and pass the NVIDIA device nodes to the container manually.

The latter is simpler (no kernel param mods and reboot) but I've found very little discussion of the rationale for choosing one option over the other. Any advice?

In case it matters, I'm doing this to get rootless containers (docker-rootless, version 20.10.3) running under Ubuntu 18.04 (kernel 4.15.0-135-generic) with libnvidia-container version 1.3.1.

lissyx commented Feb 19, 2021

> The latter is simpler (no kernel param mods and reboot) but I've found very little discussion of the rationale for choosing one option over the other. Any advice?

Speaking for myself, disabling cgroups would have required quite a few permission changes that I was too lazy to maintain.

simcop2387 commented

I'd also like to mention that this now affects anyone running buster-backports as well.

DimanNe commented Oct 2, 2021

@klueska
nvidia-docker stopped working on the Kubuntu 21.10 beta (which is going to be released in a couple of weeks):

docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
docker: Error response from daemon: OCI runtime create failed: container_linux.go:367: starting container process caused: process_linux.go:495: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: container error: cgroup subsystem devices not found: unknown.

DimanNe commented Oct 2, 2021

As a temporary workaround, this helps:

In /etc/nvidia-container-runtime/config.toml set:
no-cgroups = true

Then append your devices to the docker run command: --device /dev/nvidia0 --device /dev/nvidia1 --device /dev/nvidiactl --device /dev/nvidia-modeset --device /dev/nvidia-uvm
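
Putting the two steps together, the workaround looks roughly like this; the device list is just an example and will vary per machine (for instance /dev/nvidia1 only exists if there is a second GPU):

# 1. In /etc/nvidia-container-runtime/config.toml, disable cgroup handling:
#      no-cgroups = true
# 2. Pass the NVIDIA device nodes explicitly, since nothing sets up
#    device-cgroup access for the container anymore:
docker run --rm --gpus all \
  --device /dev/nvidia0 --device /dev/nvidiactl \
  --device /dev/nvidia-modeset --device /dev/nvidia-uvm \
  nvidia/cuda:11.0-base nvidia-smi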

klueska (Contributor) commented Dec 8, 2021

We now have an RC of libnvidia-container out that adds support for cgroupv2.

If you would like to try it out, make sure to add the experimental repo to your package sources and install the latest packages:

For DEBs

sudo sed -i -e '/experimental/ s/^#//g' /etc/apt/sources.list.d/libnvidia-container.list
sudo apt-get update
sudo apt-get install -y libnvidia-container-tools libnvidia-container1

For RPMs

sudo yum-config-manager --enable libnvidia-container-experimental
sudo yum install -y libnvidia-container-tools libnvidia-container1
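
After installing the experimental packages, a quick sanity check (using the same image tag as earlier in the thread) is to confirm the library version and re-run the command that previously failed:

nvidia-container-cli --version    # should report the 1.8.0 RC
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi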

PhilippHomann commented

Is there an estimate for when v1.8.0 will be released?

elezar (Member) commented Jan 27, 2022

@PhilippHomann we are preparing rc.2 with some additional fixes. Once that is out, we should be able to get started on the final 1.8.0 release, but we don't have a specific timeline yet.

klueska (Contributor) commented Jan 28, 2022

libnvidia-container 1.8.0-rc.2 is now live, with some minor updates that fix edge cases around cgroup v2 support.
Assuming you followed the instructions above, a simple package update followed by an install should give you the latest.

klueska (Contributor) commented Feb 4, 2022

libnvidia-container 1.8.0 with cgroup v2 support is now GA.

Release notes here:
https://github.com/NVIDIA/libnvidia-container/releases/tag/v1.8.0
