This repository has been archived by the owner on Jan 22, 2024. It is now read-only.

debian11 #1549

Closed
chaiyd opened this issue Sep 22, 2021 · 65 comments

Comments

@chaiyd

chaiyd commented Sep 22, 2021

  • Please add support for Debian 11.
@chaiyd
Author

chaiyd commented Sep 23, 2021

  • Debian 11 uses systemd with cgroups v2 by default.
  • If Docker is run without privileged mode, containers report Failed to initialize NVML: Unknown Error.

@redskinhu

I tried to install on Debian 11 but I got this error message:

# Unsupported distribution!
# Check https://nvidia.github.io/nvidia-docker

I checked, and Debian 11 is not supported yet.

When can we expect it?

Linux 5.10.0-9-amd64 SMP Debian 5.10.70-1 (2021-09-30) x86_64 GNU/Linux
Docker: 20.10.9
NV: 470.57.02-1

@klueska
Contributor

klueska commented Oct 11, 2021

The main blocker at the moment for debian11 is cgroupv2 support. We have two ongoing efforts for this at the moment:

  1. We are rearchitecting the NVIDIA container stack so that most of the functionality around injecting libs/devices/binaries can happen in a shim above the OCI runtime. At which point, we get cgroupv2 for free.

  2. Direct cgroupv2 support in libnvidia-container, which is proving to be more involved than originally hoped and is currently on pause for the moment: https://gitlab.com/nvidia/container-toolkit/libnvidia-container/-/merge_requests/96

If you need nvidia-docker to work on debian11 now you can:

  1. Install it as if it were coming from debian10, i.e.:
distribution=$(. /etc/os-release;echo debian10)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list
  2. Disable cgroup support in nvidia-docker, i.e.:
$ cat /etc/nvidia-container-runtime/config.toml
...
no-cgroups = true
...
  3. Manually inject any nvidia device nodes you need access to on the docker command line, e.g.:
docker run --runtime=nvidia \
    --gpus '"device=0,1"' \
    --device /dev/nvidiactl \
    --device /dev/nvidia0 \
    --device /dev/nvidia1 \
    ubuntu:20.04 \
    nvidia-smi

Note, if you are running with Kubernetes, the equivalent of --device ... is handled for you if you run the k8s-device-plugin with the --pass-device-specs flag: https://github.com/NVIDIA/k8s-device-plugin/blob/master/cmd/nvidia-device-plugin/main.go#L61
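
For reference, a minimal end-to-end sketch of steps 2 and 3 above, assuming the stock config.toml layout (it normally ships the key commented out as #no-cgroups = false) and example device paths:

sudo sed -i 's/^#no-cgroups = false/no-cgroups = true/' /etc/nvidia-container-runtime/config.toml
sudo systemctl restart docker
docker run --rm --runtime=nvidia --gpus '"device=0"' \
    --device /dev/nvidiactl \
    --device /dev/nvidia-uvm \
    --device /dev/nvidia0 \
    ubuntu:20.04 nvidia-smi

(/dev/nvidia-uvm is only needed by actual CUDA workloads, not by nvidia-smi itself.)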

@chaiyd
Author

chaiyd commented Oct 12, 2021

> The main blocker at the moment for debian11 is cgroupv2 support. We have two ongoing efforts for this at the moment:
>
>   1. We are rearchitecting the NVIDIA container stack so that most of the functionality around injecting libs/devices/binaries can happen in a shim above the OCI runtime. At which point, we get cgroupv2 for free.
>   2. Direct cgroupv2 support in libnvidia-container, which is proving to be more involved than originally hoped and is currently on pause for the moment: https://gitlab.com/nvidia/container-toolkit/libnvidia-container/-/merge_requests/96
>
> If you need nvidia-docker to work on debian11 now you can:
>
>   1. Install it as if it were coming from debian10, i.e.:
> distribution=$(. /etc/os-release;echo debian10)
> curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
>   sudo tee /etc/apt/sources.list.d/nvidia-docker.list
>   2. Disable cgroup support in nvidia-docker, i.e.:
> $ cat /etc/nvidia-container-runtime/config.toml
> ...
> no-cgroups = true
> ...
>   3. Manually inject any nvidia device nodes you need access to on the docker command line, e.g.:
> docker run --runtime=nvidia \
>     --gpus '"device=0,1"' \
>     --device /dev/nvidiactl \
>     --device /dev/nvidia0 \
>     --device /dev/nvidia1 \
>     ubuntu:20.04 \
>     nvidia-smi
>
> Note, if you are running with Kubernetes, the equivalent of --device ... is handled for you if you run the k8s-device-plugin with the --pass-device-specs flag: https://github.com/NVIDIA/k8s-device-plugin/blob/master/cmd/nvidia-device-plugin/main.go#L61

  • It is not enough to just disable cgroup support with no-cgroups = true.
  • At present, to run nvidia-docker normally, Debian 11 needs cgroups v2 completely disabled, or Docker has to run the container as privileged, which is not what we want.
  • We hope cgroups v2 will be supported as soon as possible.

@klueska
Contributor

klueska commented Oct 12, 2021

> • It is not enough to just disable cgroup support with no-cgroups = true.
> • At present, to run nvidia-docker normally, Debian 11 needs cgroups v2 completely disabled, or Docker has to run the container as privileged, which is not what we want.

Are you saying that setting no-cgroups = true still results in the container failing to start, or that you just don't get the devices injected in as you do when no-cgroups = false?

If the container fails to start, then that's a bug and I'd like to know what the error message is.

If it starts, but the devices are not present, then that is by design, and you will need to do manual injection of the devices as outlined in the final step of my previous comment.

It's obviously not an ideal solution, but it's a way to make things work until cgroupv2 is officially supported.

@chaiyd
Author

chaiyd commented Oct 12, 2021

> > • It is not enough to just disable cgroup support with no-cgroups = true.
> > • At present, to run nvidia-docker normally, Debian 11 needs cgroups v2 completely disabled, or Docker has to run the container as privileged, which is not what we want.
>
> Are you saying that setting no-cgroups = true still results in the container failing to start, or that you just don't get the devices injected in as you do when no-cgroups = false?
>
> If the container fails to start, then that's a bug and I'd like to know what the error message is.
>
> If it starts, but the devices are not present, then that is by design, and you will need to do manual injection of the devices as outlined in the final step of my previous comment.
>
> It's obviously not an ideal solution, but it's a way to make things work until cgroupv2 is officially supported.

  • With only no-cgroups = true set:
# docker run --rm --gpus all --device /dev/nvidiactl --device /dev/nvidia0 ubuntu:20.04 nvidia-smi

NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.
  • After also setting ldconfig = "/sbin/ldconfig":
# cat /etc/nvidia-container-runtime/config.toml
...
no-cgroups = true
...
ldconfig = "/sbin/ldconfig"
...
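
For context, this is likely because the packaged default config points ldconfig at @/sbin/ldconfig.real, a path that exists on Ubuntu but not on Debian, so the injected driver libraries never make it into the container's linker cache. A minimal sketch of the edit, assuming the stock config keeps an uncommented ldconfig line:

sudo sed -i 's|^ldconfig = .*|ldconfig = "/sbin/ldconfig"|' /etc/nvidia-container-runtime/config.toml
sudo systemctl restart docker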

@chaiyd
Author

chaiyd commented Oct 12, 2021

> I tried to install on Debian 11 but I got this error message:
>
> # Unsupported distribution!
> # Check https://nvidia.github.io/nvidia-docker
>
> I checked, and Debian 11 is not supported yet.
>
> When can we expect it?
>
> Linux 5.10.0-9-amd64 SMP Debian 5.10.70-1 (2021-09-30) x86_64 GNU/Linux Docker: 20.10.9 NV: 470.57.02-1

  • You can try the following, but pay attention to compatibility with Debian 11:
# distribution=$(. /etc/os-release;echo debian10) \
   && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
   && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
# apt update 
# apt install nvidia-docker2
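
A quick way to confirm the repo was picked up before installing is, for example:

# apt-cache policy nvidia-docker2 libnvidia-container1

which should list candidate versions coming from nvidia.github.io.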

@redskinhu

redskinhu commented Oct 12, 2021

Sounds good, worth a try. Tomorrow.
I thought about something like this, but I didn't know/couldn't find what to substitute for $distribution.

Thx

@redskinhu

Hi

apt install nvidia-docker2 installs it without any error.
But when I want to do the next step on my setup ( blakeblackshear/frigate#1847 (comment) )

sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi

I got this error:

docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: container error: cgroup subsystem devices not found: unknown.

Any idea?

@klueska
Contributor

klueska commented Oct 20, 2021

Yes, this is due to libnvidia-container not supporting cgroupv2 as discussed here:
#1549 (comment)

@redskinhu

Thanks for the info.

@Lecrapouille

@redskinhu

> apt install nvidia-docker2 installs it without any error.

How have you set up your repo list? I have the non-free list, but nvidia-docker2 is unknown on my debian-11.

@elezar
Member

elezar commented Oct 27, 2021

@Lecrapouille you should be able to download / install the Debian 10 packages from the repository.

@chaiyd
Author

chaiyd commented Oct 27, 2021

> @redskinhu
>
> > apt install nvidia-docker2 installs it without any error.
>
> How have you set up your repo list? I have the non-free list, but nvidia-docker2 is unknown on my debian-11.

# distribution=$(. /etc/os-release;echo debian10) \
   && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
   && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
# apt update 
# apt install nvidia-docker2

@redskinhu

redskinhu commented Oct 27, 2021

Hello

I did the install based on this guide:
blakeblackshear/frigate#1847 (comment)

@Lecrapouille

Lecrapouille commented Oct 27, 2021

Thanks! I finally get GPU info after applying the above.

PS: @redskinhu your link is not working

@redskinhu

Corrected, Thx

@frederico-klein

> > > • It is not enough to just disable cgroup support with no-cgroups = true.
> > > • At present, to run nvidia-docker normally, Debian 11 needs cgroups v2 completely disabled, or Docker has to run the container as privileged, which is not what we want.
> >
> > Are you saying that setting no-cgroups = true still results in the container failing to start, or that you just don't get the devices injected in as you do when no-cgroups = false?
> > If the container fails to start, then that's a bug and I'd like to know what the error message is.
> > If it starts, but the devices are not present, then that is by design, and you will need to do manual injection of the devices as outlined in the final step of my previous comment.
> > It's obviously not an ideal solution, but it's a way to make things work until cgroupv2 is officially supported.
>
> * With only `no-cgroups = true` set:
> # docker run --rm --gpus all --device /dev/nvidiactl --device /dev/nvidia0 ubuntu:20.04 nvidia-smi
>
> NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
> Please also try adding directory that contains libnvidia-ml.so to your system PATH.
> * After also setting `ldconfig = "/sbin/ldconfig"`:
> # cat /etc/nvidia-container-runtime/config.toml
> ...
> no-cgroups = true
> ...
> ldconfig = "/sbin/ldconfig"
> ...

Installing the debian10 nvidia-docker2 AND editing /etc/nvidia-container-runtime/config.toml as above, WITH the

 --device /dev/nvidiactl --device /dev/nvidia0 

flags on the docker command line (and restarting the docker service) did it for me.

@chaiyd
Author

chaiyd commented Nov 18, 2021

> > • It is not enough to just disable cgroup support with no-cgroups = true.
> > • At present, to run nvidia-docker normally, Debian 11 needs cgroups v2 completely disabled, or Docker has to run the container as privileged, which is not what we want.
> >
> > Are you saying that setting no-cgroups = true still results in the container failing to start, or that you just don't get the devices injected in as you do when no-cgroups = false?
> > If the container fails to start, then that's a bug and I'd like to know what the error message is.
> > If it starts, but the devices are not present, then that is by design, and you will need to do manual injection of the devices as outlined in the final step of my previous comment.
> > It's obviously not an ideal solution, but it's a way to make things work until cgroupv2 is officially supported.
> >
> > * With only `no-cgroups = true` set:
> > # docker run --rm --gpus all --device /dev/nvidiactl --device /dev/nvidia0 ubuntu:20.04 nvidia-smi
> >
> > NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
> > Please also try adding directory that contains libnvidia-ml.so to your system PATH.
> > * After also setting `ldconfig = "/sbin/ldconfig"`:
> > # cat /etc/nvidia-container-runtime/config.toml
> > ...
> > no-cgroups = true
> > ...
> > ldconfig = "/sbin/ldconfig"
> > ...
>
> Installing the debian10 nvidia-docker2 AND editing /etc/nvidia-container-runtime/config.toml as above, WITH the
>
>  --device /dev/nvidiactl --device /dev/nvidia0 
>
> flags on the docker command line (and restarting the docker service) did it for me.

@frederico-klein Yeah, this works, but it is not an ideal solution; we need a better one.

@galosre

galosre commented Nov 20, 2021

I'm not sure about this part:

> editing /etc/nvidia-container-runtime/config.toml
> with the
> --device /dev/nvidiactl --device /dev/nvidia0

So I put this in /etc/nvidia-container-runtime/config.toml, and I also have
no-cgroups = true
...
ldconfig = "/sbin/ldconfig"
but in the Portainer runtime dropdown I do not have NVIDIA?

@chaiyd
Author

chaiyd commented Nov 20, 2021

> I'm not sure about this part:
>
> > editing /etc/nvidia-container-runtime/config.toml
> > with the
> > --device /dev/nvidiactl --device /dev/nvidia0
>
> So I put this in /etc/nvidia-container-runtime/config.toml, and I also have
> no-cgroups = true
> ...
> ldconfig = "/sbin/ldconfig"
> but in the Portainer runtime dropdown I do not have NVIDIA?

Are you sure you have installed CUDA or the NVIDIA driver?

@galosre

galosre commented Nov 21, 2021

NVIDIA-SMI 495.44 is running; what am I missing?
OK, I got nvidia in the Portainer runtime dropdown, but it's still not working.
docker run --rm --gpus all --privileged -v /dev:/dev nvidia/cuda:11.0-base nvidia-smi
Sun Nov 21 03:25:37 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44 Driver Version: 495.44 CUDA Version: 11.5 |
But if I add --device /dev/nvidiactl --device /dev/nvidia0 to config.toml, I get:
docker run --rm --gpus all --privileged -v /dev:/dev nvidia/cuda:11.0-base nvidia-smi
docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: couldn't open default configuration file: Near line 17 (last key parsed 'nvidia-container-cli.--device'): expected key separator '=', but got '/' instead: unknown.
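
(For what it's worth, that parse error is because the --device flags are docker command-line options, not config.toml keys; config.toml should only hold TOML settings such as no-cgroups and ldconfig, with the devices passed on the docker run line instead, e.g.:

docker run --rm --gpus all --device /dev/nvidiactl --device /dev/nvidia0 nvidia/cuda:11.0-base nvidia-smi

assuming the image and device paths from the comments above.)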

@galosre

galosre commented Nov 21, 2021

Running my container in privileged mode got it working

@chaiyd
Author

chaiyd commented Nov 22, 2021

> Running my container in privileged mode got it working

ok

@galosre

galosre commented Dec 2, 2021

Today my system was updated with:
nvidia-container-toolkit:amd64 1.6.0-1 → 1.7.0-1
libnvidia-container-tools:amd64 1.6.0-1 → 1.7.0-1
libnvidia-container1:amd64 1.6.0-1 → 1.7.0-1
So how do I use those packages now?

@chenhengqi

Have to mention here, I use cgroup v2 :)

$ ctr run --rm --runtime=io.containerd.runtime.v1.linux --env NVIDIA_VISIBLE_DEVICES=0 nvcr.io/nvidia/cuda:11.4.2-base-ubuntu20.04 test nvidia-smi -L
ctr: cgroups: cgroup mountpoint does not exist: unknown

@klueska
Contributor

klueska commented Dec 3, 2021

> Have to mention here, I use cgroup v2 :)
>
> $ ctr run --rm --runtime=io.containerd.runtime.v1.linux --env NVIDIA_VISIBLE_DEVICES=0 nvcr.io/nvidia/cuda:11.4.2-base-ubuntu20.04 test nvidia-smi -L
> ctr: cgroups: cgroup mountpoint does not exist: unknown

Yes, I realize this, which is why you want to set no-cgroups = true in the config.toml.
The fact that you were able to run the above and get the error you see means that you are finally running through the runtime (and not the CLI directly), so you should now be able to put the workaround in place and get things to work.

@chenhengqi

@klueska The following workaround does not work, because the --no-cgroups option only applies to the configure subcommand, NOT to the cli itself.

> It's a hack, but if you want this to work with ctr you will need to wrap the nvidia-container-cli as
>
> $ mv /usr/bin/nvidia-container-cli /usr/bin/nvidia-container-cli.real
> ... create a wrapper /usr/bin/nvidia-container-cli...
>
> $ cat /usr/bin/nvidia-container-cli
> #!/usr/bin/env bash
> exec /usr/bin/nvidia-container-cli.real --no-cgroups "$@"

@klueska
Contributor

klueska commented Dec 6, 2021

You're right, I didn't test my suggestion (since it really is a pretty brutal hack). You would need to do something more sophisticated like:

#!/usr/bin/env bash

if [[ "${*}" != *"configure"* ]]; then
        exec /usr/bin/nvidia-container-cli.real "$@"
fi

exec /usr/bin/nvidia-container-cli.real "${@:1:$#-1}" --no-cgroups "${@: -1}"

(which I did test this time).
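
If it helps, one way to drop that wrapper in place (assuming the real binary was already moved aside as above):

sudo tee /usr/bin/nvidia-container-cli > /dev/null <<'EOF'
#!/usr/bin/env bash
if [[ "${*}" != *"configure"* ]]; then
        exec /usr/bin/nvidia-container-cli.real "$@"
fi
exec /usr/bin/nvidia-container-cli.real "${@:1:$#-1}" --no-cgroups "${@: -1}"
EOF
sudo chmod +x /usr/bin/nvidia-container-cli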

@klueska
Contributor

klueska commented Dec 8, 2021

We now have an RC of libnvidia-container out that adds support for cgroupv2.

If you would like to try it out, make sure to add the experimental repo to your package sources and install the latest packages:

For DEBs

sudo sed -i -e '/experimental/ s/^#//g' /etc/apt/sources.list.d/libnvidia-container.list
sudo apt-get update
sudo apt-get install -y libnvidia-container-tools libnvidia-container1

For RPMs

sudo yum-config-manager --enable libnvidia-container-experimental
sudo yum install -y libnvidia-container-tools libnvidia-container1
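
Once installed, a quick sanity check (shown for the DEB side) might be:

dpkg -l libnvidia-container1 libnvidia-container-tools
stat -fc %T /sys/fs/cgroup
nvidia-container-cli info

The stat call prints cgroup2fs on a cgroup-v2-only host; nvidia-container-cli info just confirms the CLI can see the driver and devices.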

@chenhengqi

CentOS 8 + Cgroup v2 + containerd

$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
>    && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
$ dnf config-manager --set-enabled libnvidia-container-experimental
$ dnf install -y libnvidia-container-tools libnvidia-container1 nvidia-container-runtime

The following commands now work:

$ ctr image pull docker.io/nvidia/cuda:11.0-base
$ ctr run --rm --gpus 0 -t docker.io/nvidia/cuda:11.0-base cuda-11.0-base nvidia-smi

Cheers ! 💯

@chaiyd
Author

chaiyd commented Dec 9, 2021

> CentOS 8 + Cgroup v2 + containerd
>
> $ distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
> >    && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
> $ dnf config-manager --set-enabled libnvidia-container-experimental
> $ dnf install -y libnvidia-container-tools libnvidia-container1 nvidia-container-runtime
>
> The following commands now work:
>
> $ ctr image pull docker.io/nvidia/cuda:11.0-base
> $ ctr run --rm --gpus 0 -t docker.io/nvidia/cuda:11.0-base cuda-11.0-base nvidia-smi
>
> Cheers ! 💯

What about Debian 11 — does it work there too?

@klueska
Contributor

klueska commented Dec 9, 2021

This RC was tested almost exclusively on a Debian 11 system, so I'd be surprised if it's not working there.

That said, we don't yet officially support Debian 11, so you will need to add the apt repo for Debian 10:

distribution=$(. /etc/os-release;echo debian10)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list

And (as mentioned above) to get access to the RC package, you will need to enable the experimental repo:

sudo sed -i -e '/experimental/ s/^#//g' /etc/apt/sources.list.d/libnvidia-container.list
sudo apt-get update

And install it:

sudo apt-get  install -y libnvidia-container-tools libnvidia-container1

You should then see the following versions installed:

$ sudo dpkg --list libnvidia-container1 libnvidia-container-tools
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                                     Version                   Architecture              Description
+++-========================================-=========================-=========================-=====================================================================================
ii  libnvidia-container-tools                1.8.0~rc.1-1              amd64                     NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64               1.8.0~rc.1-1              amd64                     NVIDIA container runtime library

@chenhengqi

When will there be a GA release of this new libnvidia-container ?

@klueska
Contributor

klueska commented Dec 9, 2021

Sometime in the new year after it has been thoroughly tested and certified.

@chaiyd
Author

chaiyd commented Dec 9, 2021

This is very good news; looking forward to the GA. At the same time, I also hope you will consider supporting Debian on ARM.

@klueska
Contributor

klueska commented Dec 9, 2021

In general this should also work on ARM, though you'll likely need to set the distribution to ubuntu20.04 to get the packages you want since we don't yet "officially support" debian systems on ARM. (All that not "officially supporting" means is that we don't do thorough testing on these platforms; often the same packages from other distributions work there.)

From my testing yesterday on a g5g.xlarge instance on EC2 with ubuntu 20.04:

Note: The initial setup tests against cgroupv1 which is why the existing stable releases work.

Setup

$ history
    1  wget https://us.download.nvidia.com/tesla/470.82.01/NVIDIA-Linux-aarch64-470.82.01.run
    2  chmod a+x NVIDIA-Linux-aarch64-470.82.01.run
    3  sudo apt-get update
    4  sudo apt-get install build-essential linux-headers-5.11.0-1020-aws
    5  sudo ./NVIDIA-Linux-aarch64-470.82.01.run
    6  curl https://get.docker.com | sh   && sudo systemctl --now enable docker
    7  distribution=$(. /etc/os-release;echo $ID$VERSION_ID)    && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -    && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
    8  sudo apt-get update
    9  sudo apt-get install -y nvidia-docker2
   10  sudo systemctl restart docker

Baseline test with latest stable libnvidia-container1-1.7.0 and libnvidia-container-tools-1.7.0:

$ sudo docker run --rm --gpus all nvidia/cuda:11.4.2-base-ubuntu20.04 nvidia-smi
Wed Dec  8 11:37:16 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA T4G          Off  | 00000000:00:1F.0 Off |                    0 |
| N/A   70C    P0    19W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Install new libnvidia-container1-1.8.0~rc.1 and libnvidia-container-tools-1.8.0~rc.1:

$ sudo sed -i -e '/experimental/ s/^#//g' /etc/apt/sources.list.d/libnvidia-container.list
$ sudo apt-get update
$ sudo apt-get install -y libnvidia-container-tools libnvidia-container1

Same baseline test with newer packages:

$ sudo docker run --rm --gpus all nvidia/cuda:11.4.2-base-ubuntu20.04 nvidia-smi
Wed Dec  8 11:41:18 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA T4G          Off  | 00000000:00:1F.0 Off |                    0 |
| N/A   53C    P0    17W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Remove libnvidia-container-go.so.1.8.0 and rerun (to verify we are going through the new nvcgo implementation of cgroup manipulation):

$ sudo rm -rf /usr/lib/aarch64-linux-gnu/libnvidia-container-go.so.1.8.0

Rerun baseline test

$ sudo docker run --rm --gpus all nvidia/cuda:11.4.2-base-ubuntu20.04 nvidia-smi
docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: load library failed: libnvidia-container-go.so.1: cannot open shared object file: no such file or directory: unknown.

Restore libnvidia-container-go.so.1.8.0:

$ sudo apt-get --reinstall install libnvidia-container1

Run a series of tests on other container OSs:

$ sudo docker run --rm --gpus all nvidia/cuda:11.4.2-base-ubuntu20.04 nvidia-smi -L
GPU 0: NVIDIA T4G (UUID: GPU-ac104454-a5f6-5732-dc35-225062fdbeb7)
$ sudo docker run --rm --gpus all ubuntu:18.04 nvidia-smi -L
GPU 0: NVIDIA T4G (UUID: GPU-ac104454-a5f6-5732-dc35-225062fdbeb7)
$ sudo docker run --rm --gpus all centos:7 nvidia-smi -L
GPU 0: NVIDIA T4G (UUID: GPU-ac104454-a5f6-5732-dc35-225062fdbeb7)

Follow the instructions at the following link to enable cgroupv2:

https://rootlesscontaine.rs/getting-started/common/cgroup2

Reboot the machine and enable logging for the toolkit:

$ sudo vi /etc/nvidia-container-runtime/config.toml
...
- #debug = "/var/log/nvidia-container-toolkit.log"
+ debug = "/var/log/nvidia-container-toolkit.log"
...

Rerun the three tests from above:

$ sudo docker run --rm --gpus all nvidia/cuda:11.4.2-base-ubuntu20.04 nvidia-smi -L
GPU 0: NVIDIA T4G (UUID: GPU-ac104454-a5f6-5732-dc35-225062fdbeb7)
$ sudo docker run --rm --gpus all ubuntu:18.04 nvidia-smi -L
GPU 0: NVIDIA T4G (UUID: GPU-ac104454-a5f6-5732-dc35-225062fdbeb7)
$ sudo docker run --rm --gpus all centos:7 nvidia-smi -L
GPU 0: NVIDIA T4G (UUID: GPU-ac104454-a5f6-5732-dc35-225062fdbeb7)

Verify in the logs that these were run with cgroupv2 detected:

$ grep "detected cgroup" /var/log/nvidia-container-toolkit.log
I1208 11:59:35.130746 1782 nvc_container.c:272] detected cgroupv2
I1208 11:59:42.452570 1926 nvc_container.c:272] detected cgroupv2
I1208 11:59:49.526292 2071 nvc_container.c:272] detected cgroupv2

@chaiyd
Author

chaiyd commented Dec 9, 2021

Once I have an ARM device, I'll try it again.

@TaridaGeorge

> sudo sed -i -e '/experimental/ s/^#//g' /etc/apt/sources.list.d/libnvidia-container.list

sed: can't read /etc/apt/sources.list.d/libnvidia-container.list: No such file or directory

What am I doing wrong?

@klueska
Contributor

klueska commented Dec 9, 2021

You might have an nvidia-docker.list or an nvidia-container-runtime.list file instead of a libnvidia-container.list file.

The command will be the same; just swap out the file name at the end.

@TaridaGeorge

Yup. Thank you!

@TaridaGeorge

TaridaGeorge commented Dec 11, 2021

If I use the experimental versions, do I still need to set no-cgroups to true and ldconfig to "/sbin/ldconfig" in /etc/nvidia-container-runtime/config.toml? Also, do I still need to manually inject any nvidia device nodes I need access to on the docker command line?

@chenhengqi

> If I use the experimental versions, do I still need to set no-cgroups to true and ldconfig to "/sbin/ldconfig" in /etc/nvidia-container-runtime/config.toml? Also, do I still need to manually inject any nvidia device nodes I need access to on the docker command line?

FYI, on CentOS 8, no extra configs are needed. Everything works fine.

@chenhengqi

@klueska @elezar

Hello, do you happen to know how to install NVIDIA drivers (version 418) on CentOS 8?

The installer (from http://download.nvidia.com/XFree86/Linux-x86_64/418.113/NVIDIA-Linux-x86_64-418.113-no-compat32.run ) failed with:

$ ./NVIDIA-Linux-x86_64-418.113-no-compat32.run --ui=none --disable-nouveau --no-install-libglvnd --dkms -s
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 418.113........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

WARNING: One or more modprobe configuration files to disable Nouveau are already present at:
         /usr/lib/modprobe.d/nvidia-installer-disable-nouveau.conf, /etc/modprobe.d/nvidia-installer-disable-nouveau.conf.
         Please be sure you have rebooted your system since these files were written.  If you have rebooted, then Nouveau
         may be enabled for other reasons, such as being included in the system initial ramdisk or in your X configuration
         file.  Please consult the NVIDIA driver README and your Linux distribution's documentation for details on how to
         correctly disable the Nouveau kernel driver.


WARNING: nvidia-installer was forced to guess the X library path '/usr/lib64' and X module path '/usr/lib64/xorg/modules';
         these paths were not queryable from the system.  If X fails to find the NVIDIA X driver module, please install
         the `pkg-config` utility and the X.Org SDK/development package for your distribution and reinstall the driver.


ERROR: Failed to run `/usr/sbin/dkms build -m nvidia -v 418.113 -k 4.18.0-348.2.1.el8_5.x86_64`: 
       Building module:
       cleaning build area...
       'make' -j20 NV_EXCLUDE_BUILD_MODULES='' KERNEL_UNAME=4.18.0-348.2.1.el8_5.x86_64 IGNORE_CC_MISMATCH=''
       modules....(bad exit status: 2)
       Error! Bad return status for module build on kernel: 4.18.0-348.2.1.el8_5.x86_64 (x86_64)
       Consult /var/lib/dkms/nvidia/418.113/build/make.log for more information.


ERROR: Failed to install the kernel module through DKMS. No kernel module was installed; please try installing again
       without DKMS, or check the DKMS logs for more information.


ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find
       suggestions on fixing installation problems in the README available on the Linux driver download page at
       www.nvidia.com.
DKMS make.log for nvidia-418.113 for kernel 4.18.0-348.2.1.el8_5.x86_64 (x86_64)
Tue Dec 14 19:54:24 CST 2021
make[1]: Entering directory '/usr/src/kernels/4.18.0-348.2.1.el8_5.x86_64'
make[2]: Entering directory '/usr/src/kernels/4.18.0-348.2.1.el8_5.x86_64'
  SYMLINK /var/lib/dkms/nvidia/418.113/build/nvidia/nv-kernel.o
  SYMLINK /var/lib/dkms/nvidia/418.113/build/nvidia-modeset/nv-modeset-kernel.o
 CONFTEST: INIT_WORK
 CONFTEST: remap_pfn_range
 CONFTEST: hash__remap_4k_pfn
 CONFTEST: follow_pfn
 CONFTEST: vmap
 CONFTEST: set_pages_uc
 CONFTEST: list_is_first
 CONFTEST: set_memory_uc
 CONFTEST: set_memory_array_uc
 CONFTEST: change_page_attr
 CONFTEST: pci_get_class
 CONFTEST: pci_choose_state
 CONFTEST: vm_insert_page
 CONFTEST: acpi_device_id
 CONFTEST: acquire_console_sem
 CONFTEST: console_lock
 CONFTEST: kmem_cache_create
 CONFTEST: on_each_cpu
 CONFTEST: smp_call_function
 CONFTEST: acpi_evaluate_integer
 CONFTEST: ioremap_cache
 CONFTEST: ioremap_wc
 CONFTEST: acpi_walk_namespace
 CONFTEST: pci_domain_nr
 CONFTEST: pci_dma_mapping_error
 CONFTEST: sg_alloc_table
 CONFTEST: sg_init_table
 CONFTEST: pci_get_domain_bus_and_slot
 CONFTEST: get_num_physpages
 CONFTEST: efi_enabled
 CONFTEST: proc_create_data
 CONFTEST: pde_data
 CONFTEST: proc_remove
 CONFTEST: pm_vt_switch_required
 CONFTEST: xen_ioemu_inject_msi
 CONFTEST: phys_to_dma
 CONFTEST: get_dma_ops
 CONFTEST: write_cr4
 CONFTEST: of_get_property
 CONFTEST: of_find_node_by_phandle
 CONFTEST: of_node_to_nid
 CONFTEST: pnv_pci_get_npu_dev
 CONFTEST: of_get_ibm_chip_id
 CONFTEST: for_each_online_node
 CONFTEST: node_end_pfn
 CONFTEST: pci_bus_address
 CONFTEST: pci_stop_and_remove_bus_device
 CONFTEST: pci_remove_bus_device
 CONFTEST: request_threaded_irq
 CONFTEST: register_cpu_notifier
 CONFTEST: cpuhp_setup_state
 CONFTEST: dma_map_resource
 CONFTEST: backlight_device_register
 CONFTEST: register_acpi_notifier
 CONFTEST: timer_setup
 CONFTEST: pci_enable_msix_range
 CONFTEST: compound_order
 CONFTEST: do_gettimeofday
 CONFTEST: dma_direct_map_resource
 CONFTEST: vmf_insert_pfn
 CONFTEST: remap_page_range
 CONFTEST: address_space_init_once
 CONFTEST: kbasename
 CONFTEST: fatal_signal_pending
 CONFTEST: list_cut_position
 CONFTEST: vzalloc
 CONFTEST: wait_on_bit_lock_argument_count
 CONFTEST: bitmap_clear
 CONFTEST: usleep_range
 CONFTEST: radix_tree_empty
 CONFTEST: radix_tree_replace_slot
 CONFTEST: pnv_npu2_init_context
 CONFTEST: drm_dev_unref
 CONFTEST: drm_reinit_primary_mode_group
 CONFTEST: get_user_pages_remote
 CONFTEST: get_user_pages
 CONFTEST: drm_gem_object_lookup
 CONFTEST: drm_atomic_state_ref_counting
 CONFTEST: drm_driver_has_gem_prime_res_obj
 CONFTEST: drm_atomic_helper_connector_dpms
 CONFTEST: drm_connector_funcs_have_mode_in_name
 CONFTEST: drm_framebuffer_get
 CONFTEST: drm_gem_object_get
 CONFTEST: drm_dev_put
 CONFTEST: is_export_symbol_gpl_of_node_to_nid
 CONFTEST: is_export_symbol_present_swiotlb_map_sg_attrs
 CONFTEST: is_export_symbol_present_swiotlb_dma_ops
 CONFTEST: i2c_adapter
 CONFTEST: pm_message_t
 CONFTEST: irq_handler_t
 CONFTEST: acpi_device_ops
 CONFTEST: acpi_op_remove
 CONFTEST: outer_flush_all
 CONFTEST: proc_dir_entry
 CONFTEST: scatterlist
 CONFTEST: sg_table
 CONFTEST: file_operations
 CONFTEST: vm_operations_struct
 CONFTEST: atomic_long_type
 CONFTEST: file_inode
 CONFTEST: task_struct
 CONFTEST: kuid_t
 CONFTEST: dma_ops
 CONFTEST: swiotlb_dma_ops
 CONFTEST: dma_map_ops
 CONFTEST: noncoherent_swiotlb_dma_ops
 CONFTEST: vm_fault_present
 CONFTEST: vm_fault_has_address
 CONFTEST: backlight_properties_type
 CONFTEST: vmbus_channel_has_ringbuffer_page
 CONFTEST: kmem_cache_has_kobj_remove_work
 CONFTEST: sysfs_slab_unlink
 CONFTEST: fault_flags
 CONFTEST: atomic64_type
 CONFTEST: address_space
 CONFTEST: backing_dev_info
 CONFTEST: mm_context_t
 CONFTEST: vm_ops_fault_removed_vma_arg
 CONFTEST: node_states_n_memory
 CONFTEST: drm_bus_present
 CONFTEST: drm_bus_has_bus_type
 CONFTEST: drm_bus_has_get_irq
 CONFTEST: drm_bus_has_get_name
 CONFTEST: drm_driver_has_legacy_dev_list
 CONFTEST: drm_driver_has_set_busid
 CONFTEST: drm_crtc_state_has_connectors_changed
 CONFTEST: drm_init_function_args
 CONFTEST: drm_mode_connector_list_update_has_merge_type_bits_arg
 CONFTEST: drm_helper_mode_fill_fb_struct
 CONFTEST: drm_master_drop_has_from_release_arg
 CONFTEST: drm_driver_unload_has_int_return_type
 CONFTEST: kref_has_refcount_of_type_refcount_t
 CONFTEST: drm_atomic_helper_crtc_destroy_state_has_crtc_arg
 CONFTEST: drm_crtc_helper_funcs_has_atomic_enable
 CONFTEST: drm_mode_object_find_has_file_priv_arg
 CONFTEST: dma_buf_owner
 CONFTEST: drm_connector_list_iter
 CONFTEST: drm_atomic_helper_swap_state_has_stall_arg
 CONFTEST: drm_driver_prime_flag_present
 CONFTEST: dom0_kernel_present
 CONFTEST: nvidia_vgpu_hyperv_available
 CONFTEST: nvidia_vgpu_kvm_build
 CONFTEST: nvidia_grid_build
 CONFTEST: drm_available
 CONFTEST: drm_atomic_available
 CONFTEST: is_export_symbol_gpl_refcount_inc
 CONFTEST: is_export_symbol_gpl_refcount_dec_and_test
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/nv-frontend.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/nv-instance.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/nv.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/nv-acpi.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/nv-chrdev.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/nv-cray.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/nv-dma.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/nv-gvi.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/nv-i2c.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/nv-mempool.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/nv-mmap.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/nv-p2p.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/nv-pat.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/nv-procfs.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/nv-usermap.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/nv-vm.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/nv-vtophys.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/os-interface.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/os-mlock.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/os-pci.o
/var/lib/dkms/nvidia/418.113/build/nvidia/nv.c: In function 'nvidia_probe':
/var/lib/dkms/nvidia/418.113/build/nvidia/nv.c:4129:5: error: implicit declaration of function 'vga_tryget'; did you mean 'vga_get'? [-Werror=implicit-function-declaration]
     vga_tryget(VGA_DEFAULT_DEVICE, VGA_RSRC_LEGACY_MASK);
     ^~~~~~~~~~
     vga_get
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/os-registry.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/os-usermap.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/nv-modeset-interface.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/nv-pci-table.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/nv-kthread-q.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/nv-kthread-q-selftest.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/nv-memdbg.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/nv-ibmnpu.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/nv-report-err.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/nv-rsync.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/nv-msi.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/nv_uvm_interface.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/nvlink_linux.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/linux_nvswitch.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia-uvm/uvm_utils.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia-uvm/uvm_common.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia-uvm/uvm_linux.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia-uvm/nvstatus.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia-uvm/nvCpuUuid.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia-uvm/uvm8.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia-uvm/uvm8_tools.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia-uvm/uvm8_global.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia-uvm/uvm8_gpu.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia-uvm/uvm8_gpu_isr.o
cc1: some warnings being treated as errors
make[3]: *** [/usr/src/kernels/4.18.0-348.2.1.el8_5.x86_64/scripts/Makefile.build:315: /var/lib/dkms/nvidia/418.113/build/nvidia/nv.o] Error 1
make[3]: *** Waiting for unfinished jobs....
make[2]: *** [/usr/src/kernels/4.18.0-348.2.1.el8_5.x86_64/Makefile:1571: _module_/var/lib/dkms/nvidia/418.113/build] Error 2
make[2]: Leaving directory '/usr/src/kernels/4.18.0-348.2.1.el8_5.x86_64'
make[1]: *** [Makefile:157: sub-make] Error 2
make[1]: Leaving directory '/usr/src/kernels/4.18.0-348.2.1.el8_5.x86_64'
make: *** [Makefile:81: modules] Error 2

@klueska
Contributor

klueska commented Jan 28, 2022

libnvidia-container-1.8.0-rc.2 is now live with some minor updates to fix some edge cases around cgroupv2 support.

Please see NVIDIA/libnvidia-container#111 (comment) for instructions on how to get access to this RC (or wait for the full release at the end of next week).

Note: This does not directly add debian11 support, but you can point to the debian10 repo and install from there for now.

@klueska
Contributor

klueska commented Feb 4, 2022

libnvidia-container-1.8.0 with cgroupv2 support is now GA

Release notes here:
https://github.com/NVIDIA/libnvidia-container/releases/tag/v1.8.0

@klueska
Contributor

klueska commented Feb 4, 2022

Debian 11 support has now been added such that running the following should now work as expected:

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
   && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
   && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
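
And then the usual follow-on steps, with the image tag here just as an example:

sudo apt-get update
sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker
sudo docker run --rm --gpus all nvidia/cuda:11.4.2-base-ubuntu20.04 nvidia-smi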

@klueska klueska closed this as completed Feb 4, 2022
@klueska
Contributor

klueska commented Mar 22, 2022

The newest version of nvidia-docker should resolve these issues with ldconfig not properly setting up the library search path on debian systems before a container gets launched.

Specifically this change in libnvidia-container fixes the issue and is included as part of the latest release:
https://gitlab.com/nvidia/container-toolkit/libnvidia-container/-/merge_requests/141

The latest release packages for the full nvidia-docker stack:

libnvidia-container1-1.9.0
libnvidia-container-tools-1.9.0
nvidia-container-toolkit-1.9.0
nvidia-container-runtime-3.9.0
nvidia-docker-2.10.0
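
On an already-configured system, picking these up should just be a matter of upgrading the installed packages, e.g.:

sudo apt-get update
sudo apt-get install -y --only-upgrade \
    libnvidia-container1 libnvidia-container-tools \
    nvidia-container-toolkit nvidia-container-runtime nvidia-docker2
sudo systemctl restart docker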
