This repository has been archived by the owner on Jan 22, 2024. It is now read-only.

debian11 #1549

Closed
chaiyd opened this issue Sep 22, 2021 · 65 comments

Comments

@chaiyd

chaiyd commented Sep 22, 2021

  • Please add support for Debian 11.
@chaiyd
Author

chaiyd commented Sep 23, 2021

  • Debian 11 uses systemd with cgroups v2 by default.
  • If Docker is run without privileged mode, containers report Failed to initialize NVML: Unknown Error.

@redskinhu

I tried to install on Debian 11 but I got this error message:

# Unsupported distribution!
# Check https://nvidia.github.io/nvidia-docker

I checked, and Debian 11 is not supported yet.

When can we expect it?

Linux 5.10.0-9-amd64 SMP Debian 5.10.70-1 (2021-09-30) x86_64 GNU/Linux
Docker: 20.10.9
NV: 470.57.02-1

@klueska
Contributor

klueska commented Oct 11, 2021

The main blocker at the moment for debian11 is cgroupv2 support. We have two ongoing efforts for this at the moment:

  1. We are rearchitecting the NVIDIA container stack so that most of the functionality around injecting libs/devices/binaries can happen in a shim above the OCI runtime. At which point, we get cgroupv2 for free.

  2. Direct cgroupv2 support in libnvidia-container, which is proving to be more involved than originally hoped and is currently on pause for the moment: https://gitlab.com/nvidia/container-toolkit/libnvidia-container/-/merge_requests/96

If you need nvidia-docker to work on debian11 now you can:

  1. Install it as if it were coming from debian10, i.e.:
distribution=$(. /etc/os-release;echo debian10)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list
  2. Disable cgroup support in nvidia-docker, i.e.:
$ cat /etc/nvidia-container-runtime/config.toml
...
no-cgroups = true
...
  3. Manually inject any nvidia device nodes you need access to on the docker command line, e.g.:
docker run --runtime=nvidia \
    --gpus '"device=0,1"' \
    --device /dev/nvidiactl \
    --device /dev/nvidia0 \
    --device /dev/nvidia1 \
    ubuntu:20.04 \
    nvidia-smi

Note, if you are running with Kubernetes, the equivalent of --device ... is handled for you if you run the k8s-device-plugin with the --pass-device-specs flag: https://github.com/NVIDIA/k8s-device-plugin/blob/master/cmd/nvidia-device-plugin/main.go#L61
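
For reference, a minimal end-to-end sketch of steps 2 and 3 above, assuming the stock config.toml layout (it normally ships the key commented out as #no-cgroups = false) and example device paths:

sudo sed -i 's/^#no-cgroups = false/no-cgroups = true/' /etc/nvidia-container-runtime/config.toml
sudo systemctl restart docker
docker run --rm --runtime=nvidia --gpus '"device=0"' \
    --device /dev/nvidiactl \
    --device /dev/nvidia-uvm \
    --device /dev/nvidia0 \
    ubuntu:20.04 nvidia-smi

(/dev/nvidia-uvm is only needed by actual CUDA workloads, not by nvidia-smi itself.)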

@chaiyd
Author

chaiyd commented Oct 12, 2021

> The main blocker at the moment for debian11 is cgroupv2 support. We have two ongoing efforts for this at the moment:
>
>   1. We are rearchitecting the NVIDIA container stack so that most of the functionality around injecting libs/devices/binaries can happen in a shim above the OCI runtime. At which point, we get cgroupv2 for free.
>   2. Direct cgroupv2 support in libnvidia-container, which is proving to be more involved than originally hoped and is currently on pause for the moment: https://gitlab.com/nvidia/container-toolkit/libnvidia-container/-/merge_requests/96
>
> If you need nvidia-docker to work on debian11 now you can:
>
>   1. Install it as if it were coming from debian10, i.e.:
> distribution=$(. /etc/os-release;echo debian10)
> curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
>   sudo tee /etc/apt/sources.list.d/nvidia-docker.list
>   2. Disable cgroup support in nvidia-docker, i.e.:
> $ cat /etc/nvidia-container-runtime/config.toml
> ...
> no-cgroups = true
> ...
>   3. Manually inject any nvidia device nodes you need access to on the docker command line, e.g.:
> docker run --runtime=nvidia \
>     --gpus '"device=0,1"' \
>     --device /dev/nvidiactl \
>     --device /dev/nvidia0 \
>     --device /dev/nvidia1 \
>     ubuntu:20.04 \
>     nvidia-smi
>
> Note, if you are running with Kubernetes, the equivalent of --device ... is handled for you if you run the k8s-device-plugin with the --pass-device-specs flag: https://github.com/NVIDIA/k8s-device-plugin/blob/master/cmd/nvidia-device-plugin/main.go#L61

  • It is not enough to just disable cgroup support with no-cgroups = true.
  • At present, to run nvidia-docker normally, Debian 11 needs cgroups v2 completely disabled, or Docker has to run the container as privileged, which is not what we want.
  • We hope cgroups v2 will be supported as soon as possible.

@klueska
Contributor

klueska commented Oct 12, 2021

> • It is not enough to just disable cgroup support with no-cgroups = true.
> • At present, to run nvidia-docker normally, Debian 11 needs cgroups v2 completely disabled, or Docker has to run the container as privileged, which is not what we want.

Are you saying that setting no-cgroups = true still results in the container failing to start, or that you just don't get the devices injected in as you do when no-cgroups = false?

If the container fails to start, then that's a bug and I'd like to know what the error message is.

If it starts, but the devices are not present, then that is by design, and you will need to do manual injection of the devices as outlined in the final step of my previous comment.

It's obviously not an ideal solution, but it's a way to make things work until cgroupv2 is officially supported.

@chaiyd
Author

chaiyd commented Oct 12, 2021

> > • It is not enough to just disable cgroup support with no-cgroups = true.
> > • At present, to run nvidia-docker normally, Debian 11 needs cgroups v2 completely disabled, or Docker has to run the container as privileged, which is not what we want.
>
> Are you saying that setting no-cgroups = true still results in the container failing to start, or that you just don't get the devices injected in as you do when no-cgroups = false?
>
> If the container fails to start, then that's a bug and I'd like to know what the error message is.
>
> If it starts, but the devices are not present, then that is by design, and you will need to do manual injection of the devices as outlined in the final step of my previous comment.
>
> It's obviously not an ideal solution, but it's a way to make things work until cgroupv2 is officially supported.

  • With only no-cgroups = true set:
# docker run --rm --gpus all --device /dev/nvidiactl --device /dev/nvidia0 ubuntu:20.04 nvidia-smi

NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.
  • After also setting ldconfig = "/sbin/ldconfig":
# cat /etc/nvidia-container-runtime/config.toml
...
no-cgroups = true
...
ldconfig = "/sbin/ldconfig"
...
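
For context, this is likely because the packaged default config points ldconfig at @/sbin/ldconfig.real, a path that exists on Ubuntu but not on Debian, so the injected driver libraries never make it into the container's linker cache. A minimal sketch of the edit, assuming the stock config keeps an uncommented ldconfig line:

sudo sed -i 's|^ldconfig = .*|ldconfig = "/sbin/ldconfig"|' /etc/nvidia-container-runtime/config.toml
sudo systemctl restart docker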

@chaiyd
Author

chaiyd commented Oct 12, 2021

> I tried to install on Debian 11 but I got this error message:
>
> # Unsupported distribution!
> # Check https://nvidia.github.io/nvidia-docker
>
> I checked, and Debian 11 is not supported yet.
>
> When can we expect it?
>
> Linux 5.10.0-9-amd64 SMP Debian 5.10.70-1 (2021-09-30) x86_64 GNU/Linux Docker: 20.10.9 NV: 470.57.02-1

  • You can try the following, but pay attention to compatibility with Debian 11:
# distribution=$(. /etc/os-release;echo debian10) \
   && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
   && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
# apt update 
# apt install nvidia-docker2
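
A quick way to confirm the repo was picked up before installing is, for example:

# apt-cache policy nvidia-docker2 libnvidia-container1

which should list candidate versions coming from nvidia.github.io.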

@redskinhu

redskinhu commented Oct 12, 2021

Sounds good, worth a try. Tomorrow.
I thought about something like this, but I didn't know/couldn't find what to substitute for $distribution.

Thx

@redskinhu

Hi

apt install nvidia-docker2 installs it without any error.
But when I want to do the next step on my setup ( blakeblackshear/frigate#1847 (comment) )

sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi

I got this error:

docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: container error: cgroup subsystem devices not found: unknown.

Any idea?

@klueska
Contributor

klueska commented Oct 20, 2021

Yes, this is due to libnvidia-container not supporting cgroupv2 as discussed here:
#1549 (comment)

@redskinhu

Thanks for the info.

@Lecrapouille

@redskinhu

> apt install nvidia-docker2 installs it without any error.

How have you set up your repo list? I have the non-free list, but nvidia-docker2 is unknown on my debian-11.

@elezar
Member

elezar commented Oct 27, 2021

@Lecrapouille you should be able to download / install the Debian 10 packages from the repository.

@chaiyd
Author

chaiyd commented Oct 27, 2021

> @redskinhu
>
> > apt install nvidia-docker2 installs it without any error.
>
> How have you set up your repo list? I have the non-free list, but nvidia-docker2 is unknown on my debian-11.

# distribution=$(. /etc/os-release;echo debian10) \
   && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
   && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
# apt update 
# apt install nvidia-docker2

@redskinhu

redskinhu commented Oct 27, 2021

Hello

I did the install based on this guide:
blakeblackshear/frigate#1847 (comment)

@Lecrapouille

Lecrapouille commented Oct 27, 2021

Thanks! I finally get GPU info after applying the above.

PS: @redskinhu your link is not working

@redskinhu

Corrected, Thx

@frederico-klein

> > > • It is not enough to just disable cgroup support with no-cgroups = true.
> > > • At present, to run nvidia-docker normally, Debian 11 needs cgroups v2 completely disabled, or Docker has to run the container as privileged, which is not what we want.
> >
> > Are you saying that setting no-cgroups = true still results in the container failing to start, or that you just don't get the devices injected in as you do when no-cgroups = false?
> > If the container fails to start, then that's a bug and I'd like to know what the error message is.
> > If it starts, but the devices are not present, then that is by design, and you will need to do manual injection of the devices as outlined in the final step of my previous comment.
> > It's obviously not an ideal solution, but it's a way to make things work until cgroupv2 is officially supported.
>
> * With only `no-cgroups = true` set:
> # docker run --rm --gpus all --device /dev/nvidiactl --device /dev/nvidia0 ubuntu:20.04 nvidia-smi
>
> NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
> Please also try adding directory that contains libnvidia-ml.so to your system PATH.
> * After also setting `ldconfig = "/sbin/ldconfig"`:
> # cat /etc/nvidia-container-runtime/config.toml
> ...
> no-cgroups = true
> ...
> ldconfig = "/sbin/ldconfig"
> ...

Installing the debian10 nvidia-docker2 AND editing /etc/nvidia-container-runtime/config.toml as above, WITH the

 --device /dev/nvidiactl --device /dev/nvidia0 

flags on the docker command line (and restarting the docker service) did it for me.

@chaiyd
Author

chaiyd commented Nov 18, 2021

> > • It is not enough to just disable cgroup support with no-cgroups = true.
> > • At present, to run nvidia-docker normally, Debian 11 needs cgroups v2 completely disabled, or Docker has to run the container as privileged, which is not what we want.
> >
> > Are you saying that setting no-cgroups = true still results in the container failing to start, or that you just don't get the devices injected in as you do when no-cgroups = false?
> > If the container fails to start, then that's a bug and I'd like to know what the error message is.
> > If it starts, but the devices are not present, then that is by design, and you will need to do manual injection of the devices as outlined in the final step of my previous comment.
> > It's obviously not an ideal solution, but it's a way to make things work until cgroupv2 is officially supported.
> >
> > * With only `no-cgroups = true` set:
> > # docker run --rm --gpus all --device /dev/nvidiactl --device /dev/nvidia0 ubuntu:20.04 nvidia-smi
> >
> > NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
> > Please also try adding directory that contains libnvidia-ml.so to your system PATH.
> > * After also setting `ldconfig = "/sbin/ldconfig"`:
> > # cat /etc/nvidia-container-runtime/config.toml
> > ...
> > no-cgroups = true
> > ...
> > ldconfig = "/sbin/ldconfig"
> > ...
>
> Installing the debian10 nvidia-docker2 AND editing /etc/nvidia-container-runtime/config.toml as above, WITH the
>
>  --device /dev/nvidiactl --device /dev/nvidia0 
>
> flags on the docker command line (and restarting the docker service) did it for me.

@frederico-klein Yeah, this works, but it is not an ideal solution; we need a better one.

@galosre

galosre commented Nov 20, 2021

I'm not sure about this part:

> editing /etc/nvidia-container-runtime/config.toml
> with the
> --device /dev/nvidiactl --device /dev/nvidia0

So I put this in /etc/nvidia-container-runtime/config.toml, and I also have
no-cgroups = true
...
ldconfig = "/sbin/ldconfig"
but in the Portainer runtime dropdown I do not have NVIDIA?

@chaiyd
Author

chaiyd commented Nov 20, 2021

> I'm not sure about this part:
>
> > editing /etc/nvidia-container-runtime/config.toml
> > with the
> > --device /dev/nvidiactl --device /dev/nvidia0
>
> So I put this in /etc/nvidia-container-runtime/config.toml, and I also have
> no-cgroups = true
> ...
> ldconfig = "/sbin/ldconfig"
> but in the Portainer runtime dropdown I do not have NVIDIA?

Are you sure you have installed CUDA or the NVIDIA driver?

@galosre

galosre commented Nov 21, 2021

NVIDIA-SMI 495.44 is running; what am I missing?
OK, I got nvidia in the Portainer runtime dropdown, but it's still not working.
docker run --rm --gpus all --privileged -v /dev:/dev nvidia/cuda:11.0-base nvidia-smi
Sun Nov 21 03:25:37 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44 Driver Version: 495.44 CUDA Version: 11.5 |
But if I add --device /dev/nvidiactl --device /dev/nvidia0 to config.toml, I get:
docker run --rm --gpus all --privileged -v /dev:/dev nvidia/cuda:11.0-base nvidia-smi
docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: couldn't open default configuration file: Near line 17 (last key parsed 'nvidia-container-cli.--device'): expected key separator '=', but got '/' instead: unknown.
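
(For what it's worth, that parse error is because the --device flags are docker command-line options, not config.toml keys; config.toml should only hold TOML settings such as no-cgroups and ldconfig, with the devices passed on the docker run line instead, e.g.:

docker run --rm --gpus all --device /dev/nvidiactl --device /dev/nvidia0 nvidia/cuda:11.0-base nvidia-smi

assuming the image and device paths from the comments above.)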

@galosre

galosre commented Nov 21, 2021

Running my container in privileged mode got it working

@chaiyd
Author

chaiyd commented Nov 22, 2021

> Running my container in privileged mode got it working

ok

@galosre

galosre commented Dec 2, 2021

Today my system was updated with:
nvidia-container-toolkit:amd64 1.6.0-1 → 1.7.0-1
libnvidia-container-tools:amd64 1.6.0-1 → 1.7.0-1
libnvidia-container1:amd64 1.6.0-1 → 1.7.0-1
So how do I use those packages now?

@chenhengqi

Have to mention here, I use cgroup v2 :)

$ ctr run --rm --runtime=io.containerd.runtime.v1.linux --env NVIDIA_VISIBLE_DEVICES=0 nvcr.io/nvidia/cuda:11.4.2-base-ubuntu20.04 test nvidia-smi -L
ctr: cgroups: cgroup mountpoint does not exist: unknown

@klueska
Contributor

klueska commented Dec 3, 2021

> Have to mention here, I use cgroup v2 :)
>
> $ ctr run --rm --runtime=io.containerd.runtime.v1.linux --env NVIDIA_VISIBLE_DEVICES=0 nvcr.io/nvidia/cuda:11.4.2-base-ubuntu20.04 test nvidia-smi -L
> ctr: cgroups: cgroup mountpoint does not exist: unknown

Yes, I realize this, which is why you want to set no-cgroups = true in the config.toml.
The fact that you were able to run the above and get the error you see means that you are finally running through the runtime (and not the CLI directly), so you should now be able to put the workaround in place and get things to work.

@chenhengqi

@klueska The following workaround does not work, because the --no-cgroups option only applies to the configure subcommand, NOT to the cli itself.

> It's a hack, but if you want this to work with ctr you will need to wrap the nvidia-container-cli as
>
> $ mv /usr/bin/nvidia-container-cli /usr/bin/nvidia-container-cli.real
> ... create a wrapper /usr/bin/nvidia-container-cli...
>
> $ cat /usr/bin/nvidia-container-cli
> #!/usr/bin/env bash
> exec /usr/bin/nvidia-container-cli.real --no-cgroups "$@"

@klueska
Contributor

klueska commented Dec 6, 2021

You're right, I didn't test my suggestion (since it really is a pretty brutal hack). You would need to do something more sophisticated like:

#!/usr/bin/env bash

if [[ "${*}" != *"configure"* ]]; then
        exec /usr/bin/nvidia-container-cli.real "$@"
fi

exec /usr/bin/nvidia-container-cli.real "${@:1:$#-1}" --no-cgroups "${@: -1}"

(which I did test this time).
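
If it helps, one way to drop that wrapper in place (assuming the real binary was already moved aside as above):

sudo tee /usr/bin/nvidia-container-cli > /dev/null <<'EOF'
#!/usr/bin/env bash
if [[ "${*}" != *"configure"* ]]; then
        exec /usr/bin/nvidia-container-cli.real "$@"
fi
exec /usr/bin/nvidia-container-cli.real "${@:1:$#-1}" --no-cgroups "${@: -1}"
EOF
sudo chmod +x /usr/bin/nvidia-container-cli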

@klueska
Contributor

klueska commented Dec 8, 2021

We now have an RC of libnvidia-container out that adds support for cgroupv2.

If you would like to try it out, make sure to add the experimental repo to your package sources and install the latest packages:

For DEBs

sudo sed -i -e '/experimental/ s/^#//g' /etc/apt/sources.list.d/libnvidia-container.list
sudo apt-get update
sudo apt-get install -y libnvidia-container-tools libnvidia-container1

For RPMs

sudo yum-config-manager --enable libnvidia-container-experimental
sudo yum install -y libnvidia-container-tools libnvidia-container1
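
Once installed, a quick sanity check (shown for the DEB side) might be:

dpkg -l libnvidia-container1 libnvidia-container-tools
stat -fc %T /sys/fs/cgroup
nvidia-container-cli info

The stat call prints cgroup2fs on a cgroup-v2-only host; nvidia-container-cli info just confirms the CLI can see the driver and devices.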

@chenhengqi

CentOS 8 + Cgroup v2 + containerd

$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
>    && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
$ dnf config-manager --set-enabled libnvidia-container-experimental
$ dnf install -y libnvidia-container-tools libnvidia-container1 nvidia-container-runtime

The following commands now work:

$ ctr image pull docker.io/nvidia/cuda:11.0-base
$ ctr run --rm --gpus 0 -t docker.io/nvidia/cuda:11.0-base cuda-11.0-base nvidia-smi

Cheers ! 💯

@chaiyd
Author

chaiyd commented Dec 9, 2021

> CentOS 8 + Cgroup v2 + containerd
>
> $ distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
> >    && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
> $ dnf config-manager --set-enabled libnvidia-container-experimental
> $ dnf install -y libnvidia-container-tools libnvidia-container1 nvidia-container-runtime
>
> The following commands now work:
>
> $ ctr image pull docker.io/nvidia/cuda:11.0-base
> $ ctr run --rm --gpus 0 -t docker.io/nvidia/cuda:11.0-base cuda-11.0-base nvidia-smi
>
> Cheers ! 💯

What about Debian 11 — does it work there too?

@klueska
Contributor

klueska commented Dec 9, 2021

This RC was tested almost exclusively on a Debian 11 system, so I'd be surprised if it's not working there.

That said, we don't yet officially support Debian 11, so you will need to add the apt repo for Debian 10:

distribution=$(. /etc/os-release;echo debian10)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list

And (as mentioned above) to get access to the RC package, you will need to enable the experimental repo:

sudo sed -i -e '/experimental/ s/^#//g' /etc/apt/sources.list.d/libnvidia-container.list
sudo apt-get update

And install it:

sudo apt-get  install -y libnvidia-container-tools libnvidia-container1

You should then see the following versions installed:

$ sudo dpkg --list libnvidia-container1 libnvidia-container-tools
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                                     Version                   Architecture              Description
+++-========================================-=========================-=========================-=====================================================================================
ii  libnvidia-container-tools                1.8.0~rc.1-1              amd64                     NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64               1.8.0~rc.1-1              amd64                     NVIDIA container runtime library

@chenhengqi

When will there be a GA release of this new libnvidia-container ?

@klueska
Contributor

klueska commented Dec 9, 2021

Sometime in the new year after it has been thoroughly tested and certified.

@chaiyd
Author

chaiyd commented Dec 9, 2021

This is very good news; looking forward to the GA. At the same time, I also hope you will consider supporting Debian on ARM.

@klueska
Contributor

klueska commented Dec 9, 2021

In general this should also work on ARM, though you'll likely need to set the distribution to ubuntu20.04 to get the packages you want since we don't yet "officially support" debian systems on ARM. (All that not "officially supporting" means is that we don't do thorough testing on these platforms; often the same packages from other distributions work there.)

From my testing yesterday on a g5g.xlarge instance on EC2 with ubuntu 20.04:

Note: The initial setup tests against cgroupv1 which is why the existing stable releases work.

Setup

$ history
    1  wget https://us.download.nvidia.com/tesla/470.82.01/NVIDIA-Linux-aarch64-470.82.01.run
    2  chmod a+x NVIDIA-Linux-aarch64-470.82.01.run
    3  sudo apt-get update
    4  sudo apt-get install build-essential linux-headers-5.11.0-1020-aws
    5  sudo ./NVIDIA-Linux-aarch64-470.82.01.run
    6  curl https://get.docker.com | sh   && sudo systemctl --now enable docker
    7  distribution=$(. /etc/os-release;echo $ID$VERSION_ID)    && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -    && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
    8  sudo apt-get update
    9  sudo apt-get install -y nvidia-docker2
   10  sudo systemctl restart docker

Baseline test with latest stable libnvidia-container1-1.7.0 and libnvidia-container-tools-1.7.0:

$ sudo docker run --rm --gpus all nvidia/cuda:11.4.2-base-ubuntu20.04 nvidia-smi
Wed Dec  8 11:37:16 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA T4G          Off  | 00000000:00:1F.0 Off |                    0 |
| N/A   70C    P0    19W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Install new libnvidia-container1-1.8.0~rc.1 and libnvidia-container-tools-1.8.0~rc.1:

$ sudo sed -i -e '/experimental/ s/^#//g' /etc/apt/sources.list.d/libnvidia-container.list
$ sudo apt-get update
$ sudo apt-get install -y libnvidia-container-tools libnvidia-container1

Same baseline test with newer packages:

$ sudo docker run --rm --gpus all nvidia/cuda:11.4.2-base-ubuntu20.04 nvidia-smi
Wed Dec  8 11:41:18 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA T4G          Off  | 00000000:00:1F.0 Off |                    0 |
| N/A   53C    P0    17W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Remove libnvidia-container-go.so.1.8.0 and rerun (to verify we are going through the new nvcgo implementation of cgroup manipulation):

$ sudo rm -rf /usr/lib/aarch64-linux-gnu/libnvidia-container-go.so.1.8.0

Rerun baseline test

$ sudo docker run --rm --gpus all nvidia/cuda:11.4.2-base-ubuntu20.04 nvidia-smi
docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: load library failed: libnvidia-container-go.so.1: cannot open shared object file: no such file or directory: unknown.

Restore libnvidia-container-go.so.1.8.0:

$ sudo apt-get --reinstall install libnvidia-container1

Run a series of tests on other container OSs:

$ sudo docker run --rm --gpus all nvidia/cuda:11.4.2-base-ubuntu20.04 nvidia-smi -L
GPU 0: NVIDIA T4G (UUID: GPU-ac104454-a5f6-5732-dc35-225062fdbeb7)
$ sudo docker run --rm --gpus all ubuntu:18.04 nvidia-smi -L
GPU 0: NVIDIA T4G (UUID: GPU-ac104454-a5f6-5732-dc35-225062fdbeb7)
$ sudo docker run --rm --gpus all centos:7 nvidia-smi -L
GPU 0: NVIDIA T4G (UUID: GPU-ac104454-a5f6-5732-dc35-225062fdbeb7)

Follow the instructions at the following link to enable cgroupv2:

https://rootlesscontaine.rs/getting-started/common/cgroup2

Reboot the machine and enable logging for the toolkit:

$ sudo vi /etc/nvidia-container-runtime/config.toml
...
- #debug = "/var/log/nvidia-container-toolkit.log"
+ debug = "/var/log/nvidia-container-toolkit.log"
...

Rerun the three tests from above:

$ sudo docker run --rm --gpus all nvidia/cuda:11.4.2-base-ubuntu20.04 nvidia-smi -L
GPU 0: NVIDIA T4G (UUID: GPU-ac104454-a5f6-5732-dc35-225062fdbeb7)
$ sudo docker run --rm --gpus all ubuntu:18.04 nvidia-smi -L
GPU 0: NVIDIA T4G (UUID: GPU-ac104454-a5f6-5732-dc35-225062fdbeb7)
$ sudo docker run --rm --gpus all centos:7 nvidia-smi -L
GPU 0: NVIDIA T4G (UUID: GPU-ac104454-a5f6-5732-dc35-225062fdbeb7)

Verify in the logs that these were run with cgroupv2 detected:

$ grep "detected cgroup" /var/log/nvidia-container-toolkit.log
I1208 11:59:35.130746 1782 nvc_container.c:272] detected cgroupv2
I1208 11:59:42.452570 1926 nvc_container.c:272] detected cgroupv2
I1208 11:59:49.526292 2071 nvc_container.c:272] detected cgroupv2

@chaiyd
Author

chaiyd commented Dec 9, 2021

Once I have an ARM device, I'll try it again.

@TaridaGeorge

> sudo sed -i -e '/experimental/ s/^#//g' /etc/apt/sources.list.d/libnvidia-container.list

sed: can't read /etc/apt/sources.list.d/libnvidia-container.list: No such file or directory

What am I doing wrong?

@klueska
Contributor

klueska commented Dec 9, 2021

You might have an nvidia-docker.list or an nvidia-container-runtime.list file instead of a libnvidia-container.list file.

The command will be the same; just swap out the file name at the end.

@TaridaGeorge

Yup. Thank you!

@TaridaGeorge

TaridaGeorge commented Dec 11, 2021

If I use the experimental versions, do I still need to set no-cgroups to true and ldconfig to "/sbin/ldconfig" in /etc/nvidia-container-runtime/config.toml? Also, do I still need to manually inject any nvidia device nodes I need access to on the docker command line?

@chenhengqi

> If I use the experimental versions, do I still need to set no-cgroups to true and ldconfig to "/sbin/ldconfig" in /etc/nvidia-container-runtime/config.toml? Also, do I still need to manually inject any nvidia device nodes I need access to on the docker command line?

FYI, on CentOS 8, no extra configs are needed. Everything works fine.

@chenhengqi

@klueska @elezar

Hello, do you happen to know how to install NVIDIA drivers (version 418) on CentOS 8?

The installer (from http://download.nvidia.com/XFree86/Linux-x86_64/418.113/NVIDIA-Linux-x86_64-418.113-no-compat32.run ) failed with:

$ ./NVIDIA-Linux-x86_64-418.113-no-compat32.run --ui=none --disable-nouveau --no-install-libglvnd --dkms -s
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 418.113........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

WARNING: One or more modprobe configuration files to disable Nouveau are already present at:
         /usr/lib/modprobe.d/nvidia-installer-disable-nouveau.conf, /etc/modprobe.d/nvidia-installer-disable-nouveau.conf.
         Please be sure you have rebooted your system since these files were written.  If you have rebooted, then Nouveau
         may be enabled for other reasons, such as being included in the system initial ramdisk or in your X configuration
         file.  Please consult the NVIDIA driver README and your Linux distribution's documentation for details on how to
         correctly disable the Nouveau kernel driver.


WARNING: nvidia-installer was forced to guess the X library path '/usr/lib64' and X module path '/usr/lib64/xorg/modules';
         these paths were not queryable from the system.  If X fails to find the NVIDIA X driver module, please install
         the `pkg-config` utility and the X.Org SDK/development package for your distribution and reinstall the driver.


ERROR: Failed to run `/usr/sbin/dkms build -m nvidia -v 418.113 -k 4.18.0-348.2.1.el8_5.x86_64`: 
       Building module:
       cleaning build area...
       'make' -j20 NV_EXCLUDE_BUILD_MODULES='' KERNEL_UNAME=4.18.0-348.2.1.el8_5.x86_64 IGNORE_CC_MISMATCH=''
       modules....(bad exit status: 2)
       Error! Bad return status for module build on kernel: 4.18.0-348.2.1.el8_5.x86_64 (x86_64)
       Consult /var/lib/dkms/nvidia/418.113/build/make.log for more information.


ERROR: Failed to install the kernel module through DKMS. No kernel module was installed; please try installing again
       without DKMS, or check the DKMS logs for more information.


ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find
       suggestions on fixing installation problems in the README available on the Linux driver download page at
       www.nvidia.com.
DKMS make.log for nvidia-418.113 for kernel 4.18.0-348.2.1.el8_5.x86_64 (x86_64)
Tue Dec 14 19:54:24 CST 2021
make[1]: Entering directory '/usr/src/kernels/4.18.0-348.2.1.el8_5.x86_64'
make[2]: Entering directory '/usr/src/kernels/4.18.0-348.2.1.el8_5.x86_64'
  SYMLINK /var/lib/dkms/nvidia/418.113/build/nvidia/nv-kernel.o
  SYMLINK /var/lib/dkms/nvidia/418.113/build/nvidia-modeset/nv-modeset-kernel.o
 CONFTEST: INIT_WORK
 CONFTEST: remap_pfn_range
 CONFTEST: hash__remap_4k_pfn
 CONFTEST: follow_pfn
 CONFTEST: vmap
 CONFTEST: set_pages_uc
 CONFTEST: list_is_first
 CONFTEST: set_memory_uc
 CONFTEST: set_memory_array_uc
 CONFTEST: change_page_attr
 CONFTEST: pci_get_class
 CONFTEST: pci_choose_state
 CONFTEST: vm_insert_page
 CONFTEST: acpi_device_id
 CONFTEST: acquire_console_sem
 CONFTEST: console_lock
 CONFTEST: kmem_cache_create
 CONFTEST: on_each_cpu
 CONFTEST: smp_call_function
 CONFTEST: acpi_evaluate_integer
 CONFTEST: ioremap_cache
 CONFTEST: ioremap_wc
 CONFTEST: acpi_walk_namespace
 CONFTEST: pci_domain_nr
 CONFTEST: pci_dma_mapping_error
 CONFTEST: sg_alloc_table
 CONFTEST: sg_init_table
 CONFTEST: pci_get_domain_bus_and_slot
 CONFTEST: get_num_physpages
 CONFTEST: efi_enabled
 CONFTEST: proc_create_data
 CONFTEST: pde_data
 CONFTEST: proc_remove
 CONFTEST: pm_vt_switch_required
 CONFTEST: xen_ioemu_inject_msi
 CONFTEST: phys_to_dma
 CONFTEST: get_dma_ops
 CONFTEST: write_cr4
 CONFTEST: of_get_property
 CONFTEST: of_find_node_by_phandle
 CONFTEST: of_node_to_nid
 CONFTEST: pnv_pci_get_npu_dev
 CONFTEST: of_get_ibm_chip_id
 CONFTEST: for_each_online_node
 CONFTEST: node_end_pfn
 CONFTEST: pci_bus_address
 CONFTEST: pci_stop_and_remove_bus_device
 CONFTEST: pci_remove_bus_device
 CONFTEST: request_threaded_irq
 CONFTEST: register_cpu_notifier
 CONFTEST: cpuhp_setup_state
 CONFTEST: dma_map_resource
 CONFTEST: backlight_device_register
 CONFTEST: register_acpi_notifier
 CONFTEST: timer_setup
 CONFTEST: pci_enable_msix_range
 CONFTEST: compound_order
 CONFTEST: do_gettimeofday
 CONFTEST: dma_direct_map_resource
 CONFTEST: vmf_insert_pfn
 CONFTEST: remap_page_range
 CONFTEST: address_space_init_once
 CONFTEST: kbasename
 CONFTEST: fatal_signal_pending
 CONFTEST: list_cut_position
 CONFTEST: vzalloc
 CONFTEST: wait_on_bit_lock_argument_count
 CONFTEST: bitmap_clear
 CONFTEST: usleep_range
 CONFTEST: radix_tree_empty
 CONFTEST: radix_tree_replace_slot
 CONFTEST: pnv_npu2_init_context
 CONFTEST: drm_dev_unref
 CONFTEST: drm_reinit_primary_mode_group
 CONFTEST: get_user_pages_remote
 CONFTEST: get_user_pages
 CONFTEST: drm_gem_object_lookup
 CONFTEST: drm_atomic_state_ref_counting
 CONFTEST: drm_driver_has_gem_prime_res_obj
 CONFTEST: drm_atomic_helper_connector_dpms
 CONFTEST: drm_connector_funcs_have_mode_in_name
 CONFTEST: drm_framebuffer_get
 CONFTEST: drm_gem_object_get
 CONFTEST: drm_dev_put
 CONFTEST: is_export_symbol_gpl_of_node_to_nid
 CONFTEST: is_export_symbol_present_swiotlb_map_sg_attrs
 CONFTEST: is_export_symbol_present_swiotlb_dma_ops
 CONFTEST: i2c_adapter
 CONFTEST: pm_message_t
 CONFTEST: irq_handler_t
 CONFTEST: acpi_device_ops
 CONFTEST: acpi_op_remove
 CONFTEST: outer_flush_all
 CONFTEST: proc_dir_entry
 CONFTEST: scatterlist
 CONFTEST: sg_table
 CONFTEST: file_operations
 CONFTEST: vm_operations_struct
 CONFTEST: atomic_long_type
 CONFTEST: file_inode
 CONFTEST: task_struct
 CONFTEST: kuid_t
 CONFTEST: dma_ops
 CONFTEST: swiotlb_dma_ops
 CONFTEST: dma_map_ops
 CONFTEST: noncoherent_swiotlb_dma_ops
 CONFTEST: vm_fault_present
 CONFTEST: vm_fault_has_address
 CONFTEST: backlight_properties_type
 CONFTEST: vmbus_channel_has_ringbuffer_page
 CONFTEST: kmem_cache_has_kobj_remove_work
 CONFTEST: sysfs_slab_unlink
 CONFTEST: fault_flags
 CONFTEST: atomic64_type
 CONFTEST: address_space
 CONFTEST: backing_dev_info
 CONFTEST: mm_context_t
 CONFTEST: vm_ops_fault_removed_vma_arg
 CONFTEST: node_states_n_memory
 CONFTEST: drm_bus_present
 CONFTEST: drm_bus_has_bus_type
 CONFTEST: drm_bus_has_get_irq
 CONFTEST: drm_bus_has_get_name
 CONFTEST: drm_driver_has_legacy_dev_list
 CONFTEST: drm_driver_has_set_busid
 CONFTEST: drm_crtc_state_has_connectors_changed
 CONFTEST: drm_init_function_args
 CONFTEST: drm_mode_connector_list_update_has_merge_type_bits_arg
 CONFTEST: drm_helper_mode_fill_fb_struct
 CONFTEST: drm_master_drop_has_from_release_arg
 CONFTEST: drm_driver_unload_has_int_return_type
 CONFTEST: kref_has_refcount_of_type_refcount_t
 CONFTEST: drm_atomic_helper_crtc_destroy_state_has_crtc_arg
 CONFTEST: drm_crtc_helper_funcs_has_atomic_enable
 CONFTEST: drm_mode_object_find_has_file_priv_arg
 CONFTEST: dma_buf_owner
 CONFTEST: drm_connector_list_iter
 CONFTEST: drm_atomic_helper_swap_state_has_stall_arg
 CONFTEST: drm_driver_prime_flag_present
 CONFTEST: dom0_kernel_present
 CONFTEST: nvidia_vgpu_hyperv_available
 CONFTEST: nvidia_vgpu_kvm_build
 CONFTEST: nvidia_grid_build
 CONFTEST: drm_available
 CONFTEST: drm_atomic_available
 CONFTEST: is_export_symbol_gpl_refcount_inc
 CONFTEST: is_export_symbol_gpl_refcount_dec_and_test
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/nv-frontend.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/nv-instance.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/nv.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/nv-acpi.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/nv-chrdev.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/nv-cray.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/nv-dma.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/nv-gvi.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/nv-i2c.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/nv-mempool.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/nv-mmap.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/nv-p2p.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/nv-pat.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/nv-procfs.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/nv-usermap.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/nv-vm.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/nv-vtophys.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/os-interface.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/os-mlock.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/os-pci.o
/var/lib/dkms/nvidia/418.113/build/nvidia/nv.c: In function 'nvidia_probe':
/var/lib/dkms/nvidia/418.113/build/nvidia/nv.c:4129:5: error: implicit declaration of function 'vga_tryget'; did you mean 'vga_get'? [-Werror=implicit-function-declaration]
     vga_tryget(VGA_DEFAULT_DEVICE, VGA_RSRC_LEGACY_MASK);
     ^~~~~~~~~~
     vga_get
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/os-registry.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/os-usermap.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/nv-modeset-interface.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/nv-pci-table.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/nv-kthread-q.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/nv-kthread-q-selftest.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/nv-memdbg.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/nv-ibmnpu.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/nv-report-err.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/nv-rsync.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/nv-msi.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/nv_uvm_interface.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/nvlink_linux.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia/linux_nvswitch.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia-uvm/uvm_utils.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia-uvm/uvm_common.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia-uvm/uvm_linux.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia-uvm/nvstatus.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia-uvm/nvCpuUuid.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia-uvm/uvm8.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia-uvm/uvm8_tools.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia-uvm/uvm8_global.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia-uvm/uvm8_gpu.o
  CC [M]  /var/lib/dkms/nvidia/418.113/build/nvidia-uvm/uvm8_gpu_isr.o
cc1: some warnings being treated as errors
make[3]: *** [/usr/src/kernels/4.18.0-348.2.1.el8_5.x86_64/scripts/Makefile.build:315: /var/lib/dkms/nvidia/418.113/build/nvidia/nv.o] Error 1
make[3]: *** Waiting for unfinished jobs....
make[2]: *** [/usr/src/kernels/4.18.0-348.2.1.el8_5.x86_64/Makefile:1571: _module_/var/lib/dkms/nvidia/418.113/build] Error 2
make[2]: Leaving directory '/usr/src/kernels/4.18.0-348.2.1.el8_5.x86_64'
make[1]: *** [Makefile:157: sub-make] Error 2
make[1]: Leaving directory '/usr/src/kernels/4.18.0-348.2.1.el8_5.x86_64'
make: *** [Makefile:81: modules] Error 2

@klueska
Contributor

klueska commented Jan 28, 2022

libnvidia-container-1.8.0-rc.2 is now live with some minor updates to fix some edge cases around cgroupv2 support.

Please see NVIDIA/libnvidia-container#111 (comment) for instructions on how to get access to this RC (or wait for the full release at the end of next week).

Note: This does not directly add debian11 support, but you can point to the debian10 repo and install from there for now.

@klueska
Contributor

klueska commented Feb 4, 2022

libnvidia-container-1.8.0 with cgroupv2 support is now GA

Release notes here:
https://github.com/NVIDIA/libnvidia-container/releases/tag/v1.8.0

@klueska
Contributor

klueska commented Feb 4, 2022

Debian 11 support has now been added such that running the following should now work as expected:

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
   && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
   && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
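
And then the usual follow-on steps, with the image tag here just as an example:

sudo apt-get update
sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker
sudo docker run --rm --gpus all nvidia/cuda:11.4.2-base-ubuntu20.04 nvidia-smi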

@klueska klueska closed this as completed Feb 4, 2022
@klueska
Contributor

klueska commented Mar 22, 2022

The newest version of nvidia-docker should resolve these issues with ldconfig not properly setting up the library search path on debian systems before a container gets launched.

Specifically this change in libnvidia-container fixes the issue and is included as part of the latest release:
https://gitlab.com/nvidia/container-toolkit/libnvidia-container/-/merge_requests/141

The latest release packages for the full nvidia-docker stack:

libnvidia-container1-1.9.0
libnvidia-container-tools-1.9.0
nvidia-container-toolkit-1.9.0
nvidia-container-runtime-3.9.0
nvidia-docker-2.10.0
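
On an already-configured system, picking these up should just be a matter of upgrading the installed packages, e.g.:

sudo apt-get update
sudo apt-get install -y --only-upgrade \
    libnvidia-container1 libnvidia-container-tools \
    nvidia-container-toolkit nvidia-container-runtime nvidia-docker2
sudo systemctl restart docker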
