
nvidia-container-runtime fails to run containers with the -it flag #322400

Closed
arximboldi opened this issue Jun 25, 2024 · 40 comments · Fixed by #331071
Labels
0.kind: bug Something is broken

Comments

@arximboldi
Contributor

Describe the bug

When passing --runtime nvidia to docker together with the -it flag, the following error is output:

docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: /nix/store/va74ykggqzmamwh2aj39fxlwzf8csw6s-nvidia-docker/bin/nvidia-container-runtime did not terminate successfully: exit status 125: unknown flag: --root
See 'docker --help'.
...

Steps To Reproduce

Steps to reproduce the behavior:

  1. Run:
docker run --runtime nvidia --rm --gpus all -ti docker.io/nvidia/vulkan:1.3-470 vulkaninfo

Expected behavior

The container runs properly.

Notify maintainers

@averelld (author of the commit that added the virtualisation.docker.enableNvidia option)
@cpcloud (nvidia-docker maintainer)

Metadata

Please run nix-shell -p nix-info --run "nix-info -m" and paste the result.

[user@system:~]$ nix-shell -p nix-info --run "nix-info -m"
 - system: `"x86_64-linux"`
 - host os: `Linux 6.6.32, NixOS, 24.05 (Uakari), 24.05.675.805a384895c6`
 - multi-user?: `yes`
 - sandbox: `yes`
 - version: `nix-env (Nix) 2.18.2`
 - channels(root): `"nixos-24.05, nixos-21.11-21.11, nixos-23.05-23.05, nixos-23.11-23.11, nixos-24.05-24.05, nixos-unstable, nixpkgs"`
 - channels(raskolnikov): `"nixpkgs"`
 - nixpkgs: `/home/raskolnikov/.nix-defexpr/channels/nixpkgs`


@arximboldi arximboldi added the 0.kind: bug Something is broken label Jun 25, 2024
@arximboldi
Contributor Author

Pinging @aaronmondal, as you seemed to have the same issue here; I'm not sure how it got resolved, if at all?
#278969 (comment)

@arximboldi
Contributor Author

arximboldi commented Jun 25, 2024

Tried podman as a workaround and got a different error:

podman run --runtime nvidia --rm --gpus all docker.io/nvidia/vulkan:1.3-470 vulkaninfo
WARN[0000] Using cgroups-v1 which is deprecated in favor of cgroups-v2 with Podman v5 and will be removed in a future version. Set environment variable `PODMAN_IGNORE_CGROUPSV1_WARNING` to hide this warning. 
Error: nvidia: time="2024-06-25T14:32:14+02:00" level=error msg="runc create failed: unable to start container process: error during container init: error mounting \"/nix/store/2vh0agm3gq8nqp9ql2wsg2kbj7nsihp7-nvidia-x11-470.239.06-6.6.32-bin/bin/nvidia-powerd\" to rootfs at \"/usr/bin/nvidia-powerd\": stat /nix/store/2vh0agm3gq8nqp9ql2wsg2kbj7nsihp7-nvidia-x11-470.239.06-6.6.32-bin/bin/nvidia-powerd: no such file or directory": OCI runtime attempted to invoke a command that was not found

Also, for completeness:

  virtualisation.docker.enable = true;
  virtualisation.docker.enableNvidia = true;
  virtualisation.podman.enable = true;
  virtualisation.podman.enableNvidia = true;
  systemd.enableUnifiedCgroupHierarchy = false;
  hardware.nvidia-container-toolkit.enable = true;

@arximboldi arximboldi changed the title nvidia-container-runtima fails to run containers with the -it flag nvidia-container-runtime fails to run containers with the -it flag Jun 25, 2024
@arximboldi
Contributor Author

arximboldi commented Jun 25, 2024

Fixed the last error with: #319201

  hardware.nvidia-container-toolkit.mount-nvidia-executables = false;

Now I get:

sudo podman run --runtime nvidia --rm --gpus all docker.io/nvidia/vulkan:1.3-470 vulkaninfo
WARN[0000] Using cgroups-v1 which is deprecated in favor of cgroups-v2 with Podman v5 and will be removed in a future version. Set environment variable `PODMAN_IGNORE_CGROUPSV1_WARNING` to hide this warning. 
Cannot create Vulkan instance.
This problem is often caused by a faulty installation of the Vulkan driver or attempting to use a GPU that does not support Vulkan.
ERROR at /vulkan-sdk/1.3.204.1/source/Vulkan-Tools/vulkaninfo/vulkaninfo.h:649:vkCreateInstance failed with ERROR_INCOMPATIBLE_DRIVER

Strange, as nvidia-smi works on the host:

nvidia-smi
Tue Jun 25 14:35:30 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.239.06   Driver Version: 470.239.06   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:2D:00.0  On |                  N/A |
|  0%   49C    P5    32W / 270W |    558MiB /  7979MiB |      3%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1944      G   ...xorg-server-21.1.13/bin/X      371MiB |
|    0   N/A  N/A      4823      G   ...26.0/bin/.firefox-wrapped      184MiB |
+-----------------------------------------------------------------------------+

@arximboldi
Contributor Author

Ok, managed to get it to work with podman. The problem persists with docker.

@ereslibre
Member

ereslibre commented Jul 26, 2024

CDI support, included in Docker 25 and available in Podman for a long time now, should be the way to go to expose GPUs to containers.

You can set hardware.nvidia-container-toolkit.enable = true; and set virtualisation.docker.package to at least pkgs.docker_25 (until #330130 merges).

With CDI you can run docker run --rm --device=nvidia.com/gpu=all ubuntu:latest nvidia-smi or podman run --rm --device=nvidia.com/gpu=all ubuntu:latest nvidia-smi.
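
For reference, a minimal sketch of the NixOS side of this setup; the explicit Docker package is an assumption that only applies while the default Docker in nixpkgs still lacks CDI support:

  # Generates the CDI spec consumed by --device=nvidia.com/gpu=...
  hardware.nvidia-container-toolkit.enable = true;
  virtualisation.docker = {
    enable = true;
    # Docker >= 25 is needed for CDI; this override is assumed to be
    # necessary only until the nixpkgs default is bumped.
    package = pkgs.docker_25;
  };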

@geekodour
Contributor

geekodour commented Jul 28, 2024

The following three settings are somehow broken, or at least their error/warning messages are broken; I get the following:

  • hardware.nvidia-container-toolkit.enable
  • virtualisation.containers.cdi.dynamic.nvidia.enable
  • virtualisation.docker.enableNvidia
 - trace: warning: The option
   `virtualisation.containers.cdi.dynamic.nvidia.enable' defined in
   `/nix/store/05mr6xlxl0zxpvpbryjz4vi5m241njd0-source/newnixsetup/machines/hq/configuration.nix'
   has been renamed to `hardware.nvidia-container-toolkit.enable'.
 - trace: warning: You have set virtualisation.docker.enableNvidia. This option is
   deprecated, please set virtualisation.containers.cdi.dynamic.nvidia.enable
   instead.
  • These two warnings come together, which means that setting only hardware.nvidia-container-toolkit.enable should solve everything; but it does not, so we need to keep using the deprecated settings (which also have other issues, and I am not yet able to get everything working perfectly), and now we get these warnings on top.
  • Something seems broken.

The problems:

If I just set hardware.nvidia-container-toolkit.enable, I get could not select device driver "" with capabilities: [[gpu]] (NVIDIA/nvidia-docker#1034), which points to nvidia-container-toolkit not installing the toolkit correctly; I also cannot access nvidia-ctk on my PATH.

NOTE: I am on 24.05 on which some folks have reported issues:
https://discourse.nixos.org/t/nvidia-container-runtime-exit-status-125-unknown/48306/3

@ereslibre
Member

ereslibre commented Jul 28, 2024

Hello @geekodour!

virtualisation.containers.cdi.dynamic.nvidia.enable and virtualisation.docker.enableNvidia are both deprecated.

The right setting is hardware.nvidia-container-toolkit.enable.

If I just set hardware.nvidia-container-toolkit.enable, I get could not select device driver "" with capabilities: [[gpu]] (NVIDIA/nvidia-docker#1034), which points to nvidia-container-toolkit not installing the toolkit correctly; I also cannot access nvidia-ctk on my PATH.

This problem points to having an old Docker version. The default Docker version in nixpkgs 24.05 is currently Docker 24, which does not have support for CDI. This default is being bumped in #330109.

Until #330109 merges, you need to set virtualisation.docker.package to pkgs.docker_25.

However, I agree that virtualisation.docker.enableNvidia suggesting virtualisation.containers.cdi.dynamic.nvidia.enable is an oversight on our side, and it should suggest hardware.nvidia-container-toolkit.enable instead. I have created #330617 to fix this issue.

@SomeoneSerge
Contributor

Until #330109 merges, you need to set virtualisation.docker.package to pkgs.docker_25.

  • Do we not have assertions = [ ... ] for that?
  • We could also change the default when CDI is enabled

@ereslibre
Member

Do we not have assertions = [ ... ] for that?

I'll add one later today.

We could also change the default when CDI is enabled

I'd say it's safer to add an assertion, instead of bumping Docker to a different default without the user noticing.

@geekodour
Contributor

This problem points to having an old Docker version. The default docker version as of now in nixpkgs in 24.05 is Docker 24

I am using pkgs.docker_25; I even tried bumping it up to 27. No dice.

But I was wondering why, after enabling hardware.nvidia-container-toolkit.enable, there's no nvidia-ctk in the PATH and only the latter of the following two commands works:

# works when CDI is disabled, but stops working when CDI is enabled
docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu20.04 nvidia-smi
# works when CDI is enabled, but nvidia-ctk will be missing from PATH
docker run --rm --device=nvidia.com/gpu=all nvidia/cuda:12.0.0-base-ubuntu20.04 nvidia-smi

@geekodour
Contributor

geekodour commented Jul 28, 2024

Thanks a lot for the quick reply @ereslibre! I am trying to use NixOS + Nomad + Docker + NVIDIA (described in detail here: hashicorp/nomad#24990), and I am a bit unclear whether CDI is forward/backward compatible with these orchestration tools (Nomad/Kubernetes), or whether we still need to keep supporting the non-CDI ways, which, as mentioned above, break when CDI is used.

Apologies for dumping a lot of questions into a separate issue; I just got confused!

@geekodour
Contributor

geekodour commented Jul 28, 2024

λ docker info|grep -i runtime                                                                                                                                                                              
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc

After setting daemon settings:

  hardware.nvidia-container-toolkit.enable = true; # TODO: this is the newer one
  virtualisation.docker = {
    # enableNvidia = true; # deprecated usage (but sets runtime and executables correctly in the host)
    package = pkgs.docker_25;
    daemon.settings = {
        default-runtime = "nvidia";
        runtimes.nvidia.path = "${pkgs.nvidia-docker}/bin/nvidia-container-runtime";
        exec-opts = ["native.cgroupdriver=cgroupfs"];
    };
  };
λ docker info|grep -i runtime                                                                                                                                                                                                                                                                                             
 Runtimes: io.containerd.runc.v2 nvidia runc
 Default Runtime: nvidia

Is hardware.nvidia-container-toolkit.enable supposed to update the daemon settings that I had to set manually here?^

@ereslibre
Member

ereslibre commented Jul 28, 2024

But I was wondering why, after enabling hardware.nvidia-container-toolkit.enable, there's no nvidia-ctk in the PATH and only the latter of the following two commands works:

Enabling hardware.nvidia-container-toolkit.enable does not put nvidia-ctk in the PATH. It creates a systemd unit (nvidia-container-toolkit-cdi-generator.service) that calls nvidia-ctk cdi generate and puts the result (after massaging the JSON) into /var/run/cdi/nvidia-container-toolkit.json.

With CDI, you don't specify the GPUs as you do with the runtime wrappers. So the second option, with CDI, is the correct one (--device=nvidia.com/gpu=all), not the first one (--gpus all).

With CDI also, there's no need for wrappers, so this:

    daemon.settings = {
        default-runtime = "nvidia";
        runtimes.nvidia.path = "${pkgs.nvidia-docker}/bin/nvidia-container-runtime";
        exec-opts = ["native.cgroupdriver=cgroupfs"];
    };

can go away entirely. CDI completely removes the need for runtime wrappers. However, I don't know about Nomad; I would be willing to help with whatever is needed, though.
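
In other words, a sketch of what the Docker-related NixOS configuration above could reduce to with CDI (still assuming pkgs.docker_25 until the default Docker is bumped):

  hardware.nvidia-container-toolkit.enable = true;
  virtualisation.docker.package = pkgs.docker_25;
  # No default-runtime, runtimes.nvidia.path or cgroup exec-opts are needed;
  # GPUs are requested per container with --device=nvidia.com/gpu=all.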

@geekodour
Contributor

Thanks a lot @ereslibre for the detailed explanation, super helpful!

I finally got this to work with Nomad using a very hacky combination (described here: hashicorp/nomad#24990), but this combination uses enableNvidia. So even if we show the deprecation warning, I think we should not remove enableNvidia entirely in the next release until this gets fixed, if that's being planned.

Let me know what you think of this!

@ereslibre
Member

@geekodour maybe it's worth opening a specific issue about Nomad + NixOS + NVIDIA. However, without having worked with Nomad at all, I see we have:

  • services.nomad.enableDocker: this sets virtualisation.docker.enable.

From what I understand, you can configure Nomad to use the Docker socket, like so:

plugin "docker" {
  config {
    endpoint = "unix:///var/run/docker.sock"
    ...
  }
  ...
}

You can configure NixOS so that Docker is able to use the nvidia-container-toolkit to generate the CDI spec, and configure Docker to use the experimental CDI feature:

hardware.nvidia-container-toolkit.enable = true;
virtualisation.docker.package = pkgs.docker_25;
...

With the previous settings, the Docker daemon will be correctly configured to use CDI, and since Nomad is using the Docker socket, it should be capable of using GPUs. Now, the only remaining part is how Nomad tells the Docker socket "--device=nvidia.com/gpu=all" when it creates a container through the socket API. Looking at https://github.com/docker/cli/pull/4084/files#diff-e546b85d27f9dbea4fd41e83a58e36bb540bcda3e304355d56c92a119cd5aa2aR568-R575, I assume Nomad needs to pass deviceRequests with something like the following:

[]container.DeviceRequest{
  {
    Driver:    "cdi",
    DeviceIDs: []string{"nvidia.com/gpu=all"},
  },
},

I guess this needs to be adapted in the Docker driver for Nomad.

@leoperegrino

@ereslibre sorry, but I'm trying to set up linuxserver.io/jellyfin with NVIDIA support, and so far I've had no success whatsoever.
According to the linuxserver documentation, one has to use --runtime=nvidia. Also, according to the official Jellyfin docs, it's necessary to add a deploy section, though I'm not entirely sure whether it's needed for the linuxserver.io image.
I'm using Docker Compose, so I have the following:

  jellyfin:
    image: lscr.io/linuxserver/jellyfin:latest
    container_name: jellyfin
    restart: unless-stopped
    environment:
      - PUID=${PUID}
      - PGID=${PGID}
      - TZ=${TZ}
      - JELLYFIN_PublishedServerUrl=${JELLYFIN_URL} # optional
      - NVIDIA_VISIBLE_DEVICES=all
    runtime: nvidia
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: ["gpu"]

I followed the NixOS docs to enable the NVIDIA driver:

	hardware = {
		graphics.enable = true;
		nvidia-container-toolkit.enable = true;
		nvidia = {
			# Modesetting is required.
			modesetting.enable = true;
			powerManagement = {
				enable = false;
				finegrained = false;
			};
			open = false;
			nvidiaSettings = true;
			nvidiaPersistenced = true;
			package = config.boot.kernelPackages.nvidiaPackages.stable;
		};
	};

	services = {
		xserver = {
			enable = false;  # this is a headless machine
			videoDrivers = ["nvidia"];
		};
	};

        virtualisation.docker.package = pkgs.docker_25;

I'm running unstable 24.11.20240804.cb9a96f (Vicuna).
systemctl status nvidia-container-toolkit-cdi-generator.service gives:

● nvidia-container-toolkit-cdi-generator.service - Container Device Interface (CDI) for Nvidia generator
     Loaded: loaded (/etc/systemd/system/nvidia-container-toolkit-cdi-generator.service; enabled; preset: enabled)
     Active: active (exited) since Fri 2024-08-09 17:26:18 -03; 37min ago
 Invocation: 3e560b7cf4444eff9b5fc1cac7e67136
   Main PID: 963 (code=exited, status=0/SUCCESS)
   Mem peak: 41.9M
        CPU: 138ms

Aug 09 17:26:17 coolermaster nvidia-cdi-generator[977]: time="2024-08-09T17:26:17-03:00" level=info msg="Selecting /nix/store/6ysl62ncf5gifwvdky69pzzkgxckhrcm-firmware/lib/firmware/nvidia/555>
Aug 09 17:26:17 coolermaster nvidia-cdi-generator[977]: time="2024-08-09T17:26:17-03:00" level=warning msg="Could not locate nvidia-smi: pattern nvidia-smi not found"
Aug 09 17:26:17 coolermaster nvidia-cdi-generator[977]: time="2024-08-09T17:26:17-03:00" level=warning msg="Could not locate nvidia-debugdump: pattern nvidia-debugdump not found"
Aug 09 17:26:17 coolermaster nvidia-cdi-generator[977]: time="2024-08-09T17:26:17-03:00" level=warning msg="Could not locate nvidia-persistenced: pattern nvidia-persistenced not found"
Aug 09 17:26:17 coolermaster nvidia-cdi-generator[977]: time="2024-08-09T17:26:17-03:00" level=warning msg="Could not locate nvidia-cuda-mps-control: pattern nvidia-cuda-mps-control not found"
Aug 09 17:26:17 coolermaster nvidia-cdi-generator[977]: time="2024-08-09T17:26:17-03:00" level=warning msg="Could not locate nvidia-cuda-mps-server: pattern nvidia-cuda-mps-server not found"
Aug 09 17:26:17 coolermaster nvidia-cdi-generator[977]: time="2024-08-09T17:26:17-03:00" level=warning msg="Could not locate nvidia/xorg/nvidia_drv.so: pattern nvidia/xorg/nvidia_drv.so not f>
Aug 09 17:26:17 coolermaster nvidia-cdi-generator[977]: time="2024-08-09T17:26:17-03:00" level=warning msg="Could not locate nvidia/xorg/libglxserver_nvidia.so.555.58.02: pattern nvidia/xorg/>
Aug 09 17:26:17 coolermaster nvidia-cdi-generator[977]: time="2024-08-09T17:26:17-03:00" level=info msg="Generated CDI spec with version 0.5.0"
Aug 09 17:26:18 coolermaster systemd[1]: Finished Container Device Interface (CDI) for Nvidia generator.

If I run docker run --rm --device=nvidia.com/gpu=all lscr.io/linuxserver/jellyfin nvidia-smi, I do in fact get a running container with the correct nvidia-smi output. But each time I run docker compose up jellyfin I get Error response from daemon: unknown or invalid runtime name: nvidia. What am I missing here? How can I run docker compose and get the container with NVIDIA support?

@geekodour
Contributor

@leoperegrino it seems like you've hit issues similar to mine: #322400 (comment). I think the issue has to be fixed upstream in the package you're using, similar to how some change would be needed in Nomad in my case (I have yet to make that PR to Nomad).

In the meantime, I'd suggest trying the workaround from my comment; it requires a downgrade but worked for me. The long-term fix will be what @ereslibre mentioned in this comment (#322400 (comment)).

@leoperegrino

@geekodour Thank you for the reply. The only way I managed to get it running with Docker Compose was with the following options:

virtualisation.docker.enableNvidia = true;
hardware.graphics.enable32Bits = true;

and removing runtime: nvidia from the docker-compose.yaml snippet I posted in my last comment.

Though I could use the non-deprecated option hardware.nvidia-container-toolkit.enable and run it with a docker run script like this:

# insert your other cli options as needed
exec docker run \
    --detach \
    --device=nvidia.com/gpu=all \
    --name jellyfin \
    --restart unless-stopped \
    --env NVIDIA_VISIBLE_DEVICES=all \
    lscr.io/linuxserver/jellyfin

It is not ideal to have it outside of the Compose project I manage with lots of other integrated services.

Until nvidia-container-toolkit is able to support this Docker Compose use case, I'll unfortunately keep the deprecated enableNvidia option.

@ironicbadger
Contributor

Bit unfortunate this changed mid 24.05. Isn't this supposed to be a stable release?

@kirillrdy
Member

Bit unfortunate this changed mid 24.05. Isn't this supposed to be a stable release?

This was due to a security vulnerability in Docker v24.

If you switch to hardware.nvidia-container-toolkit.enable, Docker with NVIDIA will work again; you just need to use the new argument to pass the GPU, e.g.
--device=nvidia.com/gpu=0

@leiserfg
Contributor

leiserfg commented Sep 8, 2024

To use CDI with docker-compose, one is supposed to do:

    deploy:
      resources:
        reservations:
          devices:
           - driver: cdi
             device_ids:
                - nvidia.com/gpu=all

From what I found after bouncing around on GitHub.

@ereslibre
Member

ereslibre commented Sep 9, 2024

Just to add to #322400 (comment): you can choose specific devices to expose by their ID. Say you had 5 GPUs and wanted to expose only 0, 2 and 4; you can do:

    deploy:
      resources:
        reservations:
          devices:
           - driver: cdi
             device_ids:
                - nvidia.com/gpu=0
                - nvidia.com/gpu=2
                - nvidia.com/gpu=4

@leoperegrino

@ereslibre @leiserfg
Thank you for the hint! I was able to reproduce this with hardware.graphics and hardware.nvidia-container-toolkit enabled, using NixOS unstable (24.11.20240906.574d1ea (Vicuna)).

Using 24.05 (24.05.20240817.c42fcfb (Uakari)) I had no success. I went through a bunch of different errors including:

  • nvidia-container-toolkit-cdi-generator.service giving level=error msg="failed to generate CDI spec: failed to create device CDI specs: failed to initialize NVML: ERROR_LIBRARY_NOT_FOUND"
  • docker compose giving Error response from daemon: CDI device injection failed: unresolvable CDI devices nvidia.com/gpu=all

I didn't track exactly what was causing what, but I finally got it working on unstable with driver: cdi.

@ereslibre
Member

Hello @leoperegrino!

nvidia-container-toolkit-cdi-generator.service giving level=error msg="failed to generate CDI spec: failed to create device CDI specs: failed to initialize NVML: ERROR_LIBRARY_NOT_FOUND"

Maybe you forgot to add nvidia to services.xserver.videoDrivers = ["nvidia"];? I think we should probably do this automatically when you enable hardware.nvidia-container-toolkit.

Error response from daemon: CDI device injection failed: unresolvable CDI devices nvidia.com/gpu=all

This was probably related to the previous error. If the nvidia-container-toolkit cdi generate command fails, the file /var/run/cdi/nvidia-container-toolkit.json will be empty, so when you do docker|podman run --device nvidia.com/gpu=all, the device is not found and you get this "unresolvable CDI devices" error.

@leoperegrino

@ereslibre
I did add the video driver configuration, so I don't think that was the issue. But your comment made me realize one thing that could be the cause. Since I'm deploying this flake on a headless machine, which I access only through SSH, I had services.xserver.enable = false;. I assumed that xserver would only be needed if I had GUI/DEs.

Looking at nixpkgs I see that the drivers are only inserted if xserver is enabled, which should result in missing drivers.

Should one enable xserver in this case, even without GUI applications? If xserver is only needed to insert the drivers, I guess hardware.nvidia-container-toolkit should insert them by itself.

I'm not sure why using unstable with hardware.graphics worked, though.

@ereslibre
Member

ereslibre commented Sep 9, 2024

I did add the video driver configuration, so I don't think that was the issue. But your comment made me realize one thing that could be the cause. Since I'm deploying this flake on a headless machine, which I access only through SSH, I had services.xserver.enable = false;. I assumed that xserver would only be needed if I had GUI/DEs.

This is the same for me: my machine is headless and I access it through SSH. However, I still have to set services.xserver.videoDrivers = ["nvidia"];. I also have services.xserver.enable unset (so the default is false).

What's important is that when you alter services.xserver.videoDrivers and run the activation, you need to restart the machine for that to take effect.

Could it be that you had the nvidia value in the services.xserver.videoDrivers, and that it didn't work until you restarted the machine?

Should one enable xserver in this case even without GUI applications? If the xserver is only needed to insert the drivers, I guess hardware.nvidia-container-toolkit should insert by itself.

No, there's no need for services.xserver.enable = true;, but you still need to add "nvidia" to the services.xserver.videoDrivers list. This will add nvidia_x11 to the environment.systemPackages list, among other things.

nvidiaEnabled = (lib.elem "nvidia" config.services.xserver.videoDrivers);
nvidia_x11 = if nvidiaEnabled || cfg.datacenter.enable then cfg.package else null;
and what follows using them is what changes the behavior :)

@leoperegrino

leoperegrino commented Sep 9, 2024

I did restart but kept getting those errors. If I eventually find anything new, I'll get back here. Thanks!

@ironicbadger
Contributor

ironicbadger commented Sep 9, 2024

Can confirm I had to enable xserver with nvidia and reboot to get this going.

The compose CDI driver snippet was immensely helpful. Thank you for sharing that.

I believe the minimum viable config to be this:

# configuration.nix

  virtualisation = {
    docker = {
      enable = true;
      package = pkgs.docker_25;
      enableNvidia = true;
    };
  };

  hardware.nvidia.modesetting.enable = true;
  hardware.nvidia.nvidiaSettings = true;
  hardware.nvidia.powerManagement.enable = true;
  hardware.nvidia-container-toolkit.enable = true;

  services.xserver = {
    enable = true;
    videoDrivers = [ "nvidia" ];
  };

And here's a working compose snippet for ollama with an A4000

---
services:
  ollama:
    image: ollama/ollama
    container_name: ollama
    volumes:
      - /appdata/apps/ollama:/root/.ollama
    ports:
      - 11434:11434
    deploy:
      resources:
        reservations:
          devices:
            - driver: cdi
              device_ids:
                - nvidia.com/gpu=all
    restart: unless-stopped

@ereslibre
Member

ereslibre commented Sep 10, 2024

Hello @ironicbadger!

Can confirm I had to enable xserver with nvidia and reboot to get this going.

With this, you mean you had to do services.xserver.enable = true; as in the config you show? It shouldn't be necessary to do that if it's a headless machine. Can you confirm?

Besides that, in your config, some comments:

  1. The hardware.nvidia.* settings you have are also optional; there is no need to specify them.
  2. virtualisation.docker.enableNvidia can be removed.
  3. virtualisation.docker.package can be removed if you are on unstable newer than b381163, given that commit sets Docker 27 as the new default.

So, in principle, the minimum viable config in your case should be this (I am referring to nixpkgs at least at ad416d0, which is the one I happen to have checked out at the moment):

  virtualisation.docker.enable = true;
  hardware = {
    graphics.enable = true;
    nvidia.open = false;
    nvidia-container-toolkit.enable = true;
  };
  services.xserver.videoDrivers = [ "nvidia" ];

For a headless machine, and:

  virtualisation.docker.enable = true;
  hardware = {
    graphics.enable = true;
    nvidia.open = false;
    nvidia-container-toolkit.enable = true;
  };
  services.xserver = {
    enable = true;
    videoDrivers = [ "nvidia" ];
  };

For a non-headless machine.

I am trying to confirm all this because I'm working on the documentation for this feature, and I'd like to double-check that I'm not missing anything. If you have the time, it would be amazing if you could confirm that the minimum configuration I showed works for you too. Thank you!

@leiserfg
Contributor

By the way, @ereslibre, I see that nvidia-container-toolkit and libnvidia-container are outdated (the latter is two years old already).

@ereslibre
Member

By the way, @ereslibre, I see that nvidia-container-toolkit and libnvidia-container are outdated (the latter is two years old already).

I'm not super interested in bumping libnvidia-container myself. It's used by the nvidia-container-runtime, but I prefer the CDI approach.

That said, feel free to bump libnvidia-container and nvidia-container-toolkit; I don't see any PRs open bumping them.

@leoperegrino

leoperegrino commented Sep 11, 2024

Is there a possibility of a race condition at boot when setting up the CDI configuration? Every time the computer restarts, the Docker daemon is unable to start the containers that use driver: cdi. systemctl logs:

msg="failed to start container" container=c20614b2a2f6965fa6364584df66d6094b6a44aef0b406a96c0b794db280510d error="could not select device driver \"cdi\" with capabilities: [[]]"

Running docker compose up after boot does work, though.

@ereslibre
Member

ereslibre commented Sep 11, 2024

How are they being executed, @leoperegrino? Please open a new issue, as we are mixing a ton of different and potentially unrelated things here.

If it were a race condition due to the CDI spec not being available, the error would read "unresolvable CDI devices":

❯ docker run -it --rm --device=this-vendor.com/does-not-exist-at=all ubuntu:latest
docker: Error response from daemon: CDI device injection failed: unresolvable CDI devices this-vendor.com/does-not-exist-at=all.

@leoperegrino

@ereslibre yeah, sorry for belaboring this. I used the following snippet, and the Docker daemon should start all containers according to their restart specification.

To use CDI with docker-compose, one is supposed to do:

    deploy:
      resources:
        reservations:
          devices:
           - driver: cdi
             device_ids:
                - nvidia.com/gpu=all

From what I found after bouncing around on GitHub.

@ereslibre
Member

@leoperegrino Please, open a new issue and we can follow up there. Thanks :)

@leoperegrino

leoperegrino commented Sep 13, 2024

@ereslibre Hi, sorry for bumping this thread again, but now I've got some insights related to our previous discussion. I managed to run it on stable using the previously corrected deploy section:

   deploy:
      resources:
        reservations:
          devices:
           - driver: cdi
             device_ids:
                - nvidia.com/gpu=all

I was not able to run it this way on stable before because the hardware.graphics option does not exist there right now. I found out that the option was renamed from hardware.opengl in this commit, which we didn't mention in our debugging.

I noticed that without that option my /var/run/cdi/nvidia-container-toolkit.json was empty despite enabling hardware.nvidia-container-toolkit, which caused errors such as CDI device injection failed: unresolvable CDI devices nvidia.com/gpu=all.

In summary, on NixOS stable (24.05.20240817.c42fcfb), I'm running these options:

  hardware = {
    opengl.enable = true;
    nvidia-container-toolkit.enable = true;
    nvidia = {
       ...
    };
  };

  services = {
    xserver = {
      enable = false;
      videoDrivers = [ "nvidia" ];
    };
  };

I hope this discussion was as helpful for you as it was for me, since you said you are the one documenting this. It might be important to keep in mind the different options for the different branches.

Thanks!

@ereslibre
Member

Thank you for the follow-up, @leoperegrino!

@jackheuberger

Hi @leoperegrino. Sorry to necropost, but can you send your final Jellyfin and hardware configurations? I'm running into the exact same issue, but I'm not sure what I'm missing.

@leoperegrino

leoperegrino commented Dec 13, 2024

@jackheuberger my final configuration is the one in my last comment, even though with that I didn't manage to solve the restart issue. Right now I'm not using that machine, so I can't help much beyond the comments written here. Pay attention to your NixOS version and which option you should use.

@ereslibre
Member

ereslibre commented Dec 13, 2024

@leoperegrino oh, regarding the restart issue, there is #353059. I found a bug in upstream Docker and will report on #353059. I didn't open the PR to upstream Docker yet, because I am working on a small reproducer.
