nvidia-container-runtime fails to run containers with the -it flag #322400

Comments
Pinging @aaronmondal as you seemed to have the same issue here, not sure how it got resolved, if at all?
Tried
Also for completeness:
Fixed the last error with: #319201
Now I get:
Strange as:
Ok, managed to get it to work with
CDI support, included in Docker 25 and available in Podman for a long time now, should be the way to go to expose GPUs to containers. You can set
With CDI you can run
The following 3 settings are somehow broken, or the error/warning messages are broken; I get the following:
The problems: If I just set
NOTE: I am on 24.05, on which some folks have reported issues:
Hello @geekodour!
The right setting is
This problem points to having an old Docker version. The default Docker version in nixpkgs 24.05 as of now is Docker 24, which does not have support for CDI. This default is being bumped in #330109. Until #330109 merges, you need to set
However, I agree that
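For reference, a minimal sketch of what such an interim configuration could look like; this assumes the elided setting is virtualisation.docker.package (not spelled out in the comment above) and that the CDI daemon feature still has to be switched on explicitly, which newer nixpkgs may do for you:

hardware.nvidia-container-toolkit.enable = true;
virtualisation.docker = {
  enable = true;
  package = pkgs.docker_25;             # assumption: any CDI-capable Docker (>= 25)
  daemon.settings.features.cdi = true;  # CDI is an experimental daemon feature in Docker 25
};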
I'll add one later today.
I'd say it's safer to put an assertion in place, instead of bumping Docker to a different default without the user noticing.
I am using pkgs.docker_25; I even tried bumping it up to 27, no dice. But I was wondering why, after enabling
Thanks a lot for the quick reply @ereslibre! I am trying to use nixos+nomad+docker+nvidia, described in detail here: hashicorp/nomad#24990, and I am a bit unclear if CDI is
Apologies for dumping a lot of questions in a separate issue, just got confused!
After setting daemon settings:

hardware.nvidia-container-toolkit.enable = true; # TODO: this is the newer one
virtualisation.docker = {
# enableNvidia = true; # deprecated usage (but sets runtime and executables correctly in the host)
package = pkgs.docker_25;
daemon.settings = {
default-runtime = "nvidia";
runtimes.nvidia.path = "${pkgs.nvidia-docker}/bin/nvidia-container-runtime";
exec-opts = ["native.cgroupdriver=cgroupfs"];
};
};
Is
Enabling
With CDI, you don't specify the GPUs as you do with the runtime wrappers. So the second option with CDI is the correct one (
With CDI there's also no need for wrappers, so this:
can go away entirely. CDI completely removes the need for runtime wrappers. However, I don't know about nomad; I would be willing to help with whatever is needed, though.
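As a quick sanity check (a sketch only: it assumes the generated spec lands under /var/run/cdi/ and that the nvidia-ctk binary from the nvidia-container-toolkit package is on your PATH), you can inspect what CDI would expose:

# Show the generated CDI specification(s); path is an assumption, adjust as needed.
ls /var/run/cdi/
# List the device names a CDI-aware runtime can resolve, e.g. nvidia.com/gpu=0 ... =all.
nvidia-ctk cdi list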
Thanks a lot @ereslibre for the detailed explanation, super helpful! I finally got this to work with nomad with a very hacky combination: hashicorp/nomad#24990 (described here), but this combination uses
Let me know what you think of this!
@geekodour maybe it's worth opening a specific issue about nomad + NixOS + Nvidia. However, and without having worked with nomad at all, I see we have:
From what I understand, you can configure Nomad to use the Docker socket, like so:
You can configure NixOS so that Docker is able to use the nvidia-container-toolkit to generate the CDI spec, and configure Docker to use the CDI experimental feature:
With the previous settings, the docker daemon will be correctly configured to use CDI, and since Nomad is using the Docker socket, it should be capable of using GPUs. Now, the only remaining part is how Nomad tells the Docker socket "--device=nvidia.com/gpu=all" when it creates a container using the socket API. Looking at https://github.com/docker/cli/pull/4084/files#diff-e546b85d27f9dbea4fd41e83a58e36bb540bcda3e304355d56c92a119cd5aa2aR568-R575, I assume Nomad needs to pass the following:

[]container.DeviceRequest{
{
Driver: "cdi",
DeviceIDs: []string{"nvidia.com/gpu=all"},
},
},

I guess this needs to be adapted in the Docker driver for Nomad.
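To make that concrete, here is a minimal sketch (not Nomad's actual driver code; it only assumes the stock Docker Go SDK, and the image, command, and container name are placeholders) of how a client could send that device request over the socket API:

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/docker/docker/api/types/container"
	"github.com/docker/docker/client"
)

func main() {
	ctx := context.Background()

	// Connect to the local Docker socket (DOCKER_HOST is honored via FromEnv).
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		log.Fatal(err)
	}

	// Ask the CDI-enabled daemon to resolve nvidia.com/gpu=all for this container,
	// which is the API-level equivalent of `--device=nvidia.com/gpu=all`.
	hostConfig := &container.HostConfig{
		Resources: container.Resources{
			DeviceRequests: []container.DeviceRequest{
				{
					Driver:    "cdi",
					DeviceIDs: []string{"nvidia.com/gpu=all"},
				},
			},
		},
	}

	// Image, command and name are placeholders for illustration only.
	resp, err := cli.ContainerCreate(ctx,
		&container.Config{Image: "nvidia/cuda:12.3.1-base-ubuntu22.04", Cmd: []string{"nvidia-smi"}},
		hostConfig, nil, nil, "cdi-example")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("created container:", resp.ID)
}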
@ereslibre sorry, but I'm trying to set up linuxserver.io/jellyfin with support for nvidia, but so far I've had no success whatsoever.

jellyfin:
  image: lscr.io/linuxserver/jellyfin:latest
  container_name: jellyfin
  restart: unless-stopped
  environment:
    - PUID=${PUID}
    - PGID=${PGID}
    - TZ=${TZ}
    - JELLYFIN_PublishedServerUrl=${JELLYFIN_URL} # optional
    - NVIDIA_VISIBLE_DEVICES=all
  runtime: nvidia
  deploy:
    resources:
      reservations:
        devices:
          - capabilities: ["gpu"]
I followed the nixos docs to enable the nvidia driver:

hardware = {
graphics.enable = true;
nvidia-container-toolkit.enable = true;
nvidia = {
# Modesetting is required.
modesetting.enable = true;
powerManagement = {
enable = false;
finegrained = false;
};
open = false;
nvidiaSettings = true;
nvidiaPersistenced = true;
package = config.boot.kernelPackages.nvidiaPackages.stable;
};
};
services = {
xserver = {
enable = false; # this is a headless machine
videoDrivers = ["nvidia"];
};
};
virtualisation.docker.package = pkgs.docker_25;

I'm running unstable
If I run
@leoperegrino it seems like you've hit issues similar to mine: #322400 (comment). I think the issue has to be fixed upstream in the package you're using, similar to how some change would be needed in nomad in my case (I am yet to make that PR to nomad). But I'd suggest trying the workaround that I suggested in my comment; it requires a downgrade but worked for me. The long-term fix will be what @ereslibre mentioned in this comment: #322400 (comment).
@geekodour Thank you for the reply. The only way I managed to run it was with

virtualisation.docker.enableNvidia = true;
hardware.graphics.enable32Bits = true;

and removing
Though I could use the non-deprecated option

# insert your other cli options as needed
exec docker run \
--detach \
--device=nvidia.com/gpu=all \
--name jellyfin \
--restart unless-stopped \
--env NVIDIA_VISIBLE_DEVICES=all \
  lscr.io/linuxserver/jellyfin

It is not very ideal to have it outside of the compose project I manage with lots of other integrated services. Until
Bit unfortunate this changed mid
This was due to a security vulnerability in Docker v24. If you switch to
To use cdi with docker-compose one is supposed to do:

deploy:
  resources:
    reservations:
      devices:
        - driver: cdi
          device_ids:
            - nvidia.com/gpu=all

From what I found after bouncing around in GitHub.
Just to add to #322400 (comment), you can choose specific devices to be exposed by their ID. Say you had 5 GPUs; if you wanted to expose only 0, 2 and 4, you can do:

deploy:
  resources:
    reservations:
      devices:
        - driver: cdi
          device_ids:
            - nvidia.com/gpu=0
            - nvidia.com/gpu=2
            - nvidia.com/gpu=4
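Outside of compose, the equivalent ad-hoc invocation would be repeated --device flags (a sketch; it assumes a CDI-enabled daemon, and the CUDA image tag is only an example):

# Expose only GPUs 0, 2 and 4 to a one-off container via CDI.
docker run --rm \
  --device=nvidia.com/gpu=0 \
  --device=nvidia.com/gpu=2 \
  --device=nvidia.com/gpu=4 \
  nvidia/cuda:12.3.1-base-ubuntu22.04 nvidia-smi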
@ereslibre @leiserfg Using
Didn't track exactly what was causing what, but finally got it working in unstable and with
Hello @leoperegrino!
Maybe you forgot to add
This was probably related to the previous error. If the
@ereslibre Looking at nixpkgs I see that the drivers are only inserted if xserver is enabled, which would otherwise result in missing drivers. Should one enable xserver in this case even without GUI applications? If the xserver is only needed to insert the drivers, I guess
I'm not sure why using unstable with
This is the same for me; my machine is headless and I access it through SSH. However, I still have to set
What's important is that when you alter
Could it be that you had the
No, no need to
nixpkgs/nixos/modules/hardware/video/nvidia.nix (lines 8 to 9 in f6b2548)
I did restart but kept getting those errors still. If I eventually find anything new, I'll get back here. Thanks!
Can confirm I had to enable xserver with nvidia and reboot to get this going. The compose CDI driver snippet was immensely helpful, thank you for sharing that. I believe the minimum viable config is this:
And here's a working compose snippet for ollama with an A4000
Hello @ironicbadger!
With this, you mean you had to do
Besides that, in your config, some comments:
So, in principle, the minimum viable config in your case should be (I am referring to nixpkgs at least at ad416d0, the one I happen to have checked out at the moment):

virtualisation.docker.enable = true;
hardware = {
graphics.enable = true;
nvidia.open = false;
nvidia-container-toolkit.enable = true;
};
services.xserver.videoDrivers = [ "nvidia" ];

For a headless machine, and:

virtualisation.docker.enable = true;
hardware = {
graphics.enable = true;
nvidia.open = false;
nvidia-container-toolkit.enable = true;
};
services.xserver = {
enable = true;
videoDrivers = [ "nvidia" ];
};

For a non-headless machine. I am trying to confirm all this because I'm working on the documentation of this feature, and I'd like to double-check that I'm not missing anything. Please, if you have the time, it would be amazing if you could confirm that the minimum configuration I showed works for you too. Thank you!
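One way to confirm the minimal config works end to end (a sketch; it assumes the daemon has CDI enabled and the spec has been generated, and the image tag is only an example) is to check that a container can see the GPU:

# If CDI is wired up correctly, this prints the same GPU table as on the host.
docker run --rm --device=nvidia.com/gpu=all nvidia/cuda:12.3.1-base-ubuntu22.04 nvidia-smi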
By the way, @ereslibre I see that nvidia-container-toolkit and libnvidia-container are outdated (the last one is two years old already). |
I'm not super interested in bumping
That said, feel free to bump
Is there a possibility of a race condition on computer boot when setting up the CDI configuration? Every time the computer restarts, the docker daemon is unable to start those with
running |
How are they being executed @leoperegrino? Please open a new issue as we are mixing a ton of different and potentially unrelated things here. If it was a race condition due to the CDI spec not being available, the error would read "unresolvable CDI devices":
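If an ordering problem like that did show up, one possible mitigation (a sketch only; it assumes the NixOS module generates the spec from a unit named nvidia-container-toolkit-cdi-generator.service, which is worth verifying with systemctl on your system) would be to order the Docker daemon after the generator:

# Make the Docker daemon wait for the CDI spec generator on boot.
systemd.services.docker = {
  after = [ "nvidia-container-toolkit-cdi-generator.service" ];
  wants = [ "nvidia-container-toolkit-cdi-generator.service" ];
};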
@ereslibre yeah, sorry for belabouring. I used
with the following snippet, and the docker daemon should start all containers according to the
@leoperegrino Please, open a new issue and we can follow up there. Thanks :)
@ereslibre Hi, sorry for bumping this thread again, but now I've got some insights related to our previous discussion. I managed to run in stable using the previously corrected
I was not able to run this way in stable before because
I noticed that without that option my
In summary, in nixos stable (

hardware = {
opengl.enable = true;
nvidia-container-toolkit.enable = true;
nvidia = {
...
};
};
services = {
xserver = {
enable = false;
videoDrivers = [ "nvidia" ];
};
};

I hope that this discussion was as helpful for you as it was for me, since you said that you are the one documenting this. It might be important to keep in mind the different options for different branches. Thanks!
Thank you for the follow-up, @leoperegrino
Hi @leoperegrino. Sorry to necropost, but can you send your final jellyfin & hardware configurations? I'm running into the exact same issue but I'm not sure what I'm missing.
@jackheuberger my final configuration is the one in my last comment, even though with that I didn't manage to solve the lack of restart. Right now I'm not using that machine, so I can't help much beyond the comments written here. Pay attention to your nixos version and which option you should use.
@leoperegrino oh, regarding the restart issue, there is #353059. I found a bug in Docker upstream, will report on #353059. Didn't open the PR to upstream Docker yet, because I am working on a small reproducer.
Describe the bug
When --runtime nvidia is passed to docker together with the -it flag, the following error is output:

Steps To Reproduce
Steps to reproduce the behavior:
Expected behavior
The container runs properly.
Notify maintainers
@averelld (author of commit that added the virtualisation.docker.enableNvidia option)
@cpcloud (nvidia-docker maintainer)
Metadata
Please run
nix-shell -p nix-info --run "nix-info -m"
and paste the result.

Add a 👍 reaction to issues you find important.