Nvidia passthrough broken #4
What does the log say after "Starting jail with the following command:" when you start the jail? Also, what is the output of … and …? Thanks for testing and reporting!
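The commands being asked about are not preserved in this capture; on the TrueNAS SCALE host, the checks requested in this thread are typically along these lines (an assumption, shown only for context):

```bash
# Hypothetical host-side checks: confirm the driver is loaded and see what the nvidia tooling
# says needs to be made available inside the jail.
nvidia-smi                 # confirms the host driver is loaded and sees the GPU
nvidia-container-cli list  # lists the device nodes, binaries and libraries the driver needs
```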
Thanks! I have just updated the Python script. Could you try again, please?
Deleted the old jail and started over with a fresh new jail; getting this error when trying to start:
More error detail: there appears to be an extraneous `--bind-ro==` in the generated CLI now?
Ah, you're right, that doesn't look good. If you replace the double == with single ones and run the command directly to start the jail, does the Nvidia driver work inside the jail? If so, we know this approach will work and I should fix the double == in the code. Thanks for helping; since I don't have an Nvidia GPU I couldn't test this part :)
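For reference, a corrected flag looks like the sketch below. The paths here are illustrative only; the real command is generated by the script from whatever Nvidia files it finds on the host.

```bash
# Illustrative systemd-nspawn invocation with single "=" bind flags (paths are examples,
# not the script's actual output).
systemd-nspawn --machine=dockerjail --boot \
  --bind-ro=/usr/bin/nvidia-smi \
  --bind-ro=/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 \
  --bind=/dev/nvidia0 \
  --bind=/dev/nvidiactl
```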
Should be fixed now.
Still the same problem. I notice you changed this:
However, it still isn't appending `{file_path}`; it just outputs a blank `--bind-ro=`, and that is what stops the jail from starting. If I remove the blank line I can start the jail, however the Nvidia drivers still don't appear to work inside it. Something in the routine you have for mounting the directories (the subroutine that detects whether a path is under /dev or not) seems to be broken, but I can't see it.
More info: the problem I outline above, about the extraneous `--bind-ro==` appended to the launch string, will prevent the machine from starting, however you can edit around that since it does appear to bind all the other directories; it's just adding that blank one at the end. I am not familiar enough with Python and how it handles loops (other than that foreach is implicit), but that is likely simple to fix. The bigger issue is that it's still not passing through everything needed from the host, as the following error still happens even when I modify the startup to get the jail running:
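As a stopgap while the script is being fixed, the empty flag can be filtered out of the logged command before running it by hand. A minimal sketch; the argument list below is a placeholder for whatever the "Starting jail with the following command:" log line actually prints:

```bash
# Placeholder argument list; substitute the command logged by the script.
args=(systemd-nspawn --machine=dockerjail --boot --bind-ro=/usr/bin/nvidia-smi --bind-ro=)
cleaned=()
for arg in "${args[@]}"; do
  # Drop any bind flag that has no path after the "=".
  [[ "$arg" == "--bind-ro=" || "$arg" == "--bind=" ]] && continue
  cleaned+=("$arg")
done
"${cleaned[@]}"
```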
Empty bind-ro line should now be fixed. Thanks! What happens when you run …?
Also, please try these steps inside a fresh jail: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/nvidia-docker.html
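The linked NVIDIA page boils down to roughly the following inside the jail. Repository setup details change over time, so treat this as a sketch and follow the page for the exact steps:

```bash
# Inside the jail (Debian/Ubuntu based), roughly following NVIDIA's container toolkit guide.
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit    # assumes NVIDIA's apt repository is already configured
sudo nvidia-ctk runtime configure --runtime=docker  # writes the nvidia runtime into /etc/docker/daemon.json
sudo systemctl restart docker
```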
I think running ldconfig inside the jail might help.
Thanks Jip-Hop - the empty bind-ro line is indeed fixed (and I learned something about Python today reading your commit), however the Nvidia problems remain. Even running ldconfig in the jail, or in an Nvidia container, i.e.:
still fails with the same error:
Also, … and … As a sanity check, it does all work outside the jail; I double-checked to make sure I hadn't opened a shell on the wrong machine :)
Could you try …?
Edit: Also ran ldconfig.
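A quick way to check whether the loader can actually see the bound-in driver libraries inside the jail (a diagnostic sketch, not from the original comment):

```bash
# Run inside the jail: refresh the loader cache, then check which nvidia libraries it knows about.
ldconfig
ldconfig -p | grep -i nvidia
# If libnvidia-ml.so.1 is missing from this list, nvidia-smi and other NVML-based tools will fail.
```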
I suppose there may still be a (config) file missing in the list of files to bind mount. This shows the approach should work: … Maybe something is missing from our list?
Aha!
libnvidia-ml.so isn't being passed to the jail:
which doesn't appear to be bound in your script, looking at how the directories are enumerated, unless I am missing a piece?
And if so, search with a wildcard at the end? It is being bind mounted, but it has a different suffix...
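On the host, the differently suffixed library can be located with a wildcard, for example (illustrative paths):

```bash
# Find every variant of libnvidia-ml on the host; the versioned .so.<driver-version> file is the
# one that actually exists on disk, while tools expect the .so.1 soname.
find /usr -name 'libnvidia-ml.so*' 2>/dev/null
# Cross-check against what the driver tooling says should be mounted.
nvidia-container-cli list | grep -i libnvidia-ml
```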
Maybe I need to do something similar to this: NVIDIA/nvidia-docker#1163 (comment). Too bad this needs additional investigation...
Finds this, yes:
O.k. so I have now hard-coded the script to also mount … and I no longer get the error related to …
Has this fixed it for you?
Ah, progress :) Yes, now nvidia-smi picks it up in the jail itself, however it fails inside containers running in the jail. Looks like …
Actually, the problem is a little weirder. I run this (the standard test from the Nvidia site; I've done it dozens of times):
And get this error:
But even adding the correct directory to my PATH in .bashrc etc. doesn't fix it. Something else strange is going on here.
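For context, the standard test referred to above is along these lines (the image tag is just an example; any CUDA base image works):

```bash
# The usual NVIDIA smoke test: run nvidia-smi inside a CUDA container with GPU access.
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
```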
Probably needs to be in the PATH system-wide, not just for the current user? But nice, progress!
Something is off with it, and it has to be due to how the drivers are pulled in from the host. We might still be missing something.
When you get to the point that it works inside the jail but not in a Docker container, can you try (after having installed nvidia-docker):
Just tried the latest update to test the nvidia part, and am also getting errors starting it. The config file looks fine.
I can run nvidia-smi:
and
so it is there, just not being picked up? I can't see if anything is passed through to the jail itself, as I can't get it running.
Inside the jail, please follow the official steps to get nvidia working with Docker: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/nvidia-docker.html. That would also set up the daemon.json file with nvidia settings. Then please run:
Looking forward to hearing how that goes.
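To confirm the daemon.json changes took effect inside the jail, checks along these lines work (a sketch; the exact file contents depend on the toolkit version):

```bash
cat /etc/docker/daemon.json    # should contain a "runtimes": { "nvidia": ... } entry after configuration
docker info | grep -i runtime  # "nvidia" should be listed among the available runtimes
sudo systemctl restart docker  # only needed if daemon.json was just (re)written
```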
Any chance you could post the logs of the jailmaker script when it is starting dockerjail after a reboot? You may need to redirect the output somewhere with … Or you could temporarily disable the startup script and run jlmkr manually after the reboot. I'm tempted to just call …
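One way to capture the startup output is to redirect it in the post-init command itself. A sketch, assuming the script lives at /mnt/tank/jailmaker/jlmkr.py and the jail is named dockerjail (both assumptions):

```bash
# Hypothetical post-init command: start the jail and append all script output to a log file.
/mnt/tank/jailmaker/jlmkr.py start dockerjail >> /mnt/tank/jailmaker/jlmkr-start.log 2>&1
```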
Sure, no problem. Here is the load log:
There is no nvidia stuff in there. And here is the log after stopping and starting; in between I ran nvidia-smi and nvidia-container-cli list:
My post-init command is as follows:
Unfortunately, I won't be able to do a lot more testing for the next week, as I'm packing up the PCs soon and moving. Hopefully by next Thursday I'll have most of the stuff up and running so I can do more testing.
Thanks @Talung. Was that with the latest script? I was expecting to see "No nvidia GPU seems to be present... Skip passthrough of nvidia GPU." in the first case. But I think it's clear that …
Yes. This morning I read all the other posts, then ran the update and did a reboot. So unless the script has changed in the last 7 hours, that should be the latest script. Maybe add a little version number to the output so we can confirm that sort of thing. Also, a "run log" where you store the config would be good for debugging. Just suggestions. :)
Thanks @Talung. Versioning has started. We're at v0.0.1. If anyone could test the following sequence:
Was the jail started with nvidia GPU passthrough working (without manually running nvidia-smi or modprobe)?
Did you change anything else in the script besides the versioning? I was going through those tests you suggested: got the latest script (with version numbers), disabled the post-init run (but actually I didn't, because I didn't hit the save button) and rebooted. Did the whole setup:
And then I noticed I had an email from watchtower, which is when I realised I hadn't saved the "disabled" change. However, this time the GPU initialised on boot. Here is the log:
Looking at the commit history, I see some other changes were made, and whatever it was, it seems to have worked. EDIT: for funsies I did another reboot and guess what... GPU was in the jail again!
Sounds good! Thanks @Talung. Yes, I did more than increment the version number hehe ^^
This looks good, as it detected the nvidia GPU straight after reboot thanks to … And then you did another reboot and it ran … So it seems to be working now!?
Well, if working means that I did 2 reboots and the GPU came up both times without issue in a jail with GPU passthrough, then I would say: "Yes, it is working!" Well done!
Grabbed the latest script and just tried a reboot myself, and I'm definitely running into the linked issue. Everything 'seemed' to be working (nvidia-smi ran successfully in the host, jail, and container), but Plex refused to do HW transcoding. I also tried a tensorflow docker container and my GPU wasn't listed. After poking around a while, I discovered that I didn't have /dev/nvidia-uvm. I stopped the jail and ran the mknod commands for /dev/nvidia-uvm and /dev/nvidia-uvm-tools (along the lines of the sketch below):
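The exact commands aren't preserved in this capture; NVIDIA's driver documentation describes creating these nodes roughly as follows (the major number is read from /proc/devices at runtime, and the minor values shown are the conventional ones):

```bash
# Create the nvidia-uvm device nodes by hand, per NVIDIA's driver documentation.
/sbin/modprobe nvidia-uvm
D=$(grep nvidia-uvm /proc/devices | awk '{print $1}')
mknod -m 666 /dev/nvidia-uvm c "$D" 0
mknod -m 666 /dev/nvidia-uvm-tools c "$D" 1   # minor 1 is the customary value for the -tools node
```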
Then re-started the jail, and transcoding in Plex worked! Tried the tensorflow container again and it listed my GPU. So it seems like 'something' is still missing to get the nvidia-uvm device created.
Probably worth noting that I'm on TrueNAS SCALE 22.12.1. It seems that nvidia-modprobe doesn't work because the modules are named …
But nvidia-modprobe is hard-coded to use … I did get nvidia-modprobe to do the right thing by creating a symbolic link and running depmod:
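The exact commands aren't captured here; a sketch of that workaround, with module paths as assumptions based on Debian's nvidia-current packaging (adjust to whatever `find /lib/modules -name 'nvidia*uvm*'` reports):

```bash
# Give the Debian-named module the name nvidia-modprobe expects, then rebuild module metadata.
KVER=$(uname -r)
ln -s /lib/modules/"$KVER"/updates/dkms/nvidia-current-uvm.ko \
      /lib/modules/"$KVER"/updates/dkms/nvidia-uvm.ko
depmod -a
nvidia-modprobe -u -c0   # should now load the module and create /dev/nvidia-uvm
```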
After that, /dev/nvidia-uvm exists. Since the mknod commands are documented by nvidia, that solution feels a bit less 'hacky'.
@TrueJournals I've had to have the following running as a pre-init command for at least the past 2 Scale releases:
in order to keep the situation you are seeing from happening. That was true even when I was running Docker off the Scale host itself. I've had it in there ever since and I haven't had the problem you are seeing. I think @Jip-Hop added it to the script as well, but I believe it is something that needs to happen pre-init if you want your Nvidia GPU to reliably show up in Scale. It has something to do with how IX Systems won't load it unless called upon, to eliminate boot logging errors. The K3S-backed app system handles it behind the scenes when it is used; we need to do it manually.
Thanks for that tip @Ixian! Looks like that will do it. Quick log from boot (without any special init):
Running … Looks like the most recent commit removed the modprobe in favor of just running … So, I guess this is the answer for the TODO, @Jip-Hop --
@Jip-Hop I just went through the latest script (0.0.1, and thanks for adding versioning) and I think it's really coming together. I like the changes, and I learned a few new things about Python too, so thanks :) I'm using 0.0.1 now and so far so good: I've gone through multiple reboot tests and everything launches cleanly and my GPU works; I'm able to use HW transcoding in Plex and Tdarr (tested both after each reboot). Haven't seen any other problems (performance, etc.) yet, but I'll keep an eye on things. I think I'm ready to switch over to this full-time vs. running Docker directly on the host. Famous last words, but: fingers crossed :)
Yep, I just saw he removed it as well, BUT I think that's fine. I am pretty certain the correct point during boot to load the modules is pre-init, so probably just an instruction to add it as a pre-init command is enough. That's what we did when we first started running DIY Docker with Scale. Here's a screenshot, @Jip-Hop, if you want to add it to the readme:
With the pre-init script, things are working -- but it looks like …
I decided to just add nvidia-smi to my pre-init command. I also thought it might be a good idea to run modprobe-nvidia regardless of whether the … I also changed it to detect the path to modprobe instead of relying on PATH or on a hard-coded path. Probably not necessary, but I found it interesting. So, my final pre-init command is:
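The exact command isn't preserved in this capture; the following is a hedged reconstruction based on the description above, with names, paths, and ordering all as assumptions:

```bash
# Hypothetical pre-init command: find modprobe, load the uvm module, then run nvidia-smi once
# so the /dev/nvidia* nodes get created before any jail starts.
MODPROBE=$(command -v modprobe || echo /usr/sbin/modprobe)
"$MODPROBE" nvidia-current-uvm 2>/dev/null || "$MODPROBE" nvidia-uvm
nvidia-smi > /dev/null
```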
I had no idea it would take 5 days and about 100 comments to get nvidia passthrough working >.< Updated the script to v0.0.2. I removed some code I think we no longer need, as long as the … Would be great if you could run through the testing sequence again (and run whatever additional tests you think are relevant). If this works I'll add documentation regarding the pre-init command.
P.S. @TrueJournals if you have an idea how to run …
Alright, you got me curious ;) I dug into this, because I was curious how nvidia handled it. So I dug through libnvidia-container and container-toolkit. Here's what I can tell...
TLDR: They find all unique folders from …
nvidia has a hard-coded list of libraries in libnvidia-container. Actually, this is multiple lists, depending on what capabilities you want in the container. In order to find the full path to these libraries, they parse the ldcache file directly to turn the short library names into full paths. You can see that also in find_library_paths.
Over in container-toolkit (which contains the 'hooks' for when containers are created or whatever), there's code to get a list of libraries from "mounts" (a little unclear what these mounts are -- assuming mounts on the container?) by matching paths against …
Finally, they can create a file in /etc/ld.so.conf.d with a random name that lists all these folders and run ldconfig. It looks like this happens outside the container itself by using the …
Now, what I'm still a little confused by is that I don't actually see this happening in my docker container. What's also weird is that libraries show up like this:
Even though that library is located at … Anyway, the logic of 'discover the paths based on the list of libraries' seems reasonable enough. You could even run …
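The suggested discovery could look roughly like this on the host (a sketch: it simply resolves the libraries the toolkit reports, or the ld cache entries, into the set of directories to bind-mount):

```bash
# Derive the set of host directories that hold nvidia libraries, to be bind-mounted read-only.
nvidia-container-cli list | grep '\.so' | xargs -r -n1 dirname | sort -u
# Alternative: resolve nvidia libraries through the ld cache directly.
ldconfig -p | grep -i 'nvidia\|libcuda' | awk '{print $NF}' | xargs -r -n1 dirname | sort -u
```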
Thanks for digging into this :)
Well, then I will no longer feel bad for writing that file :') Using the output of … By the way, how is v0.0.2 for you? :)
Just tried v0.0.2 and it seems to work fine (I can only reboot my server so many times in a day 😆). Also sent you a PR to implement the above suggestion of discovering library paths based on the output of …
Updated to 0.0.3, rebooted, all working, Plex HW transcoding working. Question: do we need to re-generate a new jail with each version, i.e. has the CLI launch command in the config file changed? I'm still testing with the jail I created with 0.0.1.
Nice! The debugging we did with the script may have left some residual files (symlinks, empty folders), so recreating may not be a bad idea. But in general my intention is that there should not be a need to regenerate a jail when using a newer version of the script.
I'm happy to close this now if you want, I think we've gotten it.
100 comments and closed! 🎉
Can confirm this is working for me and my Emby is HW transcoding now. Thanks!
I've done some refactoring. Would be interested in knowing if Nvidia passthrough still works with the latest version. Anyone care to test?
Sorry @Jip-Hop, I have moved my entire system back over to Proxmox and am running ZFS and Docker natively on that. TrueNAS became too much of a pain in the arse to get around their crap. Apologies for not being able to help.
It works for me, thanks.
Getting this error:
Looks like everything might not be getting passed through.