Disabling GPU diagnostics prevents GPU executor from getting created #8338
I can speak for the NVML diagnostics side, and we intentionally chose to decide whether to enable diagnostics or not when
We could separate all of those things, but that adds a lot of boilerplate code that is easy to miss; even just having to test both whether NVML is enabled and what the device count is has, from previous experience, been easy enough to introduce unintended consequences. With that said, so far we have assumed that if GPU diagnostics are disabled, everything GPU-related is intended to be disabled. I wouldn't mind making changes so that we can still query whether GPUs are available even when the NVML toggle is off, if that is useful for you. I was not aware of the existence of
@pentschev: Thanks for the additional context!
That generally makes a lot of sense to me.
This makes less sense to me. Shouldn't it instead be the other way around? If GPUs are generally enabled/disabled, then diagnostics should follow suit (possibly with the additional ability to turn off diagnostics either way).
It was added a while ago (#5123), so I assumed that this was used by something. A more general question: are NVML diagnostics independent of the system monitor? If not, I'm wondering if we should move the toggle there to be close to the other system monitor toggles.
Yes, I would agree with that, but Distributed had no GPU code (at least to my knowledge) other than monitoring, thus we were making the assumption in the opposite direction (diagnostics disabled == intent to have no GPU support).
I don't mind moving the toggle; I think it was introduced in the NVML diagnostics module just to be as unintrusive as possible. If you see benefit in having it as part of the system monitor, I'm +1 on that.
Alright, I'll take another look at what distributed's NVML module does in detail to see if my proposal makes sense.
cc @dask/gpu for visibility
For context, Matt suggested in #5084 that we add an annotation to all RAPIDS layers such that the tasks are scheduled on this magical thread pool (#5084 (comment)). Given this conversation, I suspect this never happened? Would anybody be upset if we removed that?
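For reference, the kind of annotation being described would presumably have looked something like the sketch below on the RAPIDS side, routing tasks to the worker's "gpu" executor; this is just an illustration of the idea, not code that was ever added:

```python
# Sketch of the proposed annotation: schedule these tasks on the worker's
# single-threaded "gpu" executor rather than the default thread pool.
import dask
import dask.array as da

x = da.random.random((1000, 1000))
with dask.annotate(executor="gpu"):
    y = (x + 1).sum()
# y.compute() would then run the annotated tasks on the GPU executor,
# assuming a distributed cluster where that executor exists.
```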
I had completely forgotten about those discussions, sorry about that. I don't think that was ever implemented; personally, I failed to see the benefits and use cases of doing that, so it was probably never touched in RAPIDS/Dask-CUDA. No objections from me on removing it to reduce maintenance burden.
After filing #8337, I ran across the `distributed.diagnostics.nvml` setting, which seems to be meant to serve the same purpose of toggling diagnostics on GPUs. However, after taking a deeper look, it seems as though this setting has some unintended consequences and disables the GPU executor altogether.

In the `distributed.diagnostics.nvml` module, the setting is used within `init_once` to decide whether to initialize `pynvml` on a process (distributed/distributed/diagnostics/nvml.py, lines 89 to 93 in 0dc9e88).
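Roughly, the gating works like the following sketch (paraphrased for illustration; names like `_nvml_initialized` are simplified stand-ins, not the exact identifiers used in the module):

```python
# Simplified sketch of the toggle check in init_once (not verbatim source).
import dask

_nvml_initialized = False  # simplified stand-in for the module's real state flag

def init_once():
    global _nvml_initialized
    if _nvml_initialized:
        return
    # The config toggle short-circuits NVML setup entirely:
    if dask.config.get("distributed.diagnostics.nvml") is False:
        return
    import pynvml
    pynvml.nvmlInit()
    _nvml_initialized = True
```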
The initialization is then checked within `device_get_count()`, which returns 0 if the toggle is turned off (distributed/distributed/diagnostics/nvml.py, lines 125 to 128 in 0dc9e88).
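Again as a rough sketch, not the exact source, the effect is:

```python
# Simplified sketch of device_get_count (not verbatim source).
def device_get_count():
    init_once()
    if not _nvml_initialized:
        # With the toggle off, callers cannot tell "no GPUs present"
        # apart from "NVML disabled by configuration".
        return 0
    import pynvml
    return pynvml.nvmlDeviceGetCount()
```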
On the worker, this function is then used to decide whether to initialize the GPU executor (distributed/distributed/worker.py, lines 639 to 642 in 0dc9e88).
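In other words, the check amounts to something like this (illustrative sketch; the real code lives in `Worker.__init__` and the executor name or thread prefix may differ):

```python
# Simplified sketch of the worker-side check (not verbatim source).
from concurrent.futures import ThreadPoolExecutor
from distributed.diagnostics import nvml

executors = {}
if nvml.device_get_count() > 0:
    # Never reached when the diagnostics toggle is off, because
    # device_get_count() then returns 0 regardless of the actual hardware.
    executors["gpu"] = ThreadPoolExecutor(1, thread_name_prefix="Dask-GPU-Threads")
```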
I'm not an expert on this particular area of code, but I think that we should separate the initialization of `pynvml` from the diagnostics toggle. We may also want to consider moving the diagnostics toggle into the system monitor (akin to #8337) to have a conceptual grouping.
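One possible shape for that separation, purely as a sketch (none of these helper names exist in distributed today), would be to keep a GPU-availability check that ignores the toggle and let the toggle control only the periodic NVML polling:

```python
# Hypothetical sketch of decoupling GPU availability from the diagnostics toggle.
# Neither helper exists in distributed; names are illustrative only.
import dask

def gpu_available():
    """Report whether any GPU is visible, regardless of the diagnostics toggle."""
    try:
        import pynvml
        pynvml.nvmlInit()
        return pynvml.nvmlDeviceGetCount() > 0
    except Exception:
        return False

def diagnostics_enabled():
    """Only the periodic NVML polling would honour the config toggle."""
    return bool(dask.config.get("distributed.diagnostics.nvml"))
```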