Disabling GPU diagnostics prevents GPU executor from getting created #8338

Closed
hendrikmakait opened this issue Nov 13, 2023 · 7 comments · Fixed by #8399
Labels: bug (Something is broken)

@hendrikmakait (Member) commented Nov 13, 2023

After filing #8337, I ran across the distributed.diagnostics.nvml setting, which appears to serve the same purpose of toggling GPU diagnostics. However, after taking a deeper look, it seems that this setting has some unintended consequences and disables the GPU executor altogether:

In distributed.diagnostics.nvml, the setting is used within init_once() to decide whether to initialize pynvml in a process:

elif NVML_STATE == NVMLState.UNINITIALIZED and not dask.config.get(
    "distributed.diagnostics.nvml"
):
    NVML_STATE = NVMLState.DISABLED_CONFIG
    return
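
For reference, this is a standard Dask config key, so it can be flipped from user code or the environment; a minimal example using the plain dask.config API (nothing here is specific to the fix discussed in this issue):

import dask

# Turn NVML diagnostics off; on the next call to init_once() the
# DISABLED_CONFIG branch above is taken and pynvml is never initialized.
dask.config.set({"distributed.diagnostics.nvml": False})
# Equivalent environment variable: DASK_DISTRIBUTED__DIAGNOSTICS__NVML="False"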

The initialization is then checked within device_get_count(), which returns 0 if the toggle is turned off:

def device_get_count():
    init_once()
    if not is_initialized():
        return 0
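
Put together, that means the reported device count is 0 whenever the toggle is off, even on a machine with GPUs. A small sketch of the observable behavior (assuming init_once() has not already run with the toggle enabled, since the NVML state is cached per process):

import dask
from distributed.diagnostics import nvml

with dask.config.set({"distributed.diagnostics.nvml": False}):
    # Reports 0 regardless of how many GPUs are actually present, because
    # is_initialized() is False in the DISABLED_CONFIG state.
    assert nvml.device_get_count() == 0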

On the worker, this function is then used to decide whether to initialize the GPU executor:

if nvml.device_get_count() > 0:
    self.executors["gpu"] = ThreadPoolExecutor(
        1, thread_name_prefix="Dask-GPU-Threads"
    )
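
(For context on how this executor would be reached at all: as far as I know, tasks are routed to a named worker executor via a task annotation, so the branch above only matters for graphs annotated accordingly. A hedged sketch, assuming the "executor" annotation and a running distributed Client:)

import dask

@dask.delayed
def bump(x):
    return x + 1  # stand-in for actual GPU work

# Low-level graph optimization can drop annotations, hence optimize_graph=False.
with dask.annotate(executor="gpu"):
    result = bump(1).compute(optimize_graph=False)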

I'm not an expert on this particular area of code, but I think we should separate the initialization of pynvml from the diagnostics toggle. We may also want to consider moving the diagnostics toggle into the system monitor (akin to #8337) to keep these settings conceptually grouped.
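
To make that concrete, here is a rough sketch of the kind of separation meant here (all names are hypothetical, not an actual patch): pynvml initialization would depend only on availability, while the config key would only gate metric collection.

import dask


def nvml_available() -> bool:
    """Can pynvml be imported and the NVML library be initialized?"""
    try:
        import pynvml

        pynvml.nvmlInit()
        return True
    except Exception:
        return False


def diagnostics_enabled() -> bool:
    """Config toggle: should NVML metrics be collected and reported?"""
    return bool(dask.config.get("distributed.diagnostics.nvml"))


def device_get_count() -> int:
    """GPU count, independent of the diagnostics toggle."""
    if not nvml_available():
        return 0
    import pynvml

    return pynvml.nvmlDeviceGetCount()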

@pentschev (Member) commented

I can speak for the NVML diagnostics side, and we intentionally chose to gate whether to enable diagnostics on device_get_count() == 0. That check may indeed cause some confusion, as it packs more information than its name suggests:

  1. Is the NVML diagnostics toggle turned on?
  2. Is pynvml installed?
  3. Are NVML resources actually available (e.g., shared libraries)?
  4. And finally, are there GPUs installed and visible?

We could separate all of those checks, but that adds a ton of boilerplate code that is easy to miss; from previous experience, even just having to test both whether NVML is enabled and what the device count is makes it easy to introduce unintended consequences. With that said, so far we have assumed that if GPU diagnostics are disabled, everything GPU-related is intended to be disabled.

I wouldn't mind making changes so that we can still query whether GPUs are available even if the NVML toggle is off, if that is useful to you. I was not aware of the Dask-GPU-Threads executor and I'm not familiar with its purpose; in Dask-CUDA we do not use it, even though we rely heavily on the Distributed NVML diagnostics module. With that said, I'm fine with having another path that keeps Dask-GPU-Threads enabled while NVML diagnostics are switched off. I would still like to retain the current behavior of device_get_count() == 0 when NVML diagnostics are switched off, though; we could perhaps introduce a new function, or a bool argument to device_get_count(), that still returns the device count independent of the NVML diagnostics toggle.
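
Concretely, such an escape hatch might look roughly like this (the keyword name is a placeholder, not an actual API):

# Placeholder sketch: a variant of device_get_count() that can bypass the
# diagnostics toggle. In the real module this would sit next to init_once();
# here it is written standalone for illustration.
def device_get_count(ignore_config: bool = False) -> int:
    import dask

    if not ignore_config and not dask.config.get("distributed.diagnostics.nvml"):
        # Preserve today's behavior: diagnostics off -> report 0 devices.
        return 0
    try:
        import pynvml

        pynvml.nvmlInit()
        return pynvml.nvmlDeviceGetCount()
    except Exception:
        return 0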

@hendrikmakait (Member, Author) commented

@pentschev: Thanks for the additional context!

I can speak for the NVML diagnostics side, and we intentionally chose to gate whether to enable diagnostics on device_get_count() == 0

That generally makes a lot of sense to me.

With that said, so far we have assumed that if GPU diagnostics are disabled, everything GPU-related is intended to be disabled.

This makes less sense to me. Shouldn't it instead be the other way around? If GPUs are generally enabled/disabled, then diagnostics should follow suit (possibly with the additional ability to turn off diagnostics either way).

I was not aware of the Dask-GPU-Threads executor and I'm not familiar with its purpose; in Dask-CUDA we do not use it, even though we rely heavily on the Distributed NVML diagnostics module.

It was added a while ago (#5123), so I assumed that this was used by something.
Now I'm wondering if this is actually the case. If not, should we throw it out again? @mrocklin: Do you know more?

A more general question: Are NVML diagnostics independent of the system monitor? If not, I'm wondering if we should move the toggle there to be close to the other system monitor toggles distributed.admin.system-monitor.{disk|host-cpu|gil}.
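
For reference, the existing keys and the hypothetical new location side by side (the gpu key below is made up for illustration; today the toggle lives at distributed.diagnostics.nvml):

import dask

# Existing system-monitor toggles mentioned above:
dask.config.get("distributed.admin.system-monitor.disk")
dask.config.get("distributed.admin.system-monitor.host-cpu")
dask.config.get("distributed.admin.system-monitor.gil")

# Hypothetical new home for the NVML toggle:
# dask.config.get("distributed.admin.system-monitor.gpu")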

@pentschev (Member) commented

This makes less sense to me. Shouldn't it instead be the other way around? If GPUs are generally enabled/disabled, then diagnostics should follow suit (possibly with the additional ability to turn off diagnostics either way).

Yes, I would agree with that, but Distributed had no GPU code (at least to my knowledge) other than monitoring, so we made the assumption in the opposite direction (diagnostics disabled == intent to have no GPU support).

A more general question: Are NVML diagnostics independent of the system monitor? If not, I'm wondering if we should move the toggle there to be close to the other system monitor toggles distributed.admin.system-monitor.{disk|host-cpu|gil}.

I don't mind moving the toggle. I think it was introduced in the NVML diagnostics module just to be as unintrusive as possible; if you see benefit in having it as part of the system monitor, I'm +1 for that.

@hendrikmakait (Member, Author) commented Nov 13, 2023

I don't mind moving the toggle. I think it was introduced in the NVML diagnostics module just to be as unintrusive as possible; if you see benefit in having it as part of the system monitor, I'm +1 for that.

Alright, I'll take another look at what distributed's NVML module does in detail to see if my proposal makes sense.

@jacobtomlinson (Member) commented

cc @dask/gpu for visibility

@fjetter (Member) commented Dec 12, 2023

For context, Matt suggested in #5084 adding an annotation to all RAPIDS layers so that their tasks are scheduled on this magical thread pool:

#5084 (comment)
#5084 (comment)

Given this conversation, I suspect this never happened? Would anybody be upset if we removed that?

@hendrikmakait mentioned this issue Dec 12, 2023
@pentschev (Member) commented

I had completely forgotten about those discussions, sorry about that. I don't think that was ever done; personally, I failed to see the benefits and use cases of doing it, so it was probably never touched in RAPIDS/Dask-CUDA. No objections from me to removing it to reduce the maintenance burden.
