Disabling GPU diagnostics prevents GPU executor from getting created #8338

Closed
hendrikmakait opened this issue Nov 13, 2023 · 7 comments · Fixed by #8399
Labels: bug (Something is broken)

@hendrikmakait (Member) commented Nov 13, 2023

After filing #8337, I ran across the distributed.diagnostics.nvml setting, which appears to serve the same purpose of toggling GPU diagnostics. However, after taking a deeper look, it seems that this setting has some unintended consequences and disables the GPU executor altogether:

In distributed.diagnostics.nvml, the setting is used within init_once() to decide whether to initialize pynvml in a process:

elif NVML_STATE == NVMLState.UNINITIALIZED and not dask.config.get(
    "distributed.diagnostics.nvml"
):
    NVML_STATE = NVMLState.DISABLED_CONFIG
    return
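
For reference, this is a standard Dask config key, so it can be flipped from user code or the environment; a minimal example using the plain dask.config API (nothing here is specific to the fix discussed in this issue):

import dask

# Turn NVML diagnostics off; on the next call to init_once() the
# DISABLED_CONFIG branch above is taken and pynvml is never initialized.
dask.config.set({"distributed.diagnostics.nvml": False})
# Equivalent environment variable: DASK_DISTRIBUTED__DIAGNOSTICS__NVML="False"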

The initialization is then checked within device_get_count(), which returns 0 if the toggle is turned off:

def device_get_count():
    init_once()
    if not is_initialized():
        return 0
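
Put together, that means the reported device count is 0 whenever the toggle is off, even on a machine with GPUs. A small sketch of the observable behavior (assuming init_once() has not already run with the toggle enabled, since the NVML state is cached per process):

import dask
from distributed.diagnostics import nvml

with dask.config.set({"distributed.diagnostics.nvml": False}):
    # Reports 0 regardless of how many GPUs are actually present, because
    # is_initialized() is False in the DISABLED_CONFIG state.
    assert nvml.device_get_count() == 0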

On the worker, this function is then used to decide whether to initialize the GPU executor:

if nvml.device_get_count() > 0:
    self.executors["gpu"] = ThreadPoolExecutor(
        1, thread_name_prefix="Dask-GPU-Threads"
    )
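
(For context on how this executor would be reached at all: as far as I know, tasks are routed to a named worker executor via a task annotation, so the branch above only matters for graphs annotated accordingly. A hedged sketch, assuming the "executor" annotation and a running distributed Client:)

import dask

@dask.delayed
def bump(x):
    return x + 1  # stand-in for actual GPU work

# Low-level graph optimization can drop annotations, hence optimize_graph=False.
with dask.annotate(executor="gpu"):
    result = bump(1).compute(optimize_graph=False)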

I'm not an expert on this particular area of code, but I think we should separate the initialization of pynvml from the diagnostics toggle. We may also want to consider moving the diagnostics toggle into the system monitor (akin to #8337) to keep these settings conceptually grouped.
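
To make that concrete, here is a rough sketch of the kind of separation meant here (all names are hypothetical, not an actual patch): pynvml initialization would depend only on availability, while the config key would only gate metric collection.

import dask


def nvml_available() -> bool:
    """Can pynvml be imported and the NVML library be initialized?"""
    try:
        import pynvml

        pynvml.nvmlInit()
        return True
    except Exception:
        return False


def diagnostics_enabled() -> bool:
    """Config toggle: should NVML metrics be collected and reported?"""
    return bool(dask.config.get("distributed.diagnostics.nvml"))


def device_get_count() -> int:
    """GPU count, independent of the diagnostics toggle."""
    if not nvml_available():
        return 0
    import pynvml

    return pynvml.nvmlDeviceGetCount()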

@pentschev (Member) commented

I can speak for the NVML diagnostics side, and we intentionally chose to gate whether to enable diagnostics on device_get_count() == 0. That check may indeed cause some confusion, as it packs more information than its name suggests:

  1. Is the NVML diagnostics toggle turned on?
  2. Is pynvml installed?
  3. Are NVML resources actually available (e.g., shared libraries)?
  4. And finally, are there GPUs installed and visible?

We could separate all of those checks, but that adds a ton of boilerplate code that is easy to miss; from previous experience, even just having to test both whether NVML is enabled and what the device count is makes it easy to introduce unintended consequences. With that said, so far we have assumed that if GPU diagnostics are disabled, everything GPU-related is intended to be disabled.

I wouldn't mind making changes so that we can still query whether GPUs are available even if the NVML toggle is off, if that is useful to you. I was not aware of the Dask-GPU-Threads executor and I'm not familiar with its purpose; in Dask-CUDA we do not use it, even though we rely heavily on the Distributed NVML diagnostics module. With that said, I'm fine with having another path that keeps Dask-GPU-Threads enabled while NVML diagnostics are switched off. I would still like to retain the current behavior of device_get_count() == 0 when NVML diagnostics are switched off, though; we could perhaps introduce a new function, or a bool argument to device_get_count(), that still returns the device count independent of the NVML diagnostics toggle.
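
Concretely, such an escape hatch might look roughly like this (the keyword name is a placeholder, not an actual API):

# Placeholder sketch: a variant of device_get_count() that can bypass the
# diagnostics toggle. In the real module this would sit next to init_once();
# here it is written standalone for illustration.
def device_get_count(ignore_config: bool = False) -> int:
    import dask

    if not ignore_config and not dask.config.get("distributed.diagnostics.nvml"):
        # Preserve today's behavior: diagnostics off -> report 0 devices.
        return 0
    try:
        import pynvml

        pynvml.nvmlInit()
        return pynvml.nvmlDeviceGetCount()
    except Exception:
        return 0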

@hendrikmakait (Member, Author) commented

@pentschev: Thanks for the additional context!

I can speak for the NVML diagnostics side, and we intentionally chose to gate whether to enable diagnostics on device_get_count() == 0

That generally makes a lot of sense to me.

With that said, so far we have assumed that if GPU diagnostics are disabled, everything GPU-related is intended to be disabled.

This makes less sense to me. Shouldn't it instead be the other way around? If GPUs are generally enabled/disabled, then diagnostics should follow suit (possibly with the additional ability to turn off diagnostics either way).

I was not aware of the Dask-GPU-Threads executor and I'm not familiar with its purpose; in Dask-CUDA we do not use it, even though we rely heavily on the Distributed NVML diagnostics module.

It was added a while ago (#5123), so I assumed that this was used by something.
Now I'm wondering if this is actually the case. If not, should we throw it out again? @mrocklin: Do you know more?

A more general question: Are NVML diagnostics independent of the system monitor? If not, I'm wondering if we should move the toggle there to be close to the other system monitor toggles distributed.admin.system-monitor.{disk|host-cpu|gil}.
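
For reference, the existing keys and the hypothetical new location side by side (the gpu key below is made up for illustration; today the toggle lives at distributed.diagnostics.nvml):

import dask

# Existing system-monitor toggles mentioned above:
dask.config.get("distributed.admin.system-monitor.disk")
dask.config.get("distributed.admin.system-monitor.host-cpu")
dask.config.get("distributed.admin.system-monitor.gil")

# Hypothetical new home for the NVML toggle:
# dask.config.get("distributed.admin.system-monitor.gpu")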

@pentschev (Member) commented

This makes less sense to me. Shouldn't it instead be the other way around? If GPUs are generally enabled/disabled, then diagnostics should follow suit (possibly with the additional ability to turn off diagnostics either way).

Yes, I would agree with that, but Distributed had no GPU code (at least to my knowledge) other than monitoring, so we made the assumption in the opposite direction (diagnostics disabled == intent to have no GPU support).

A more general question: Are NVML diagnostics independent of the system monitor? If not, I'm wondering if we should move the toggle there to be close to the other system monitor toggles distributed.admin.system-monitor.{disk|host-cpu|gil}.

I don't mind moving the toggle. I think it was introduced in the NVML diagnostics module just to be as unintrusive as possible; if you see benefit in having it as part of the system monitor, I'm +1 for that.

@hendrikmakait (Member, Author) commented Nov 13, 2023

I don't mind moving the toggle. I think it was introduced in the NVML diagnostics module just to be as unintrusive as possible; if you see benefit in having it as part of the system monitor, I'm +1 for that.

Alright, I'll take another look at what distributed's NVML module does in detail to see if my proposal makes sense.

@jacobtomlinson (Member) commented

cc @dask/gpu for visibility

@fjetter (Member) commented Dec 12, 2023

For context, Matt suggested in #5084 adding an annotation to all RAPIDS layers so that their tasks are scheduled on this magical thread pool:

#5084 (comment)
#5084 (comment)

Given this conversation, I suspect this never happened? Would anybody be upset if we removed that?

@hendrikmakait mentioned this issue Dec 12, 2023
@pentschev (Member) commented

I had completely forgotten about those discussions, sorry about that. I don't think that was ever done; personally, I failed to see the benefits and use cases of doing it, so it was probably never touched in RAPIDS/Dask-CUDA. No objections from me to removing it to reduce the maintenance burden.
