
Add GPU executor if GPU is present #5123

Merged (6 commits, Aug 3, 2021)

Conversation

mrocklin (Member)

See #5084

Comment on lines 638 to 643
try:
    import pynvml
except ImportError:
    pass
else:
    if pynvml.nvmlDeviceGetCount():
Member

We probably want to use device_get_count from distributed/diagnostics/nvml.py here

def device_get_count():
    init_once()
    if nvmlLibraryNotFound or not nvmlInitialized:
        return 0
    else:
        return pynvml.nvmlDeviceGetCount()

cc @pentschev
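
For illustration, a minimal sketch of how that suggestion could be wired up, assuming the "gpu" executor name and "Dask-GPU-Threads" prefix used elsewhere in this PR (the helper function here is hypothetical, not part of the diff):

from concurrent.futures import ThreadPoolExecutor

from distributed.diagnostics import nvml


def maybe_gpu_executor():
    # Hypothetical helper: return a single-threaded executor when nvml
    # reports at least one GPU, and None otherwise.
    if nvml.device_get_count() > 0:
        return ThreadPoolExecutor(1, thread_name_prefix="Dask-GPU-Threads")
    return None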

mrocklin (Member Author)

I'm happy to gift this PR to anyone else (@pentschev seems like a natural recipient). Mostly I wanted to make the issue in #5084 more real.

Comment on lines 2016 to 2027
"foo": ThreadPoolExecutor(1, thread_name_prefix="Dask-Foo-Threads")
},
) as w:
async with Client(s.address, asynchronous=True) as c:
futures = []
with dask.annotate(executor="default"):
futures.append(c.submit(get_thread_name, pure=False))
with dask.annotate(executor="GPU"):
with dask.annotate(executor="foo"):
futures.append(c.submit(get_thread_name, pure=False))
default_result, gpu_result = await c.gather(futures)
default_result, foo_result = await c.gather(futures)
assert "Dask-Default-Threads" in default_result
assert "Dask-GPU-Threads" in gpu_result
assert "Dask-Foo-Threads" in foo_result
Member

Note that this test was totally fine before; I'm just changing the name of the extra executor from "GPU" to "foo" to avoid any potential confusion with the "gpu" executor which is added in this PR.

@jrbourbeau (Member)

@jakirkham @pentschev @quasiben any thoughts on whether this will make Dask's GPU experience smoother?

For reference, all this PR does is spin up an extra threadpool (with a single thread) on Dask workers when the device count from pynvml is >0. This by itself won't change any of Dask's default behavior. However, it will allow users to specify in a straightforward way whether they'd like tasks to be executed on this single-threaded executor instead of on the normal worker threadpool:

with dask.annotate(executor="gpu"):
    # my normal dask code here
    # all tasks submitted in this block will run
    # on the "gpu" threadpool
    ...
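
As a slightly more complete sketch of that usage -- the Client setup and the bump function below are illustrative assumptions, not part of this PR:

import dask
from dask.distributed import Client

client = Client()  # assumes workers that detected a GPU at startup


def bump(x):
    # Stand-in for GPU-bound work; any task is routed the same way.
    return x + 1


# Tasks submitted inside the annotate block target the single-threaded
# "gpu" executor; tasks outside it use the default worker threadpool.
with dask.annotate(executor="gpu"):
    future = client.submit(bump, 41)

print(future.result())  # 42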

@quasiben (Member)

I'm not sure this makes things smoother for the case of N+1 dask-worker processes, though perhaps it does make bringing up a single-GPU workflow a bit easier. In this PR, if there are N+1 procs, all workers will be on the first GPU device (GPU 0). This can be a problem both when there is a single GPU and when there are multiple GPUs.

In the single-GPU case, multiple workers on the same GPU can lead to unexpected OOM issues. In the multi-GPU case, the user will most likely expect Dask to leverage all the GPUs. Additionally, if we make this the default, it may complicate GPU workloads when using dask-cuda-workers -- there will be two executors which can perform "gpu" tasks.

I'm still thinking through this, so these may be overly conservative opinions.

@mrocklin (Member Author)

In this PR, if there are N+1 procs, all workers will be on the first GPU device (GPU 0). This can be a problem both when there is a single GPU and when there are multiple GPUs.

I'm not sure I understand. We're not focusing at all on CUDA_VISIBLE_DEVICES or anything. All we're saying here is that probably a dask worker is only managing one GPU, and so probably it should only run one of these tasks at a time.

@mrocklin (Member Author)

Additionally, if we make this the default, it may complicate GPU workloads when using dask-cuda-workers -- there will be two executors which can perform "gpu" tasks.

I would be curious to learn more about what dask-cuda does here. Also, presumably if we select the same name then there won't be a conflict. The more sophisticated dask-cuda will just do a better job for all of the tasks with the gpu executor specified.

@quasiben (Member)

I'm not sure I understand. We're not focusing at all on CUDA_VISIBLE_DEVICES or anything. All we're saying here is that probably a dask worker is only managing one GPU, and so probably it should only run one of these tasks at a time.

Understood, do you see having multiple workers on the same node as a GPU? If so, each worker will be on the same GPU. Is that a problem for the workloads you are interested in? What would you think about renaming the executor to single-gpu, akin to the scheduler kwarg single-threaded?

I would be curious to learn more about what dask-cuda does here. Also, presumably if we select the same name then there won't be a conflict. The more sophisticated dask-cuda will just do a better job for all of the tasks with the gpu executor specified.

At the moment Dask-CUDA does not create any additional executors -- it assumes 1 process/1 thread per GPU on the default executor. That said, Dask-CUDA could take the idea outlined in this PR and create additional CPU workers in a separate threadpool. This was something brought up in rapidsai/dask-cuda#540. Doing this would mean users would specifically have to annotate CPU code rather than GPU code -- the reverse of what is being proposed here.

One last question: do you think users will be confused by having multiple ways to execute GPU code with Dask? We've spent a fair amount of time trying to educate users on optimal ways to leverage Dask and GPUs, primarily through Dask-CUDA. Do you think we will have to frequently move users back and forth between Dask-CUDA and auto-detected GPU workers depending on their workload? Re-reading this, the last question seems overly concerned, but I'm choosing to leave it in for dramatic emphasis. It's been challenging maintaining the GPU pieces and ensuring everything works (though perhaps not as smoothly as it could).

@mrocklin (Member Author)

Understood, do you see having multiple workers on the same node as a GPU? If so, each worker will be on the same GPU. Is that a problem for the workloads you are interested in?

I think that we should also solve the multi-GPU problem. I view this as a first step. This change is orthogonal to the choice of how to manage CUDA_VISIBLE_DEVICES. We should also think about that though.
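
For concreteness, the general shape of that orthogonal piece might look like the following -- a hedged sketch of pinning each worker process to one device via CUDA_VISIBLE_DEVICES, not dask-cuda's actual implementation (the helper name is hypothetical):

import os


def pin_worker_to_gpu(worker_index, n_gpus):
    # Must run before CUDA is initialized in the worker process, so that
    # co-located workers land on different devices instead of all on GPU 0.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(worker_index % n_gpus)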

That said, Dask-CUDA could take the idea outlined in this PR and create additional CPU workers in a separate threadpool. This was something brought up in rapidsai/dask-cuda#540. Doing this would mean users would specifically have to annotate CPU code rather than GPU code -- the reverse of what is being proposed here.

I think that CPUs are still more commonly used, and should probably be considered the default. I think that by adding the following to dask-cudf dataframe instances, you would automatically use a single-threaded executor:

df = new_dd_object(...)
df.annotations["executor"] = "gpu"

Then everything would, I think, work as everyone desires without the user having to specifically annotate code.
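
As a hedged sketch of what that could look like with today's public API -- annotating at graph-construction time, since a collection-level annotations attribute like the one above is hypothetical; dask.dataframe stands in for dask-cudf, and the path and column are made up:

import dask
import dask.dataframe as dd

# Layers built while the annotate block is active carry the annotation,
# so the tasks for this collection would target the "gpu" executor.
with dask.annotate(executor="gpu"):
    df = dd.read_parquet("data/*.parquet")  # illustrative path
    total = df.x.sum()  # assumes a column named "x"

result = total.compute()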

One last question: do you think users will be confused by having multiple ways to execute GPU code with Dask? We've spent a fair amount of time trying to educate users on optimal ways to leverage Dask and GPUs, primarily through Dask-CUDA.

I think that RAPIDS users know about dask-cuda, but other folks not so much. I would like to upstream what we can of Dask-CUDA so that the learnings of that project can have a wider impact. Obviously some of the changes in dask-cuda are very specific to RAPIDS work, and so those will be harder to upstream. This seems like a first step, though. I would welcome your thinking on how to upstream other lessons learned in dask-cuda.

I don't expect dask-cuda will go away (I think that the RAPIDS team needs the freedom to innovate on their own) but maybe we can improve the lives of those folks who don't use it.

@quasiben (Member)

@rjzamora and I were going to chat about auto-annotation. I think this will take some thought/exploration. In the meantime, I would say to merge this PR and we can start thinking through how to upstream other core parts of dask-cuda. I believe @charlesbluca and the ops team are close to getting dask/distributed hooked into gpuCI for GPU testing.

@jrbourbeau (Member) left a comment

In the meantime, I would say to merge this PR and we can start thinking through how to upstream other core parts of dask-cuda

That sounds great. FWIW the test added here passes on the gpuCI build. It's nice to be able to develop with a bit more certainty around how things will work on GPUs.

Will merge once CI finishes up
