
Add GPU executor if GPU is present #5123

Merged (6 commits, Aug 3, 2021)

Conversation

mrocklin (Member)

See #5084

Comment on lines 638 to 643
try:
    import pynvml
except ImportError:
    pass
else:
    if pynvml.nvmlDeviceGetCount():
Member

We probably want to use device_get_count from distributed/diagnostics/nvml.py here

def device_get_count():
    init_once()
    if nvmlLibraryNotFound or not nvmlInitialized:
        return 0
    else:
        return pynvml.nvmlDeviceGetCount()

cc @pentschev
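
For illustration, a minimal sketch of how that suggestion could be wired up, assuming the "gpu" executor name and "Dask-GPU-Threads" prefix used elsewhere in this PR (the helper function here is hypothetical, not part of the diff):

from concurrent.futures import ThreadPoolExecutor

from distributed.diagnostics import nvml


def maybe_gpu_executor():
    # Hypothetical helper: return a single-threaded executor when nvml
    # reports at least one GPU, and None otherwise.
    if nvml.device_get_count() > 0:
        return ThreadPoolExecutor(1, thread_name_prefix="Dask-GPU-Threads")
    return None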

mrocklin (Member Author)

I'm happy to gift this PR to anyone else (@pentschev seems like a natural recipient). Mostly I wanted to make the issue in #5084 more real.

Comment on lines 2016 to 2027
"foo": ThreadPoolExecutor(1, thread_name_prefix="Dask-Foo-Threads")
},
) as w:
async with Client(s.address, asynchronous=True) as c:
futures = []
with dask.annotate(executor="default"):
futures.append(c.submit(get_thread_name, pure=False))
with dask.annotate(executor="GPU"):
with dask.annotate(executor="foo"):
futures.append(c.submit(get_thread_name, pure=False))
default_result, gpu_result = await c.gather(futures)
default_result, foo_result = await c.gather(futures)
assert "Dask-Default-Threads" in default_result
assert "Dask-GPU-Threads" in gpu_result
assert "Dask-Foo-Threads" in foo_result
Member

Note that this test was totally fine before; I'm just changing the name of the extra executor from "GPU" to "foo" to avoid any potential confusion with the "gpu" executor which is added in this PR.

@jrbourbeau (Member)

@jakirkham @pentschev @quasiben any thoughts on whether this will make Dask's GPU experience smoother?

For reference, all this PR does is spin up an extra threadpool (with a single thread) on Dask workers when the device count from pynvml is >0. This by itself won't change any of Dask's default behavior. However, it will allow users to specify in a straightforward way whether they'd like tasks to be executed on this single-threaded executor instead of on the normal worker threadpool:

with dask.annotate(executor="gpu"):
    # my normal dask code here
    # all tasks submitted in this block will run
    # on the "gpu" threadpool
    ...
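
As a slightly more complete sketch of that usage -- the Client setup and the bump function below are illustrative assumptions, not part of this PR:

import dask
from dask.distributed import Client

client = Client()  # assumes workers that detected a GPU at startup


def bump(x):
    # Stand-in for GPU-bound work; any task is routed the same way.
    return x + 1


# Tasks submitted inside the annotate block target the single-threaded
# "gpu" executor; tasks outside it use the default worker threadpool.
with dask.annotate(executor="gpu"):
    future = client.submit(bump, 41)

print(future.result())  # 42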

@quasiben (Member)

I'm not sure this makes things smoother for the case of N+1 dask-worker processes, though perhaps it does make bringing up a single-GPU workflow a bit easier. In this PR, if there are N+1 procs, all workers will be on the first GPU device (GPU 0). This can be a problem both when there is a single GPU and when there are multiple GPUs.

In the single-GPU case, multiple workers on the same GPU can lead to unexpected OOM issues. In the multi-GPU case, the user will most likely expect Dask to leverage all the GPUs. Additionally, if we make this the default, it may complicate GPU workloads when using dask-cuda-workers -- there will be two executors which can perform "gpu" tasks.

I'm still thinking through this, so these may be overly conservative opinions.

@mrocklin (Member Author)

In this PR, if there are N+1 procs, all workers will be on the first GPU device (GPU 0). This can be a problem both when there is a single GPU and when there are multiple GPUs.

I'm not sure I understand. We're not focusing at all on CUDA_VISIBLE_DEVICES or anything. All we're saying here is that probably a dask worker is only managing one GPU, and so probably it should only run one of these tasks at a time.

@mrocklin (Member Author)

Additionally, if we make this the default, it may complicate GPU workloads when using dask-cuda-workers -- there will be two executors which can perform "gpu" tasks.

I would be curious to learn more about what dask-cuda does here. Also, presumably if we select the same name then there won't be a conflict. The more sophisticated dask-cuda will just do a better job for all of the tasks with the gpu executor specified.

@quasiben (Member)

I'm not sure I understand. We're not focusing at all on CUDA_VISIBLE_DEVICES or anything. All we're saying here is that probably a dask worker is only managing one GPU, and so probably it should only run one of these tasks at a time.

Understood, do you see having multiple workers on the same node as a GPU? If so, each worker will be on the same GPU. Is that a problem for the workloads you are interested in? What would you think about renaming the executor to single-gpu, akin to the scheduler kwarg single-threaded?

I would be curious to learn more about what dask-cuda does here. Also, presumably if we select the same name then there won't be a conflict. The more sophisticated dask-cuda will just do a better job for all of the tasks with the gpu executor specified.

At the moment Dask-CUDA does not create any additional executors -- it assumes 1 process/1 thread per GPU on the default executor. That said, Dask-CUDA could take the idea outlined in this PR and create additional CPU workers in a separate threadpool. This was something brought up in rapidsai/dask-cuda#540. Doing this would mean users would specifically have to annotate CPU code rather than GPU code -- the reverse of what is being proposed here.

One last question: do you think users will be confused by having multiple ways to execute GPU code with Dask? We've spent a fair amount of time trying to educate users on optimal ways to leverage Dask and GPUs, primarily through Dask-CUDA. Do you think we will have to frequently move users back and forth between Dask-CUDA and auto-detected GPU workers depending on their workload? Re-reading this, the last question seems overly concerned, but I'm choosing to leave it in for dramatic emphasis. It's been challenging maintaining the GPU pieces and ensuring everything works (though perhaps not as smoothly as it could).

@mrocklin (Member Author)

Understood, do you see having multiple workers on the same node as a GPU? If so, each worker will be on the same GPU. Is that a problem for the workloads you are interested in?

I think that we should also solve the multi-GPU problem. I view this as a first step. This change is orthogonal to the choice of how to manage CUDA_VISIBLE_DEVICES. We should also think about that though.
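
For concreteness, the general shape of that orthogonal piece might look like the following -- a hedged sketch of pinning each worker process to one device via CUDA_VISIBLE_DEVICES, not dask-cuda's actual implementation (the helper name is hypothetical):

import os


def pin_worker_to_gpu(worker_index, n_gpus):
    # Must run before CUDA is initialized in the worker process, so that
    # co-located workers land on different devices instead of all on GPU 0.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(worker_index % n_gpus)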

That said, Dask-CUDA could take the idea outlined in this PR and create additional CPU workers in a separate threadpool. This was something brought up in rapidsai/dask-cuda#540. Doing this would mean users would specifically have to annotate CPU code rather than GPU code -- the reverse of what is being proposed here.

I think that CPUs are still more commonly used, and should probably be considered the default. I think that by adding the following to dask-cudf dataframe instances, you would automatically use a single-threaded executor:

df = new_dd_object(...)
df.annotations["executor"] = "gpu"

Then everything would, I think, work as everyone desires without the user having to specifically annotate code.
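
As a hedged sketch of what that could look like with today's public API -- annotating at graph-construction time, since a collection-level annotations attribute like the one above is hypothetical; dask.dataframe stands in for dask-cudf, and the path and column are made up:

import dask
import dask.dataframe as dd

# Layers built while the annotate block is active carry the annotation,
# so the tasks for this collection would target the "gpu" executor.
with dask.annotate(executor="gpu"):
    df = dd.read_parquet("data/*.parquet")  # illustrative path
    total = df.x.sum()  # assumes a column named "x"

result = total.compute()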

One last question: do you think users will be confused by having multiple ways to execute GPU code with Dask? We've spent a fair amount of time trying to educate users on optimal ways to leverage Dask and GPUs, primarily through Dask-CUDA.

I think that RAPIDS users know about dask-cuda, but other folks not so much. I would like to upstream what we can of Dask-CUDA so that the learnings of that project can have a wider impact. Obviously some of the changes in dask-cuda are very specific to RAPIDS work, and so those will be harder to upstream. This seems like a first step, though. I would welcome your thinking on how to upstream other lessons learned in dask-cuda.

I don't expect dask-cuda will go away (I think that the RAPIDS team needs the freedom to innovate on their own) but maybe we can improve the lives of those folks who don't use it.

@quasiben (Member)

@rjzamora and I were going to chat about auto-annotation. I think this will take some thought/exploration. In the meantime, I would say to merge this PR and we can start thinking through how to upstream other core parts of dask-cuda. I believe @charlesbluca and the ops team are close to getting dask/distributed hooked into gpuCI for GPU testing.

@jrbourbeau (Member) left a comment

In the meantime, I would say to merge this PR and we can start thinking through how to upstream other core parts of dask-cuda

That sounds great. FWIW the test added here passes on the gpuCI build. It's nice to be able to develop with a bit more certainty around how things will work on GPUs.

Will merge once CI finishes up
