Initial GPU support #793
Conversation
Fabulous! How does the performance compare to https://github.com/rapidsai/cuDataShader?
For `var` and `std`, can we supply separate two-pass implementations and select those when needed for cu-backed operations?
Great work! Any plans to add other GPU-accelerated features?
Good question. For the 100 million point test (with the count reduction), this implementation is a bit faster. For 10 million, they are about the same.
Yes, I think we could. It's just that this paradigm can't be represented by the current reduction pipeline code, so it would take a bit of refactoring.
Thanks @exactlyallan!
What features do you have in mind?
Looks great!
As noted in a few specific comments below (e.g. about expanding `bounds` and similar tuples, plus replacing asserts with `assert_eq_X`, etc.), it looks like this PR includes a large number of code changes that appear to be refactorings not directly related to cuda support, though I assume they help enable and ease adding cuda support. With that in mind, would it be feasible for you to split this PR into two PRs? The first one would contain those refactorings, listing and justifying each one individually in a checklist, but without importing or using anything from cuda itself. The second PR could then focus on the cuda implementation on its own, which after the refactorings should be a relatively small amount of code.
I think this approach would help us validate the code changes better (particularly given that we can't easily run the GPU-related code for testing), and will help us debug things later if we find that any results or performance characteristics have changed due to the code in this PR (i.e. we can separate the effect of the refactoring from the effect of the GPU support itself).

To do this, you could start with the final PR, and just go through and eliminate the branches and code paths that have to do with "cu", leaving all the rest for PR 1, then rebase the current PR on top of PR 1. PR 1 can then list, describe, and defend each of these changes one by one, which can be approved and merged independently of cuda support. Cuda support should then fit in easily after that, with only a few lines of code and all specific to cuda. Does that sound reasonable to you?
For PR1, maybe "data_library/" instead of "datastructures/"? I'm resistant to the small minority who are trying to make that be a single word (and I appear to be on the winning side: google ngrams :-), and it's more about which library is being used than about data structures per se.
That's great that the performance is the same or better than cuDataShader; the code changes involved generally seem quite reasonable and something we can maintain. I'm greatly relieved to find that overall things fit quite well into the current architecture, without sacrificing performance!
We never got to GPU-ing trimesh, which we think could work as a means to render choropleth maps. Also, I'm curious whether GPU edge bundling makes sense too, like a version of hammer_bundle or something like the cuDataShader edge bundling notebook.
@jbednar, thanks for taking a look. Yeah, I can definitely split this up more. As you suggest, I'll create a first PR with as much of the refactoring as possible and a more atomic commit log.
I think we were planning on adding a general polygon implementation, which should be a lot more efficient than triangulating the polygon ahead of time. I'd really love to see both being GPU accelerated though.
Thanks! You can make a more atomic commit log if you want, and it certainly doesn't hurt, but for my purposes a single commit for all the non-cuda work would be fine, as long as there is a checklist in the PR text that explains what the non-trivial changes are. If you want to do that by commit, sure; whatever's easier for you to think about!
Right; I think a full polygon implementation will be much more efficient than a trimesh-based implementation, and we are already working on parts of that.
One wrinkle here is that cudf doesn't directly support pandas extension array types, so the …
Fantastic, rough ETA on that?
I'm just now trying to coordinate that task between 5 or 6 people, so I don't know yet, but should soon know!
@jonmmease cudf is based on Arrow memory, right? (But I don't know the details of how you interact with it on the Python level.) And I would also think that the RaggedArray you have for pandas should more or less directly map to Arrow's ListArray. Could that be a way to support this on the GPU?
@jorisvandenbossche, yes, a …
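(For anyone following along, here is a minimal illustration of the Arrow ListArray layout being discussed, assuming pyarrow is installed; this is not code from the PR.)

```python
import pyarrow as pa

# A ragged array (a list of variable-length lists) maps onto Arrow's
# ListArray as a flat values buffer plus an offsets buffer.
ragged = pa.array([[0.0, 1.0], [2.0], [3.0, 4.0, 5.0]],
                  type=pa.list_(pa.float64()))
print(ragged.offsets)  # offsets into the flat buffer: 0, 2, 3, 6
print(ragged.values)   # flat values: 0.0, 1.0, ..., 5.0
```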
Good news, it was pretty easy to add …
I've confirmed that this PR operates correctly with a …
Force-pushed from 4f85520 to 8d06620.
I've updated this PR based on master and added support for dask-cudf DataFrames. @philippjfr, when you have a chance, could you try this out on your NVIDIA hardware? It would also be nice to see if your holoviz/holoviews#3982 PR works alright with these changes.
I dunno why, but I somehow missed this huge news. Distributed GPU Datashader has been a dream of mine since the beginning... this is outstanding! @jonmmease, for the distributed GPU dataframe to show wins, maybe you just need a larger dataset... :-)
@pzwang, you were probably just distracted a little bit by becoming CEO or some such. :-) Yes, surely there would be a benefit for a large enough dataset, but we'll see...
Thanks for the kind words @pzwang! I'm very excited about this as well. When I push my 2 RTX 2080s to the limit, I can get almost 200 million rows of a 3-column DataFrame persisted. And this does indeed render in about 1/4 the time of a two-worker CPU Dask LocalCluster (~1000 ms vs. ~260 ms). Even so, it looks like there's a larger constant overhead for the LocalCudaCluster case.
Hey all, cuDF maintainer here, I'd just like to mirror @pzwang's comment that this is extremely exciting!
Is there any way you can run the computation twice to see if the second run is much faster? You may be hitting an operation that we currently JIT compile using Numba, and the JIT compilation overhead is typically ~200 ms. If it's not JIT compilation, would it be possible to dump a cProfile that I can get some eyes on, to see where we're currently spending our time? There are some big optimizations coming to cuDF with respect to avoiding a lot of slow Python control flow that I'm curious whether you're hitting.
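(For reference, a minimal sketch of the kind of cProfile dump being requested; the profiled function here is a stand-in, not Datashader code.)

```python
import cProfile
import pstats

def render():
    # Stand-in for the cudf-backed aggregation being profiled.
    return sum(i * i for i in range(1_000_000))

render()  # warm-up call, so any one-time JIT compilation cost is excluded

profiler = cProfile.Profile()
profiler.enable()
render()
profiler.disable()

profiler.dump_stats('render.prof')  # dump stats to a shareable file
pstats.Stats('render.prof').sort_stats('cumtime').print_stats(10)
```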
Hi @kkraus14, thanks for chiming in! The profiling I'm doing is running everything multiple times to give numba a chance to JIT everything, since Datashader itself already uses numba extensively. And to be clear, I'm seeing a really impressive/exciting speedup when operating directly on a single cudf DataFrame. We don't need to fully investigate this area before merging this PR, but I'll see if I can create an MWE that doesn't depend on Datashader and shows what I'm talking about.
I'll go ahead and merge this, as we don't have an imminent release; hopefully other people will be able to test it soon!
The polygon implementation is ready! It's in #826 and should be merged very soon now that the GPU support has been released. |
For anyone wondering, I think this is the current status of support for the various data backends in Datashader for drawing the various glyph types available; I'll try to find somewhere to put that in the docs.

Supporting GPU raster, trimesh, and quadmesh types should already be feasible, but those glyphs each use different rendering code that has to be implemented for GPUs separately, so it will take us a good bit of effort.

#826 has now been merged, adding choropleth (polygon) and outline (multiline) rendering to Datashader, but further extending that to work with the GPU has to wait on support for Pandas ExtensionArrays in cuDF. @kkraus14 and @exactlyallan, I don't know if NVIDIA is planning to support ExtensionArrays in cuDF, but it would open up some very cool applications! E.g. we've currently got a million-polygon choropleth dataset that takes 49 minutes to render one frame on the CPU with Datashader, which would surely get a big boost from the GPU.

We'd also love any help you can give with the LocalCudaCluster overhead issue, which I guess is waiting on an MRE from @jonmmease.
Overview
This PR adds initial GPU support to Datashader 🎉. This is implemented using a combination of the `cudf` and `cupy` libraries and numba's cuda support.

This work was inspired by the cuDataShader project: https://github.com/rapidsai/cuDataShader.
cc: @jbednar, @philippjfr, @exactlyallan
Supported Features
The following Datashader features can now be accelerated by an NVIDIA GPU supported by recent versions of `cudf`/`cupy`/`numba`:

- `Canvas.points` rasterization
- `Canvas.line` and `Canvas.area` rasterization
- All reductions except `var` and `std`. The current algorithm for these is a single-pass serial algorithm that doesn't extend to fine-grained parallelization. For GPU parallelization, I think we would want to use a two-pass algorithm (compute the mean during the first pass, then compute the sum of squared differences from the mean in the second pass), but this would require a bit more refactoring to support. (See the sketch after this list.)
- `transfer_functions.shade` (both 2D and 3D inputs)
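To make the two-pass idea concrete, here is a minimal NumPy sketch (an illustration only, not code from this PR): each pass is an independent reduction, so both parallelize in a fine-grained way.

```python
import numpy as np

def two_pass_var(values):
    # Pass 1: a plain sum reduction gives the mean.
    mean = values.sum() / values.size
    # Pass 2: an independent reduction over the squared
    # differences from that mean.
    return ((values - mean) ** 2).sum() / values.size

# Sanity check against numpy's built-in (population) variance.
x = np.random.default_rng(0).standard_normal(1_000)
assert np.isclose(two_pass_var(x), x.var())
```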
For the `points`/`line`/`area` methods, GPU acceleration is enabled automatically when the input data frame is a `cudf.DataFrame` instance. In this case, the aggregation results are returned in an xarray `DataArray` that is backed by a `cupy.ndarray` instance (rather than a `numpy.ndarray` instance). This way the aggregation results remain in GPU memory.
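For example, here is a minimal end-to-end sketch (assuming a CUDA-capable GPU with `cudf` and `cupy` installed; the column names and sizes are just for illustration):

```python
import numpy as np
import cudf
import datashader as ds
import datashader.transfer_functions as tf

# Illustrative data: 10 million random points.
n = 10_000_000
df = cudf.DataFrame({'x': np.random.standard_normal(n),
                     'y': np.random.standard_normal(n)})

cvs = ds.Canvas(plot_width=600, plot_height=600)

# Because df is a cudf.DataFrame, aggregation runs on the GPU and
# agg is an xarray DataArray backed by a cupy.ndarray.
agg = cvs.points(df, 'x', 'y', agg=ds.count())

# shade() sees the cupy-backed DataArray and stays GPU accelerated.
img = tf.shade(agg)
```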
The `transfer_functions.shade` function will be GPU accelerated if it is passed an xarray `DataArray` that is backed by a `cupy.ndarray` instance.

Performance
I created the following benchmark notebooks:

- `Canvas.points`: https://anaconda.org/jonmmease/gpu_datashader_points_pr/notebook
- `Canvas.line`: https://anaconda.org/jonmmease/gpu_datashader_lines_pr/notebook

For each of these notebooks, I compared the performance of passing a pandas DataFrame (single-threaded CPU), a dask DataFrame with 12 partitions (12-threaded CPU on a 14-core workstation), and a cudf DataFrame (GeForce RTX 2080).
Points with `count` aggregate

Rendering ~100 million points

Points with `count_cat` aggregate

Rendering ~100 million points

Line

Rendering 1 million length-10 lines
Testing
The test suite is set up to run the GPU tests only if `cupy` and `cudf` are installed. We should talk about how we want to handle CI testing of the GPU code going forward.
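For reference, one common pattern for this kind of conditional gating with pytest (a sketch; not necessarily the exact mechanism used in this PR):

```python
import pytest

try:
    import cudf   # noqa: F401
    import cupy   # noqa: F401
    HAS_GPU_DEPS = True
except ImportError:
    HAS_GPU_DEPS = False

@pytest.mark.skipif(not HAS_GPU_DEPS, reason="cudf/cupy not installed")
def test_points_cudf_count():
    # Placeholder body; real tests would compare GPU aggregations
    # against their CPU counterparts.
    assert True
```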