Initial GPU support #793

Merged (1 commit, Oct 19, 2019)

Conversation

jonmmease
Collaborator

@jonmmease jonmmease commented Oct 1, 2019

Overview

This PR adds initial GPU support to Datashader 🎉. This is implemented using a combination of the cudf and cupy libraries and numba's cuda support.

This work was inspired by the cuDataShader project: https://github.com/rapidsai/cuDataShader.

cc: @jbednar, @philippjfr, @exactlyallan

Supported Features

The following Datashader features can now be accelerated on NVIDIA GPUs supported by recent versions of cudf, cupy, and numba.

  • Canvas.points rasterization
  • Canvas.line and Canvas.area rasterization
  • All reduction operations except var and std. The current algorithm for these is a single-pass serial algorithm that doesn't extend to fine-grained parallelization. For GPU parallelization, I think we would want to use a two-pass algorithm (compute the mean during the first pass, then compute the sum of squared differences from the mean in the second pass), but this would require a bit more refactoring to support.
  • transfer_functions.shade (for both 2D and 3D inputs)

For the points/line/area methods, GPU acceleration is enabled automatically when the input data frame is a cudf.DataFrame instance. In this case, the aggregation results are returned in an xarray DataArray that is backed by a cupy.ndarray instance (rather than a numpy.ndarray instance). This way the aggregation results remain in GPU memory.

The transfer_functions.shade function will be GPU accelerated if it is passed an xarray DataArray that is backed by a cupy.ndarray instance.
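
For concreteness, here's a minimal sketch of that usage (the random data and canvas size are just for illustration; it assumes cudf, cupy, numba, and datashader are installed and an NVIDIA GPU is available):

```python
import numpy as np
import cudf
import datashader as ds
import datashader.transfer_functions as tf

# Made-up random point data; any cudf.DataFrame with x/y columns works.
n = 1_000_000
gdf = cudf.DataFrame({
    "x": np.random.standard_normal(n),
    "y": np.random.standard_normal(n),
})

cvs = ds.Canvas(plot_width=600, plot_height=600)

# Because the input is a cudf.DataFrame, the aggregation runs on the GPU and
# the result is an xarray DataArray backed by a cupy.ndarray.
agg = cvs.points(gdf, "x", "y", agg=ds.count())

# shade() sees the cupy-backed DataArray and stays on the GPU as well.
img = tf.shade(agg, how="log")
```

The call signatures are the same as the existing pandas path; the only change from a user's perspective is the type of the input DataFrame.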

Performance

I created benchmark notebooks for each of the scenarios below.

For each notebook, I compared the performance of passing a pandas DataFrame (single-threaded CPU), a dask DataFrame with 12 partitions (12 threads on a 14-core CPU workstation), and a cudf DataFrame (GeForce RTX 2080).

Points with count aggregate

Rendering ~100 million points

  • pandas: 1280 ms
  • dask: 284 ms (~4.5x speedup from pandas)
  • cudf: 31.2 ms (~40x speedup from pandas)

Points with count_cat aggregate

Rendering ~100 million points

  • pandas: 1690 ms
  • dask: 681 ms (~2.5x speedup from pandas)
  • cudf: 83.6 ms (~20x speedup from pandas)

Line

Rendering 1 million length-10 lines

  • pandas: 2060 ms
  • dask: 317 ms (~6.5x speedup from pandas)
  • cudf: 69.3 ms (~30x speedup from pandas)

Testing

The test suite is set up to run the GPU tests only if cupy and cudf are installed. We should talk about how we want to handle CI testing of the GPU code going forward.
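
As an illustration, conditional skipping along these lines can be done with pytest; this is a sketch of the pattern, not necessarily the exact scaffolding used in this PR:

```python
# test_gpu.py -- illustrative only
import pytest

try:
    import cudf   # noqa: F401
    import cupy   # noqa: F401
    HAS_GPU_LIBS = True
except ImportError:
    HAS_GPU_LIBS = False

# Tests marked this way run only when both cudf and cupy can be imported.
gpu_only = pytest.mark.skipif(
    not HAS_GPU_LIBS, reason="cudf and cupy are required for GPU tests"
)

@gpu_only
def test_points_count_on_gpu():
    import numpy as np
    import cudf
    import cupy
    import datashader as ds

    gdf = cudf.DataFrame({"x": np.array([0.1, 0.9]), "y": np.array([0.1, 0.9])})
    cvs = ds.Canvas(plot_width=4, plot_height=4, x_range=(0, 1), y_range=(0, 1))
    agg = cvs.points(gdf, "x", "y", agg=ds.count())

    # The aggregate should stay on the GPU as a cupy-backed DataArray.
    assert isinstance(agg.data, cupy.ndarray)
    assert int(agg.data.sum()) == 2
```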

@jbednar
Member

jbednar commented Oct 1, 2019

Fabulous! How does the performance compare to https://github.com/rapidsai/cuDataShader ?

@jbednar
Member

jbednar commented Oct 1, 2019

For var and std, can we supply separate two-pass implementations and select those when needed for cu-backed operations?

@exactlyallan

exactlyallan commented Oct 1, 2019

Great work! Any plans to add other GPU accelerated features?
cc: @AjayThorve @WXBN

@jonmmease
Collaborator Author

Fabulous! How does the performance compare

Good question. For the 100 million point test (with count reduction) this implementation is a bit faster.

  • pandas: 1280 ms
  • cudf (this PR): 31.2 ms (~40x speedup from pandas)
  • cuDataShader: 96 ms (~13x speedup from pandas)

For 10 million, they are about the same:

  • pandas: 176 ms
  • cudf (this PR): 13.6 ms (~13x speedup from pandas)
  • cuDataShader: 13.2 ms (~13x speedup from pandas)

For var and std, can we supply separate two-pass implementations and select those when needed for cu-backed operations?

Yes, I think we could. It's just that this paradigm can't be represented by the current reduction pipeline code, so it would take a bit of refactoring.
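
For reference, here's a sketch of the two-pass idea on a flat array, written with cupy just for illustration; the real reduction pipeline would need to do this per pixel bin inside its compiled kernels:

```python
import cupy as cp

def two_pass_var(values: cp.ndarray) -> cp.ndarray:
    """Variance via two parallel-friendly passes: mean, then squared deviations."""
    # Pass 1: mean (a simple parallel reduction).
    mean = values.mean()
    # Pass 2: mean of squared differences from the pass-1 mean.
    return ((values - mean) ** 2).mean()

vals = cp.random.standard_normal(1_000_000)
print(two_pass_var(vals), cp.var(vals))  # the two should agree closely
```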

@jonmmease
Collaborator Author

Thanks @exactlyallan!

Any plans to add other GPU accelerated features?

What features do you have in mind?

@jbednar
Member

@jbednar jbednar left a comment

Looks great!

As noted in a few specific comments below (e.g. about expanding bounds and similar tuples, plus replacing asserts with assert_eq_X, etc.), it looks like this PR includes a large number of code changes that appear to be refactorings not directly related to cuda support, though I assume they help enable and ease adding cuda support. With that in mind, would it be feasible for you to split this PR into two PRs? The first one would contain those refactorings, listing and justifying each one individually in a checklist, but without importing or using anything from cuda itself. The second PR could then focus on the cuda implementation on its own, which after the refactorings should be a relatively small amount of code.

I think this approach would help us validate the code changes better (particularly given that we can't easily run the GPU-related code for testing), and will help us debug things later if we find that any results or performance characteristics have changed due to the code in this PR (i.e. we can separate the effect of the refactoring from the effect of the GPU support itself). To do this, you could start with the final PR, and just go through and eliminate the branches and code paths that have to do with "cu", leaving all the rest for PR 1, then rebase the current PR on top of PR 1. PR 1 can then list, describe, and defend each of these changes one by one, which can be approved and merged independently of cuda support. Cuda support should then fit in easily after that, with only a few lines of code and all specific to cuda. Does that sound reasonable to you?

For PR1, maybe "data_library/" instead of "datastructures/"? I'm resistant to the small minority who are trying to make that be a single word (and I appear to be on the winning side: google ngrams :-), and it's more about which library is being used than about data structures per se.

@jbednar
Member

jbednar commented Oct 1, 2019

That's great that the performance is the same or better than cuDataShader; the code changes involved generally seem quite reasonable and something we can maintain. I'm greatly relieved to find that overall things fit quite well into the current architecture, without sacrificing performance!

@exactlyallan

We never got to GPU-ing trimesh, which we think could work as a means to render choropleth maps.

Also, I'm curious whether GPU edge bundling makes sense too, like a GPU version of hammer_bundle or something along the lines of the cuDataShader edge bundling notebook.

@jonmmease
Collaborator Author

@jbednar, thanks for taking a look. Yeah, I can definitely split this up more. As you suggest, I'll create a first PR with as much of the refactoring as possible and a more atomic commit log.

@philippjfr
Member

We never got to GPU-ing trimesh, which we think could work as a means to render choropleth maps.

I think we were planning on adding a general polygon implementation, which should be a lot more efficient than triangulating the polygon ahead of time. I'd really love to see both being GPU accelerated though.

@jbednar
Member

jbednar commented Oct 1, 2019

Thanks! You can make a more atomic commit log if you want, and it certainly doesn't hurt, but for my purposes a single commit for all the non-cuda work would be fine, as long as there is a checklist in the PR text that explains what the non-trivial changes are. If you want to do that by commit, sure; whatever's easier for you to think about!

@jbednar
Member

jbednar commented Oct 1, 2019

Right; I think a full polygon implementation will be much more efficient than a trimesh-based implementation, and we are already working on parts of that.

@jonmmease
Collaborator Author

I think we were planning on adding a general polygon implementation, which should be a lot more efficient than triangulating the polygon ahead of time. I'd really love to see both being GPU accelerated though.

One wrinkle here is that cudf doesn't directly support pandas extension array types, so the RaggedArray type that Datashader currently uses for variable-length lines, and that I proposed using for polygons in #181 (comment), won't work out of the box right now.

@exactlyallan

Right; I think a full polygon implementation will be much more efficient than a trimesh-based implementation, and we are already working on parts of that.

Fantastic, rough ETA on that?

@jbednar
Member

jbednar commented Oct 1, 2019

I'm just now trying to coordinate that task between 5 or 6 people, so I don't know yet, but should soon know!

@jorisvandenbossche

One wrinkle here is that cudf doesn't directly support pandas extension array types, so the RaggedArray type that Datashader currently uses for variable-length lines, and that I proposed using for polygons in #181 (comment), won't work out of the box right now.

@jonmmease cudf is based on Arrow memory, right? (Though I don't know the details of how you interact with it at the Python level.) I would also think that the RaggedArray you have for pandas should map more or less directly to Arrow's ListArray. Could that be a way to support this on the GPU?
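
As a small CPU-side illustration of that mapping with pyarrow (whether cudf can consume this layout directly is exactly the open question):

```python
import pyarrow as pa

# Three variable-length lines stored as one ListArray: a flat child array of
# x coordinates plus offsets marking where each line starts and ends.
x_lists = pa.array([
    [0.0, 1.0, 2.0],        # line 0: 3 vertices
    [0.0, 0.5],             # line 1: 2 vertices
    [1.0, 2.0, 3.0, 4.0],   # line 2: 4 vertices
], type=pa.list_(pa.float64()))

print(x_lists.offsets)  # [0, 3, 5, 9] -- the same flat-buffer-plus-offsets layout a RaggedArray uses
print(x_lists.values)   # the flattened coordinate buffer
```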

@jonmmease
Collaborator Author

@jorisvandenbossche, yes, a ListArray sounds like the right data structure. I haven't looked into the internals of cudf, but this does sound like a good direction. Thanks!

@jonmmease
Collaborator Author

Good news, it was pretty easy to add dask_cudf support on top of what I already had here. So multi-GPU support should also now work! I say should because I've only tested the dask_cudf support on a single GPU, in which case there is predictably no performance benefit. I have a second GPU on the way so I'll soon have a chance to test it out with two GPUs on the same machine.
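
For anyone who wants to try it, here's a rough sketch of the dask_cudf path with made-up data (in practice the frame would more likely come from something like dask_cudf.read_parquet):

```python
import numpy as np
import cudf
import dask_cudf
import datashader as ds

# Made-up random point data, partitioned into a dask_cudf DataFrame.
n = 10_000_000
gdf = cudf.DataFrame({
    "x": np.random.standard_normal(n),
    "y": np.random.standard_normal(n),
})
ddf = dask_cudf.from_cudf(gdf, npartitions=2).persist()

cvs = ds.Canvas(plot_width=600, plot_height=600)
agg = cvs.points(ddf, "x", "y", agg=ds.count())   # same call as the single-GPU case
```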

@jonmmease
Collaborator Author

I've confirmed that this PR operates correctly with a dask_cudf DataFrame distributed across multiple GPUs. Unfortunately, I haven't found a case yet where there's a performance benefit for doing so (compared to using a normal CPU dask DataFrame).

@jonmmease
Collaborator Author

I've updated this PR based on master and added support for line/area with axis=0 (in addition to axis=1).

@philippjfr, when you have a chance, could you try this out on your NVIDIA hardware? It would also be nice to see whether your holoviz/holoviews#3982 PR works alright with these changes.
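
For anyone trying this out, here's a sketch of the two layouts as I understand the axis convention (the column names and data are made up):

```python
import cudf
import datashader as ds

cvs = ds.Canvas(plot_width=400, plot_height=400)

# axis=0: x and y are single columns; successive rows are connected vertices of one line.
df0 = cudf.DataFrame({"x": [0.0, 1.0, 2.0], "y": [0.0, 1.0, 0.0]})
agg0 = cvs.line(df0, "x", "y", agg=ds.count(), axis=0)

# axis=1: each row holds one whole line, spread across several columns.
df1 = cudf.DataFrame({
    "x0": [0.0, 0.0], "x1": [1.0, 2.0], "x2": [2.0, 4.0],
    "y0": [0.0, 1.0], "y1": [1.0, 0.0], "y2": [0.0, 1.0],
})
agg1 = cvs.line(df1, x=["x0", "x1", "x2"], y=["y0", "y1", "y2"],
                agg=ds.count(), axis=1)
```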

@pzwang

pzwang commented Oct 12, 2019

I dunno why, but I somehow missed this huge news. Distributed GPU DataShader has been a dream of mine since the beginning... this is outstanding!

@jonmmease For the distributed GPU dataframe to show wins, maybe you just need a larger dataset... :-)

@jbednar
Member

jbednar commented Oct 12, 2019

@pzwang , you were probably just distracted a little bit by becoming CEO or some such. :-) Yes, surely there would be a benefit for a large enough dataset, but we'll see...

@jonmmease
Collaborator Author

Thanks for the kind words @pzwang! I'm very excited about this as well.

When I push my 2 RTX 2080s to the limit, I can get almost 200 million rows of a 3-column DataFrame persisted. And this does indeed render in about 1/4 the time of a two-worker CPU Dask LocalCluster (~1000 ms on the CPU cluster vs. ~260 ms on the GPUs).

Even so, it looks like there's a larger constant overhead for the LocalCudaCluster compared to the normal LocalCluster. If I render only the first 1000 points of this DataFrame, the LocalCluster time is ~100 ms, but with the LocalCudaCluster it's ~220 ms. So perhaps there's some optimization that could be done in dask-cudf to bring this overhead more in line with dask.dataframe when used with the distributed scheduler.
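
Roughly, the comparison looks like the sketch below; the cluster setup, dataset size, and column names are illustrative assumptions rather than the exact benchmark code. For the CPU comparison, swap LocalCUDACluster for dask.distributed.LocalCluster and use a pandas-backed dask DataFrame instead.

```python
import time
import numpy as np
import cudf
import dask_cudf
import datashader as ds
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

cluster = LocalCUDACluster()      # one worker per visible GPU
client = Client(cluster)

# Made-up data, much smaller than the ~200M-row case described above.
n = 20_000_000
gdf = cudf.DataFrame({"x": np.random.standard_normal(n),
                      "y": np.random.standard_normal(n),
                      "val": np.random.standard_normal(n)})
gddf = dask_cudf.from_cudf(gdf, npartitions=2).persist()

cvs = ds.Canvas(plot_width=900, plot_height=600)

# Run the aggregation twice so numba's JIT compilation only affects the first run.
for _ in range(2):
    t0 = time.perf_counter()
    cvs.points(gddf, "x", "y", agg=ds.count())
    print(f"{(time.perf_counter() - t0) * 1000:.0f} ms")
```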

@kkraus14

Hey all, cuDF maintainer here, I'd just like to mirror @pzwang's comment that this is extremely exciting!

When I push my 2 RTX 2080s to the limit, I can get almost 200 million rows of a 3-column DataFrame persisted. And this does indeed render in about 1/4 the time of a two-worker CPU Dask LocalCluster (~1000 ms on the CPU cluster vs. ~260 ms on the GPUs).

Even so, it looks like there's a larger constant overhead for the LocalCudaCluster compared to the normal LocalCluster. If I render only the first 1000 points of this DataFrame, the LocalCluster time is ~100 ms, but with the LocalCudaCluster it's ~220 ms. So perhaps there's some optimization that could be done in dask-cudf to bring this overhead more in line with dask.dataframe when used with the distributed scheduler.

Is there any way you can run the computation twice to see if the second run is much faster? You may be hitting an operation that we currently JIT compile using Numba, and the JIT compilation overhead is typically ~200 ms. If it's not JIT compilation, would it be possible to dump a cProfile that I can get some eyes on to see where we're currently spending our time? There are some big optimizations coming to cuDF with respect to avoiding a lot of slow Python control flow, and I'm curious whether you're hitting that.
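
For reference, one way to capture such a profile; this is only a sketch that assumes cvs, gddf, and ds from a setup like the sketch above, the output filename is a placeholder, and a client-side profile will of course not show time spent inside the dask workers:

```python
import cProfile
import pstats

pr = cProfile.Profile()
pr.enable()
cvs.points(gddf, "x", "y", agg=ds.count())   # the dask_cudf call being investigated
pr.disable()

pr.dump_stats("datashader_dask_cudf.prof")   # file that can be shared for inspection
pstats.Stats(pr).sort_stats("cumulative").print_stats(20)
```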

@jonmmease
Collaborator Author

Hi @kkraus14, thanks for chiming in! The profiling I'm doing is running everything multiple times to give numba a chance to JIT everything since Datashader itself already uses numba extensively.

And to be clear, I'm seeing really impressive/exciting speedup when operating directly on a single cudf.DataFrame (See benchmarks in the top issue post). The place where I'm seeing some overhead that looks a bit larger than I would expect is only when operating on dask_cudf.DataFrame instances spread across multiple GPUs using a LocalCUDACluster.

We don't need to fully investigate this area before merging this PR, but I'll see if I can create an MWE that doesn't depend on Datashader and shows what I'm talking about.

@jbednar
Member

jbednar commented Oct 19, 2019

I'll go ahead and merge this, as we don't have an imminent release; hopefully other people will be able to test it soon!

@jbednar
Member

jbednar commented Dec 8, 2019

Right; I think a full polygon implementation will be much more efficient than a trimesh-based implementation, and we are already working on parts of that.
Fantastic, rough ETA on that?

The polygon implementation is ready! It's in #826 and should be merged very soon now that the GPU support has been released.

@jbednar
Member

jbednar commented Dec 11, 2019

For anyone wondering, I think this is the current status of support for the various data backends in Datashader for drawing the various glyph types available:

[image: table of glyph types vs. supported data backends]

I'll try to find somewhere to put that in the docs.

Supporting GPU raster, trimesh, and quadmesh types should already be feasible, but those glyphs each use different rendering code that has to be implemented for GPUs separately, so it will take us a good bit of effort. #826 has now been merged, adding choropleth (polygon) and outline (multiline) rendering to Datashader, but further extending that to work with the GPU has to wait on support for Pandas ExtensionArrays in cuDF.

@kkraus14 and @exactlyallan, I don't know if NVIDIA is planning to support ExtensionArrays in cuDF, but it would open up some very cool applications! E.g. we've currently got a million-polygon choropleth dataset that takes 49 minutes to render one frame on the CPU with Datashader, which would surely get a big boost from the GPU.

We'd also love any help you can give with the LocalCudaCluster overhead issue, which I guess is waiting on an MRE from @jonmmease.
