Initial GPU support #793

Merged (1 commit, Oct 19, 2019)

Conversation

jonmmease
Collaborator

@jonmmease jonmmease commented Oct 1, 2019

Overview

This PR adds initial GPU support to Datashader 🎉. This is implemented using a combination of the cudf and cupy libraries and numba's cuda support.

This work was inspired by the cuDataShader project: https://github.com/rapidsai/cuDataShader.

cc: @jbednar, @philippjfr, @exactlyallan

Supported Features

The following Datashader features can now be accelerated on NVIDIA GPUs supported by recent versions of cudf, cupy, and numba.

  • Canvas.points rasterization
  • Canvas.line and Canvas.area rasterization
  • All reduction operations except var and std. The current algorithm for these is a single-pass serial algorithm that doesn't extend to fine-grained parallelization. For GPU parallelization, I think we would want to use a two-pass algorithm (compute the mean during the first pass, then compute the sum of squared differences from the mean in the second pass), but this would require a bit more refactoring to support.
  • transfer_functions.shade (for both 2D and 3D inputs)

For the points/line/area methods, GPU acceleration is enabled automatically when the input data frame is a cudf.DataFrame instance. In this case, the aggregation results are returned in an xarray DataArray that is backed by a cupy.ndarray instance (rather than a numpy.ndarray instance). This way the aggregation results remain in GPU memory.

The transfer_functions.shade function will be GPU accelerated if it is passed an xarray DataArray that is backed by a cupy.ndarray instance.
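
For concreteness, here's a minimal sketch of that usage (the random data and canvas size are just for illustration; it assumes cudf, cupy, numba, and datashader are installed and an NVIDIA GPU is available):

```python
import numpy as np
import cudf
import datashader as ds
import datashader.transfer_functions as tf

# Made-up random point data; any cudf.DataFrame with x/y columns works.
n = 1_000_000
gdf = cudf.DataFrame({
    "x": np.random.standard_normal(n),
    "y": np.random.standard_normal(n),
})

cvs = ds.Canvas(plot_width=600, plot_height=600)

# Because the input is a cudf.DataFrame, the aggregation runs on the GPU and
# the result is an xarray DataArray backed by a cupy.ndarray.
agg = cvs.points(gdf, "x", "y", agg=ds.count())

# shade() sees the cupy-backed DataArray and stays on the GPU as well.
img = tf.shade(agg, how="log")
```

The call signatures are the same as the existing pandas path; the only change from a user's perspective is the type of the input DataFrame.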

Performance

I created benchmark notebooks for each of the scenarios below.

For each notebook, I compared the performance of passing a pandas DataFrame (single-threaded CPU), a dask DataFrame with 12 partitions (12 threads on a 14-core CPU workstation), and a cudf DataFrame (GeForce RTX 2080).

Points with count aggregate

Rendering ~100 million points

  • pandas: 1280 ms
  • dask: 284 ms (~4.5x speedup from pandas)
  • cudf: 31.2 ms (~40x speedup from pandas)

Points with count_cat aggregate

Rendering ~100 million points

  • pandas: 1690 ms
  • dask: 681 ms (~2.5x speedup from pandas)
  • cudf: 83.6 ms (~20x speedup from pandas)

Line

Rendering 1 million length-10 lines

  • pandas: 2060 ms
  • dask: 317 ms (~6.5x speedup from pandas)
  • cudf: 69.3 ms (~30x speedup from pandas)

Testing

The test suite is set up to run the GPU tests only if cupy and cudf are installed. We should talk about how we want to handle CI testing of the GPU code going forward.
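
As an illustration, conditional skipping along these lines can be done with pytest; this is a sketch of the pattern, not necessarily the exact scaffolding used in this PR:

```python
# test_gpu.py -- illustrative only
import pytest

try:
    import cudf   # noqa: F401
    import cupy   # noqa: F401
    HAS_GPU_LIBS = True
except ImportError:
    HAS_GPU_LIBS = False

# Tests marked this way run only when both cudf and cupy can be imported.
gpu_only = pytest.mark.skipif(
    not HAS_GPU_LIBS, reason="cudf and cupy are required for GPU tests"
)

@gpu_only
def test_points_count_on_gpu():
    import numpy as np
    import cudf
    import cupy
    import datashader as ds

    gdf = cudf.DataFrame({"x": np.array([0.1, 0.9]), "y": np.array([0.1, 0.9])})
    cvs = ds.Canvas(plot_width=4, plot_height=4, x_range=(0, 1), y_range=(0, 1))
    agg = cvs.points(gdf, "x", "y", agg=ds.count())

    # The aggregate should stay on the GPU as a cupy-backed DataArray.
    assert isinstance(agg.data, cupy.ndarray)
    assert int(agg.data.sum()) == 2
```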

@jbednar
Member

jbednar commented Oct 1, 2019

Fabulous! How does the performance compare to https://github.com/rapidsai/cuDataShader ?

@jbednar
Member

jbednar commented Oct 1, 2019

For var and std, can we supply separate two-pass implementations and select those when needed for cu-backed operations?

@exactlyallan

exactlyallan commented Oct 1, 2019

Great work! Any plans to add other GPU accelerated features?
cc: @AjayThorve @WXBN

@jonmmease
Collaborator Author

Fabulous! How does the performance compare

Good question. For the 100 million point test (with count reduction) this implementation is a bit faster.

  • pandas: 1280 ms
  • cudf (this PR): 31.2 ms (~40x speedup from pandas)
  • cuDataShader: 96 ms (~13x speedup from pandas)

For 10 million, they are about the same:

  • pandas: 176 ms
  • cudf (this PR): 13.6 ms (~13x speedup from pandas)
  • cuDataShader: 13.2 ms (~13x speedup from pandas)

For var and std, can we supply separate two-pass implementations and select those when needed for cu-backed operations?

Yes, I think we could. It's just that this paradigm can't be represented by the current reduction pipeline code, so it would take a bit of refactoring.
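
For reference, here's a sketch of the two-pass idea on a flat array, written with cupy just for illustration; the real reduction pipeline would need to do this per pixel bin inside its compiled kernels:

```python
import cupy as cp

def two_pass_var(values: cp.ndarray) -> cp.ndarray:
    """Variance via two parallel-friendly passes: mean, then squared deviations."""
    # Pass 1: mean (a simple parallel reduction).
    mean = values.mean()
    # Pass 2: mean of squared differences from the pass-1 mean.
    return ((values - mean) ** 2).mean()

vals = cp.random.standard_normal(1_000_000)
print(two_pass_var(vals), cp.var(vals))  # the two should agree closely
```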

@jonmmease
Collaborator Author

Thanks @exactlyallan!

Any plans to add other GPU accelerated features?

What features do you have in mind?

@jbednar
Member

@jbednar jbednar left a comment

Looks great!

As noted in a few specific comments below (e.g. about expanding bounds and similar tuples, plus replacing asserts with assert_eq_X, etc.), it looks like this PR includes a large number of code changes that appear to be refactorings not directly related to cuda support, though I assume they help enable and ease adding cuda support. With that in mind, would it be feasible for you to split this PR into two PRs? The first one would contain those refactorings, listing and justifying each one individually in a checklist, but without importing or using anything from cuda itself. The second PR could then focus on the cuda implementation on its own, which after the refactorings should be a relatively small amount of code.

I think this approach would help us validate the code changes better (particularly given that we can't easily run the GPU-related code for testing), and will help us debug things later if we find that any results or performance characteristics have changed due to the code in this PR (i.e. we can separate the effect of the refactoring from the effect of the GPU support itself). To do this, you could start with the final PR, and just go through and eliminate the branches and code paths that have to do with "cu", leaving all the rest for PR 1, then rebase the current PR on top of PR 1. PR 1 can then list, describe, and defend each of these changes one by one, which can be approved and merged independently of cuda support. Cuda support should then fit in easily after that, with only a few lines of code and all specific to cuda. Does that sound reasonable to you?

For PR1, maybe "data_library/" instead of "datastructures/"? I'm resistant to the small minority who are trying to make that be a single word (and I appear to be on the winning side: google ngrams :-), and it's more about which library is being used than about data structures per se.

@jbednar
Member

jbednar commented Oct 1, 2019

That's great that the performance is the same or better than cuDataShader; the code changes involved generally seem quite reasonable and something we can maintain. I'm greatly relieved to find that overall things fit quite well into the current architecture, without sacrificing performance!

@exactlyallan

We never got to GPU-ing trimesh, which we think could work as a means to render choropleth maps.

Also, I'm curious whether GPU edge bundling makes sense too, like a GPU version of hammer_bundle or something along the lines of the cuDataShader edge bundling notebook.

@jonmmease
Collaborator Author

@jbednar, thanks for taking a look. Yeah, I can definitely split this up more. As you suggest, I'll create a first PR with as much of the refactoring as possible and a more atomic commit log.

@philippjfr
Member

We never got to GPU-ing trimesh, which we think could work as a means to render choropleth maps.

I think we were planning on adding a general polygon implementation, which should be a lot more efficient than triangulating the polygon ahead of time. I'd really love to see both being GPU accelerated though.

@jbednar
Member

jbednar commented Oct 1, 2019

Thanks! You can make a more atomic commit log if you want, and it certainly doesn't hurt, but for my purposes a single commit for all the non-cuda work would be fine, as long as there is a checklist in the PR text that explains what the non-trivial changes are. If you want to do that by commit, sure; whatever's easier for you to think about!

@jbednar
Member

jbednar commented Oct 1, 2019

Right; I think a full polygon implementation will be much more efficient than a trimesh-based implementation, and we are already working on parts of that.

@jonmmease
Collaborator Author

I think we were planning on adding a general polygon implementation, which should be a lot more efficient than triangulating the polygon ahead of time. I'd really love to see both being GPU accelerated though.

One wrinkle here is that cudf doesn't directly support pandas extension array types, so the RaggedArray type that Datashader currently uses for variable-length lines, and that I proposed using for polygons in #181 (comment), won't work out of the box right now.

@exactlyallan

Right; I think a full polygon implementation will be much more efficient than a trimesh-based implementation, and we are already working on parts of that.

Fantastic, rough ETA on that?

@jbednar
Member

jbednar commented Oct 1, 2019

I'm just now trying to coordinate that task between 5 or 6 people, so I don't know yet, but should soon know!

@jorisvandenbossche

One wrinkle here is that cudf doesn't directly support pandas extension array types, so the RaggedArray type that Datashader currently uses for variable-length lines, and that I proposed using for polygons in #181 (comment), won't work out of the box right now.

@jonmmease cudf is based on Arrow memory, right? (Though I don't know the details of how you interact with it at the Python level.) I would also think that the RaggedArray you have for pandas should map more or less directly to Arrow's ListArray. Could that be a way to support this on the GPU?
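
As a small CPU-side illustration of that mapping with pyarrow (whether cudf can consume this layout directly is exactly the open question):

```python
import pyarrow as pa

# Three variable-length lines stored as one ListArray: a flat child array of
# x coordinates plus offsets marking where each line starts and ends.
x_lists = pa.array([
    [0.0, 1.0, 2.0],        # line 0: 3 vertices
    [0.0, 0.5],             # line 1: 2 vertices
    [1.0, 2.0, 3.0, 4.0],   # line 2: 4 vertices
], type=pa.list_(pa.float64()))

print(x_lists.offsets)  # [0, 3, 5, 9] -- the same flat-buffer-plus-offsets layout a RaggedArray uses
print(x_lists.values)   # the flattened coordinate buffer
```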

@jonmmease
Collaborator Author

@jorisvandenbossche, yes, a ListArray sounds like the right data structure. I haven't looked into the internals of cudf, but this does sound like a good direction. Thanks!

@jonmmease
Collaborator Author

Good news, it was pretty easy to add dask_cudf support on top of what I already had here. So multi-GPU support should also now work! I say should because I've only tested the dask_cudf support on a single GPU, in which case there is predictably no performance benefit. I have a second GPU on the way so I'll soon have a chance to test it out with two GPUs on the same machine.
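
For anyone who wants to try it, here's a rough sketch of the dask_cudf path with made-up data (in practice the frame would more likely come from something like dask_cudf.read_parquet):

```python
import numpy as np
import cudf
import dask_cudf
import datashader as ds

# Made-up random point data, partitioned into a dask_cudf DataFrame.
n = 10_000_000
gdf = cudf.DataFrame({
    "x": np.random.standard_normal(n),
    "y": np.random.standard_normal(n),
})
ddf = dask_cudf.from_cudf(gdf, npartitions=2).persist()

cvs = ds.Canvas(plot_width=600, plot_height=600)
agg = cvs.points(ddf, "x", "y", agg=ds.count())   # same call as the single-GPU case
```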

@jonmmease
Collaborator Author

I've confirmed that this PR operates correctly with a dask_cudf DataFrame distributed across multiple GPUs. Unfortunately, I haven't found a case yet where there's a performance benefit for doing so (compared to using a normal CPU dask DataFrame).

@jonmmease
Collaborator Author

I've updated this PR based on master and added support for line/area with axis=0 (in addition to axis=1).

@philippjfr, when you have a chance, could you try this out on your NVIDIA hardware? It would also be nice to see whether your holoviz/holoviews#3982 PR works alright with these changes.
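
For anyone trying this out, here's a sketch of the two layouts as I understand the axis convention (the column names and data are made up):

```python
import cudf
import datashader as ds

cvs = ds.Canvas(plot_width=400, plot_height=400)

# axis=0: x and y are single columns; successive rows are connected vertices of one line.
df0 = cudf.DataFrame({"x": [0.0, 1.0, 2.0], "y": [0.0, 1.0, 0.0]})
agg0 = cvs.line(df0, "x", "y", agg=ds.count(), axis=0)

# axis=1: each row holds one whole line, spread across several columns.
df1 = cudf.DataFrame({
    "x0": [0.0, 0.0], "x1": [1.0, 2.0], "x2": [2.0, 4.0],
    "y0": [0.0, 1.0], "y1": [1.0, 0.0], "y2": [0.0, 1.0],
})
agg1 = cvs.line(df1, x=["x0", "x1", "x2"], y=["y0", "y1", "y2"],
                agg=ds.count(), axis=1)
```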

@pzwang

pzwang commented Oct 12, 2019

I dunno why, but I somehow missed this huge news. Distributed GPU DataShader has been a dream of mine since the beginning... this is outstanding!

@jonmmease For the distributed GPU dataframe to show wins, maybe you just need a larger dataset... :-)

@jbednar
Member

jbednar commented Oct 12, 2019

@pzwang , you were probably just distracted a little bit by becoming CEO or some such. :-) Yes, surely there would be a benefit for a large enough dataset, but we'll see...

@jonmmease
Collaborator Author

Thanks for the kind words @pzwang! I'm very excited about this as well.

When I push my 2 RTX 2080s to the limit, I can get almost 200 million rows of a 3-column DataFrame persisted. And this does indeed render in about 1/4 the time of a two-worker CPU Dask LocalCluster (~1000 ms on the CPU cluster vs. ~260 ms on the GPUs).

Even so, it looks like there's a larger constant overhead for the LocalCudaCluster compared to the normal LocalCluster. If I render only the first 1000 points of this DataFrame, the LocalCluster time is ~100 ms, but with the LocalCudaCluster it's ~220 ms. So perhaps there's some optimization that could be done in dask-cudf to bring this overhead more in line with dask.dataframe when used with the distributed scheduler.
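
Roughly, the comparison looks like the sketch below; the cluster setup, dataset size, and column names are illustrative assumptions rather than the exact benchmark code. For the CPU comparison, swap LocalCUDACluster for dask.distributed.LocalCluster and use a pandas-backed dask DataFrame instead.

```python
import time
import numpy as np
import cudf
import dask_cudf
import datashader as ds
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

cluster = LocalCUDACluster()      # one worker per visible GPU
client = Client(cluster)

# Made-up data, much smaller than the ~200M-row case described above.
n = 20_000_000
gdf = cudf.DataFrame({"x": np.random.standard_normal(n),
                      "y": np.random.standard_normal(n),
                      "val": np.random.standard_normal(n)})
gddf = dask_cudf.from_cudf(gdf, npartitions=2).persist()

cvs = ds.Canvas(plot_width=900, plot_height=600)

# Run the aggregation twice so numba's JIT compilation only affects the first run.
for _ in range(2):
    t0 = time.perf_counter()
    cvs.points(gddf, "x", "y", agg=ds.count())
    print(f"{(time.perf_counter() - t0) * 1000:.0f} ms")
```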

@kkraus14

Hey all, cuDF maintainer here, I'd just like to mirror @pzwang's comment that this is extremely exciting!

When I push my 2 RTX 2080s to the limit, I can get almost 200 million rows of a 3-column DataFrame persisted. And this does indeed render in about 1/4 the time of a two-worker CPU Dask LocalCluster (~1000 ms on the CPU cluster vs. ~260 ms on the GPUs).

Even so, it looks like there's a larger constant overhead for the LocalCudaCluster compared to the normal LocalCluster. If I render only the first 1000 points of this DataFrame, the LocalCluster time is ~100 ms, but with the LocalCudaCluster it's ~220 ms. So perhaps there's some optimization that could be done in dask-cudf to bring this overhead more in line with dask.dataframe when used with the distributed scheduler.

Is there any way you can run the computation twice to see if the second run is much faster? You may be hitting an operation that we currently JIT compile using Numba, and the JIT compilation overhead is typically ~200 ms. If it's not JIT compilation, would it be possible to dump a cProfile that I can get some eyes on to see where we're currently spending our time? There are some big optimizations coming to cuDF with respect to avoiding a lot of slow Python control flow, and I'm curious whether you're hitting that.
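
For reference, one way to capture such a profile; this is only a sketch that assumes cvs, gddf, and ds from a setup like the sketch above, the output filename is a placeholder, and a client-side profile will of course not show time spent inside the dask workers:

```python
import cProfile
import pstats

pr = cProfile.Profile()
pr.enable()
cvs.points(gddf, "x", "y", agg=ds.count())   # the dask_cudf call being investigated
pr.disable()

pr.dump_stats("datashader_dask_cudf.prof")   # file that can be shared for inspection
pstats.Stats(pr).sort_stats("cumulative").print_stats(20)
```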

@jonmmease
Collaborator Author

Hi @kkraus14, thanks for chiming in! The profiling I'm doing is running everything multiple times to give numba a chance to JIT everything since Datashader itself already uses numba extensively.

And to be clear, I'm seeing really impressive/exciting speedup when operating directly on a single cudf.DataFrame (See benchmarks in the top issue post). The place where I'm seeing some overhead that looks a bit larger than I would expect is only when operating on dask_cudf.DataFrame instances spread across multiple GPUs using a LocalCUDACluster.

We don't need to fully investigate this area before merging this PR, but I'll see if I can create an MWE that doesn't depend on Datashader and shows what I'm talking about.

@jbednar
Member

jbednar commented Oct 19, 2019

I'll go ahead and merge this, as we don't have an imminent release; hopefully other people will be able to test it soon!

@jbednar
Member

jbednar commented Dec 8, 2019

Right; I think a full polygon implementation will be much more efficient than a trimesh-based implementation, and we are already working on parts of that.
Fantastic, rough ETA on that?

The polygon implementation is ready! It's in #826 and should be merged very soon now that the GPU support has been released.

@jbednar
Member

jbednar commented Dec 11, 2019

For anyone wondering, I think this is the current status of support for the various data backends in Datashader for drawing the various glyph types available:

[image: table of glyph types vs. supported data backends]

I'll try to find somewhere to put that in the docs.

Supporting GPU raster, trimesh, and quadmesh types should already be feasible, but those glyphs each use different rendering code that has to be implemented for GPUs separately, so it will take us a good bit of effort. #826 has now been merged, adding choropleth (polygon) and outline (multiline) rendering to Datashader, but further extending that to work with the GPU has to wait on support for Pandas ExtensionArrays in cuDF.

@kkraus14 and @exactlyallan, I don't know if NVIDIA is planning to support ExtensionArrays in cuDF, but it would open up some very cool applications! E.g. we've currently got a million-polygon choropleth dataset that takes 49 minutes to render one frame on the CPU with Datashader, which would surely get a big boost from the GPU.

We'd also love any help you can give with the LocalCudaCluster overhead issue, which I guess is waiting on an MRE from @jonmmease.
