Proposal: "Datashade" acceleration of Distribution, and Bivariate elements #3954

Open
jonmmease opened this issue Sep 12, 2019 · 7 comments

@jonmmease
Collaborator

This proposal comes from thinking about how to support the Distribution and Bivariate elements at Datashader scale (e.g. hundreds of millions of points).

For Distribution and Bivariate, unlike the elements that already have datashader support (Scatter, Curve, etc.), the challenge isn't the amount of data that would otherwise be sent to the front-end. The challenge is the computational expense of performing the Gaussian kernel aggregation, and the fact that it is desirable to increase the resolution dynamically based on the zoom level.

I'm not sure what the API should look like, but what if the result of datashade(hv.Bivariate(...)) were a DynamicMap that returns a Contours element? This Contours element would be computed as follows (sketched in code below the steps):

  1. Use datashader's Canvas.points method to aggregate the input data into a single-channel image (the rasterize operation?)
  2. Convolve a Gaussian kernel of the desired bandwidth over that image (the convolve operation with a Gaussian kernel?)
  3. Generate the Contours element from the smoothed image (the contours operation?)
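A rough sketch of those three steps using pieces that already exist (the rasterize and contours operations, with scipy.ndimage as a stand-in for the Gaussian kernel convolution). The bivariate_contours helper name is just illustrative, and a real implementation would presumably wrap this in a DynamicMap so it re-runs on pan/zoom:

```python
import numpy as np
import holoviews as hv
from holoviews.operation.datashader import rasterize   # step 1: Canvas.points aggregation
from holoviews.operation import contours               # step 3: contour generation
from scipy.ndimage import gaussian_filter               # step 2: Gaussian convolution

def bivariate_contours(points, width=400, height=400, sigma=3, levels=10):
    # 1. Aggregate the raw points into a single-channel count image
    agg = rasterize(points, width=width, height=height, dynamic=False)
    xs = agg.dimension_values(0, expanded=False)
    ys = agg.dimension_values(1, expanded=False)
    counts = agg.dimension_values(2, flat=False).astype(float)
    # 2. Convolve a Gaussian kernel (bandwidth in pixels) over the aggregate
    smoothed = hv.Image((xs, ys, gaussian_filter(counts, sigma=sigma)),
                        kdims=agg.kdims, vdims=agg.vdims)
    # 3. Turn the smoothed density image into a Contours element
    return contours(smoothed, levels=levels)

points = hv.Points(np.random.multivariate_normal([0, 0], [[1, 0.7], [0.7, 1]], 1_000_000))
plot = bivariate_contours(points)
```
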
@jbednar
Member

jbednar commented Sep 12, 2019

Nice idea!

For Bivariate, the proposal makes good sense to me -- it trades off a tiny loss in precision in the location of each data point (essentially snapping each data point to a grid) for a potentially huge reduction in computational cost for summing all the kernels. Given that this loss in precision is related to the pixel size and not the data, arbitrarily high precision can still be obtained by zooming in, so the final result is very much in keeping with the Datashader ethos. It also doesn't seem particularly difficult to implement.

The operation here would be rasterize and not datashade, as there isn't any colormapping, but it hurts my head a little to think about precisely how to express it. My guess is hv.Bivariate(rasterize(hv.Points(...))), given that we need to rasterize first and then compute the smoothed density estimate, not the other way around. But Bivariate would then have to detect whether its input is a 2D Image and do the raster-based convolution instead of the usual point-by-point kernel summing.

For Distribution, all the same factors apply, except that it's now a one-dimensional aggregation, which is not directly supported by Datashader. It can be faked by adding a dummy axis with a constant value, but it's more awkward than the bivariate case, particularly given that the rasterize() operation would create this nominally 2D but really 1D aggregate, and then Distribution would have to work with the data in it under that assumption. Seems tricky!

@philippjfr
Member

The operation chain would be bivariate_kde(rasterize(hv.Bivariate(...))) for 2d data. With a sufficiently large num_bins value we could also express Distribution as univariate_kde(histogram(hv.Distribution(...))).

@philippjfr
Member

The main question for both cases is how to choose a reasonable value for the number of bins to avoid losing too much precision.

@philippjfr
Member

Hmm, actually neither proposal that uses the kde operations will work, because the kde operations always compute a density and do not weight by bin value. Simple convolutions would be the only way to get approximations of the density estimate.
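A minimal sketch of that convolution-based approximation for the 1D (Distribution) case, using plain NumPy/SciPy on the binned counts rather than any existing HoloViews operation; the approx_density_1d helper is hypothetical:

```python
import numpy as np
import holoviews as hv
from scipy.ndimage import gaussian_filter1d

def approx_density_1d(values, num_bins=400, bandwidth=0.2):
    """Approximate a 1D KDE by binning first, then convolving the
    bin counts with a Gaussian kernel (hypothetical helper)."""
    counts, edges = np.histogram(values, bins=num_bins)
    centers = (edges[:-1] + edges[1:]) / 2
    bin_width = edges[1] - edges[0]
    # Convolve the raw counts (so each bin is weighted by its value),
    # then normalize so the curve integrates to ~1 like a true KDE.
    smoothed = gaussian_filter1d(counts.astype(float), sigma=bandwidth / bin_width)
    density = smoothed / (counts.sum() * bin_width)
    return hv.Curve((centers, density), 'Value', 'Density')

curve = approx_density_1d(np.random.randn(10_000_000))
```
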

@jbednar
Member

jbednar commented Sep 13, 2019

Seems like the number of bins can simply scale with the plot size in pixels, as there's no benefit to having it be larger than the plot size. Given the inherent smoothing, it can probably be some fraction of the plot size, but that can be decided empirically and probably would never need to be touched after that. The aggregation time would dominate for large enough data, so we wouldn't have to push hard on reducing the number of bins.
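A trivial sketch of that heuristic, with bin_fraction as a hypothetical, empirically chosen constant:

```python
def default_num_bins(plot_size_px, bin_fraction=0.5):
    # Hypothetical heuristic: some fraction of the plot size in pixels,
    # since there is no benefit to having more bins than pixels.
    return max(1, int(round(bin_fraction * plot_size_px)))
```
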

@jonmmease
Collaborator Author

jonmmease commented Sep 13, 2019

I actually got pretty close by chaining the current rasterize, convolve, and contours operations:

kernel = hv.Image(kernel_values, ...)
contours(convolve(rasterize(points) * kernel))

[screenshot: resulting bivariate contour plot (holoviews_bivariate)]

The only issue is that convolve seems to cause some wrapping artifacts around the edges. It's also less convenient to have to pre-construct the kernel as a separate Image object. So maybe all that's needed is a new gaussian_smooth operation (https://en.wikipedia.org/wiki/Gaussian_blur) that would construct the kernel internally (with parameterized bandwidths) and perform the convolution with some padding to avoid the wrap-around artifacts.

Then the following would work:

contours(gaussian_smooth(rasterize(points)))
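A minimal sketch of what such a gaussian_smooth operation might look like, assuming scipy.ndimage.gaussian_filter for the convolution (whose zero-padded boundary handling avoids the wrap-around artifacts); the class and its sigma parameter are illustrative, not an existing HoloViews API:

```python
import param
import holoviews as hv
from holoviews.operation import Operation
from scipy.ndimage import gaussian_filter

class gaussian_smooth(Operation):
    """Hypothetical operation: Gaussian-blur a rasterized Image."""

    sigma = param.Number(default=2, bounds=(0, None), doc="""
        Bandwidth of the Gaussian kernel, in pixels of the input Image.""")

    def _process(self, element, key=None):
        xs = element.dimension_values(0, expanded=False)
        ys = element.dimension_values(1, expanded=False)
        zs = element.dimension_values(2, flat=False).astype(float)
        # mode='constant' pads with zeros instead of wrapping, so the
        # convolution does not bleed from one edge to the opposite one
        smoothed = gaussian_filter(zs, sigma=self.p.sigma, mode='constant')
        return hv.Image((xs, ys, smoothed), kdims=element.kdims, vdims=element.vdims)
```

With an operation along those lines, contours(gaussian_smooth(rasterize(points), sigma=3)) would read exactly like the chain above.
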

One last consideration is whether this smoothing should be performed only on the points that fall inside the viewport, or whether a buffer outside of the viewport should also be considered, so that the density around the edges of the viewport takes the larger dataset into account. The latter is probably more correct, but it would require rasterize to preserve some extra data outside of the viewport.

Maybe, in this case, rasterize itself could have an option to perform the smoothing -- a blur_sigma parameter, say. If it is set, rasterize would take care of calling datashader with a slightly larger Canvas (e.g. a 2*blur_sigma pixel buffer), then perform the convolution and trim the result back to the requested viewport.
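A rough sketch of how that buffering could work, written here as a standalone helper on top of rasterize's x_range/y_range/width/height parameters and the hypothetical gaussian_smooth operation sketched above, rather than as a real blur_sigma parameter on rasterize:

```python
from holoviews.operation.datashader import rasterize

def rasterize_smoothed(points, x_range, y_range, width=400, height=400, blur_sigma=2):
    # Pad the requested viewport by ~2*blur_sigma pixels on every side so the
    # density near the edges still sees points just outside the viewport.
    buf = int(round(2 * blur_sigma))
    xpad = buf * (x_range[1] - x_range[0]) / width
    ypad = buf * (y_range[1] - y_range[0]) / height
    padded = rasterize(points,
                       x_range=(x_range[0] - xpad, x_range[1] + xpad),
                       y_range=(y_range[0] - ypad, y_range[1] + ypad),
                       width=width + 2 * buf, height=height + 2 * buf,
                       dynamic=False)
    # gaussian_smooth is the hypothetical operation sketched earlier
    smoothed = gaussian_smooth(padded, sigma=blur_sigma)
    # Trim back to the originally requested viewport by slicing the kdims
    return smoothed[x_range[0]:x_range[1], y_range[0]:y_range[1]]
```
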

@jbednar
Member

jbednar commented Sep 13, 2019

That's a lot to think about!

Automatically smoothing Datashader output is a useful idea in general, and in fact it's highly related to the spreading support that I recently prototyped for Datashader aggregates (holoviz/datashader#771), which has maybe the opposite goal but potentially the same implementation. I.e., spreading is designed to make individual data points visible by convolving with a flat kernel, and here the goal is to smooth out lumpy distributions by convolving with a smooth kernel. Both of them seem like valuable things to offer, supporting different goals of the user.

I agree that it makes sense to do this at the rasterize operation level because of the buffering issue. I don't think any trimming would necessarily be needed; it seems like the resulting array could just be situated at the indicated viewport, with BokehJS automatically cropping off what's not visible around the edges. The buffer area would potentially be visible in a static export or when panning, but that seems OK to me.

Note that rasterize is itself a high-level operation, delegating to aggregate() or regrid(), and in this case I'd propose that rasterize(hv.Bivariate(...)) delegate to a new underlying operation that computes the appropriate kernel, computes the buffer size, calls aggregate with the appropriate size, and calls convolve on the result. That way users could use rasterize(hv.Bivariate(...)) the same way that they use rasterize on any other type.
