Proposal: "Datashade" acceleration of Distribution, and Bivariate elements #3954

Open
jonmmease opened this issue Sep 12, 2019 · 7 comments

@jonmmease
Collaborator

This proposal comes from thinking about how to support the Distribution and Bivariate elements at Datashader scale (e.g. hundreds of millions of points).

For Distribution and Bivariate, unlike the elements that already have datashader support (Scatter, Curve, etc.), the challenge isn't the amount of data that would otherwise be sent to the front-end. The challenge is the computational expense of performing the Gaussian kernel aggregation, and the fact that it is desirable to increase the resolution dynamically based on the zoom level.

I'm not sure what the API should look like, but what if the result of datashade(hv.Bivariate(...)) were a DynamicMap that returns a Contours element? This Contours element would be computed as follows (sketched in code below the steps):

  1. Use datashader's Canvas.points method to aggregate the input data into a single-channel image (the rasterize operation?)
  2. Convolve a Gaussian kernel of the desired bandwidth over that image (the convolve operation with a Gaussian kernel?)
  3. Generate the Contours element from the smoothed image (the contours operation?)
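A rough sketch of those three steps using pieces that already exist (the rasterize and contours operations, with scipy.ndimage as a stand-in for the Gaussian kernel convolution). The bivariate_contours helper name is just illustrative, and a real implementation would presumably wrap this in a DynamicMap so it re-runs on pan/zoom:

```python
import numpy as np
import holoviews as hv
from holoviews.operation.datashader import rasterize   # step 1: Canvas.points aggregation
from holoviews.operation import contours               # step 3: contour generation
from scipy.ndimage import gaussian_filter               # step 2: Gaussian convolution

def bivariate_contours(points, width=400, height=400, sigma=3, levels=10):
    # 1. Aggregate the raw points into a single-channel count image
    agg = rasterize(points, width=width, height=height, dynamic=False)
    xs = agg.dimension_values(0, expanded=False)
    ys = agg.dimension_values(1, expanded=False)
    counts = agg.dimension_values(2, flat=False).astype(float)
    # 2. Convolve a Gaussian kernel (bandwidth in pixels) over the aggregate
    smoothed = hv.Image((xs, ys, gaussian_filter(counts, sigma=sigma)),
                        kdims=agg.kdims, vdims=agg.vdims)
    # 3. Turn the smoothed density image into a Contours element
    return contours(smoothed, levels=levels)

points = hv.Points(np.random.multivariate_normal([0, 0], [[1, 0.7], [0.7, 1]], 1_000_000))
plot = bivariate_contours(points)
```
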
@jbednar
Member

jbednar commented Sep 12, 2019

Nice idea!

For Bivariate, the proposal makes good sense to me -- it trades off a tiny loss in precision in the location of each data point (essentially snapping each data point to a grid) for a potentially huge reduction in computational cost for summing all the kernels. Given that this loss in precision is related to the pixel size and not the data, arbitrarily high precision can still be obtained by zooming in, so the final result is very much in keeping with the Datashader ethos. It also doesn't seem particularly difficult to implement.

The operation here would be rasterize and not datashade, as there isn't any colormapping, but it hurts my head a little to think about precisely how to express it. My guess is hv.Bivariate(rasterize(hv.Points(...))), given that we need to rasterize first and then compute the smoothed density estimate, not the other way around. But Bivariate would then have to detect whether its input is a 2D Image and do the raster-based convolution instead of the usual point-by-point kernel summing.

For Distribution, all the same factors apply, except that it's now a one-dimensional aggregation, which is not directly supported by Datashader. It can be faked by adding a dummy axis with a constant value, but it's more awkward than the bivariate case, particularly given that the rasterize() operation would create this nominally 2D but really 1D aggregate, and then Distribution would have to work with the data in it under that assumption. Seems tricky!

@philippjfr
Member

The operation chain would be bivariate_kde(rasterize(hv.Bivariate(...))) for 2d data. With a sufficiently large num_bins value we could also express Distribution as univariate_kde(histogram(hv.Distribution(...))).

@philippjfr
Member

The main question for both cases is how to choose a reasonable value for the number of bins to avoid losing too much precision.

@philippjfr
Member

Hmm, actually neither proposal that uses the kde operations will work, because the kde operations always compute a density and do not weight by bin value. Simple convolutions would be the only way to get approximations of the density estimate.
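A minimal sketch of that convolution-based approximation for the 1D (Distribution) case, using plain NumPy/SciPy on the binned counts rather than any existing HoloViews operation; the approx_density_1d helper is hypothetical:

```python
import numpy as np
import holoviews as hv
from scipy.ndimage import gaussian_filter1d

def approx_density_1d(values, num_bins=400, bandwidth=0.2):
    """Approximate a 1D KDE by binning first, then convolving the
    bin counts with a Gaussian kernel (hypothetical helper)."""
    counts, edges = np.histogram(values, bins=num_bins)
    centers = (edges[:-1] + edges[1:]) / 2
    bin_width = edges[1] - edges[0]
    # Convolve the raw counts (so each bin is weighted by its value),
    # then normalize so the curve integrates to ~1 like a true KDE.
    smoothed = gaussian_filter1d(counts.astype(float), sigma=bandwidth / bin_width)
    density = smoothed / (counts.sum() * bin_width)
    return hv.Curve((centers, density), 'Value', 'Density')

curve = approx_density_1d(np.random.randn(10_000_000))
```
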

@jbednar
Member

jbednar commented Sep 13, 2019

Seems like the number of bins can simply scale with the plot size in pixels, as there's no benefit to having it be larger than the plot size. Given the inherent smoothing, it can probably be some fraction of the plot size, but that can be decided empirically and probably would never need to be touched after that. The aggregation time would dominate for large enough data, so we wouldn't have to push hard on reducing the number of bins.
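A trivial sketch of that heuristic, with bin_fraction as a hypothetical, empirically chosen constant:

```python
def default_num_bins(plot_size_px, bin_fraction=0.5):
    # Hypothetical heuristic: some fraction of the plot size in pixels,
    # since there is no benefit to having more bins than pixels.
    return max(1, int(round(bin_fraction * plot_size_px)))
```
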

@jonmmease
Collaborator Author

jonmmease commented Sep 13, 2019

I actually got pretty close by chaining the current rasterize, convolve, and contours operations:

kernel = hv.Image(kernel_values, ...)
contours(convolve(rasterize(points) * kernel))

[screenshot: resulting bivariate contour plot (holoviews_bivariate)]

The only issue is that convolve seems to cause some wrapping artifacts around the edges. It's also less convenient to have to pre-construct the kernel as a separate Image object. So maybe all that's needed is a new gaussian_smooth operation (https://en.wikipedia.org/wiki/Gaussian_blur) that would construct the kernel internally (with parameterized bandwidths) and perform the convolution with some padding to avoid the wrap-around artifacts.

Then the following would work:

contours(gaussian_smooth(rasterize(points)))
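A minimal sketch of what such a gaussian_smooth operation might look like, assuming scipy.ndimage.gaussian_filter for the convolution (whose zero-padded boundary handling avoids the wrap-around artifacts); the class and its sigma parameter are illustrative, not an existing HoloViews API:

```python
import param
import holoviews as hv
from holoviews.operation import Operation
from scipy.ndimage import gaussian_filter

class gaussian_smooth(Operation):
    """Hypothetical operation: Gaussian-blur a rasterized Image."""

    sigma = param.Number(default=2, bounds=(0, None), doc="""
        Bandwidth of the Gaussian kernel, in pixels of the input Image.""")

    def _process(self, element, key=None):
        xs = element.dimension_values(0, expanded=False)
        ys = element.dimension_values(1, expanded=False)
        zs = element.dimension_values(2, flat=False).astype(float)
        # mode='constant' pads with zeros instead of wrapping, so the
        # convolution does not bleed from one edge to the opposite one
        smoothed = gaussian_filter(zs, sigma=self.p.sigma, mode='constant')
        return hv.Image((xs, ys, smoothed), kdims=element.kdims, vdims=element.vdims)
```

With an operation along those lines, contours(gaussian_smooth(rasterize(points), sigma=3)) would read exactly like the chain above.
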

One last consideration is whether this smoothing should be performed only on the points that fall inside the viewport, or whether a buffer outside of the viewport should also be considered, so that the density around the edges of the viewport takes the larger dataset into account. The latter is probably more correct, but it would require rasterize to preserve some extra data outside of the viewport.

Maybe, in this case, rasterize itself could have an option to perform the smoothing -- a blur_sigma parameter, say. If it is set, rasterize would take care of calling datashader with a slightly larger Canvas (e.g. a 2*blur_sigma pixel buffer), then perform the convolution and trim the result back to the requested viewport.
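A rough sketch of how that buffering could work, written here as a standalone helper on top of rasterize's x_range/y_range/width/height parameters and the hypothetical gaussian_smooth operation sketched above, rather than as a real blur_sigma parameter on rasterize:

```python
from holoviews.operation.datashader import rasterize

def rasterize_smoothed(points, x_range, y_range, width=400, height=400, blur_sigma=2):
    # Pad the requested viewport by ~2*blur_sigma pixels on every side so the
    # density near the edges still sees points just outside the viewport.
    buf = int(round(2 * blur_sigma))
    xpad = buf * (x_range[1] - x_range[0]) / width
    ypad = buf * (y_range[1] - y_range[0]) / height
    padded = rasterize(points,
                       x_range=(x_range[0] - xpad, x_range[1] + xpad),
                       y_range=(y_range[0] - ypad, y_range[1] + ypad),
                       width=width + 2 * buf, height=height + 2 * buf,
                       dynamic=False)
    # gaussian_smooth is the hypothetical operation sketched earlier
    smoothed = gaussian_smooth(padded, sigma=blur_sigma)
    # Trim back to the originally requested viewport by slicing the kdims
    return smoothed[x_range[0]:x_range[1], y_range[0]:y_range[1]]
```
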

@jbednar
Member

jbednar commented Sep 13, 2019

That's a lot to think about!

Automatically smoothing Datashader output is a useful idea in general, and in fact it's highly related to the spreading support that I recently prototyped for Datashader aggregates (holoviz/datashader#771), which has maybe the opposite goal but potentially the same implementation. I.e., spreading is designed to make individual data points visible by convolving with a flat kernel, and here the goal is to smooth out lumpy distributions by convolving with a smooth kernel. Both of them seem like valuable things to offer, supporting different goals of the user.

I agree that it makes sense to do this at the rasterize operation level because of the buffering issue. I don't think any trimming would necessarily be needed; it seems like the resulting array could just be situated at the indicated viewport, with BokehJS automatically cropping off what's not visible around the edges. The buffer area would potentially be visible in a static export or when panning, but that seems OK to me.

Note that rasterize is itself a high-level operation, delegating to aggregate() or regrid(), and in this case I'd propose that rasterize(hv.Bivariate(...)) delegate to a new underlying operation that computes the appropriate kernel, computes the buffer size, calls aggregate with the appropriate size, and calls convolve on the result. That way users could use rasterize(hv.Bivariate(...)) the same way that they use rasterize on any other type.
