-
-
Notifications
You must be signed in to change notification settings - Fork 402
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proposal: "Datashade" acceleration of Distribution, and Bivariate elements #3954
Comments
Nice idea! For Bivariate, the proposal makes good sense to me -- it trades off a tiny loss in precision in the location of each data point (essentially snapping each data point to a grid) for a potentially huge reduction in computational cost for summing all the kernels. Given that this loss in precision is related to the pixel size and not the data, arbitrarily high precision can still be obtained by zooming in, so the final result is very much in keeping with the Datashader ethos. It also doesn't seem particularly difficult to implement. The operation here would be For Distribution, all the same factors apply, except that it's now a one-dimensional aggregation, which is not directly supported by Datashader. It can be faked by having an extra dummy axis with a constant value, but it's more awkward than the bivariate case, particularly given that the rasterize() operation would create this dummy 2D but really 1D aggregate, and then Bivariate would have to work with the data in it under that assumption. Seems tricky! |
The operation chain would be |
The main question for both cases is how to choose a reasonable value for the number of bins to avoid losing too much precision. |
Hmm actually neither proposal which uses the kde operations will work because kde operations always compute density and do not weight by bin value. Simple convolutions would be the only way to get approximations of the density estimate. |
Seems like the number of bins can simply scale by the plot size in pixels, as there's no benefit to having it be larger than the plot size. Given the inherent smoothing, it can probably be some fraction of the plot size, but if so that can be decided empirically and probably would never need to be messed with after that. The aggregation time would dominate for large enough data, so we wouldn't have to push hard on reducing the number of bins. |
I actually got pretty close by chaining the current kernel = hv.Image(kernel_values, ...)
contours(convolve(rasterize(points) * kernel)) The only issue is that Then the following would work contours(gaussian_smooth(rasterize(points))) One last consideration is whether this smoothing should be performed only on those points that fall inside the viewport, or whether a buffer outside of the view port should also be considered so that the density around the edges of the viewport takes the larger dataset into account. This is probably more correct, but it would require Maybe, in this case, |
That's a lot to think about! Automatically smoothing Datashader output is a useful idea in general, and in fact it's highly related to the spreading support that I recently prototyped for Datashader aggregates (holoviz/datashader#771), which has maybe the opposite goal but potentially the same implementation. I.e., spreading is designed to make individual data points visible by convolving with a flat kernel, and here the goal is to smooth out lumpy distributions by convolving with a smooth kernel. Both of them seem like valuable things to offer, supporting different goals of the user. I agree that it makes sense to do this at the rasterize operation level because of the buffering issue. I don't think any trimming would be needed necessarily; seems like it could just be situating the resulting array at the indicated viewport, with BokehJS automatically cropping off what's not visible around the edges. The buffer area would be potentially visible in a static export when panning, but that seems ok to me. Note that rasterize is itself a high-level operation, delegating to aggregate() or regrid(), and in this case I'd propose that |
This proposal comes from thinking about how to support the
Distribution
andBivarate
elements at Datashader scale (e.g. hundreds of millions of points).Unlike the current elements that have datashader support (
Scatter
,Curve
, etc.), the challenge withDistribution
andBivariate
isn't the amount of data that would otherwise be sent to the front-end. The challenge is the computational expense of performing the Gaussian kernel aggregation, and the fact that it is desirable to dynamically increase the resolution based on the zoom level.I'm not sure what the API should look like, but what if the result of
datashade(hv.Bivariate(...))
was aDynamicMap
that returns aContours
element. ThisContours
element would be computed as follows:Canvas.points
method to aggregate the input data into a single channel image (rasterize
operation?)convolve
operation with a Gaussian kernel?)Contours
element with the (contours
operation?)The text was updated successfully, but these errors were encountered: