
Optimization for heatmap aggregation with pandas #1174

Merged
merged 3 commits into master from heatmap_agg_speedup on Mar 5, 2017

Conversation

philippjfr
Member

@philippjfr philippjfr commented Mar 5, 2017

A 5-10x speedup for HeatMap aggregation when using a pandas/dask interface.

import numpy as np
import holoviews as hv

data = [(i, j, np.random.rand()) for i in range(500) for j in range(500)]

%%timeit
hv.HeatMap(data)

Before:

1 loop, best of 3: 8.61 s per loop

After:

1 loop, best of 3: 800 ms per loop
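The substance of the change is routing HeatMap aggregation through pandas' vectorized operations rather than per-cell Python work. A minimal sketch of the idea (the function and column names here are hypothetical illustrations, not the actual HoloViews code):

```python
import pandas as pd

# Hypothetical sketch: aggregating duplicate (x, y) samples with a single
# vectorized groupby pass instead of looping over grid cells in Python.
def aggregate_heatmap(df, reduce_fn='mean'):
    # Group by the two key dimensions and reduce the value dimension.
    return df.groupby(['x', 'y'], sort=False)['z'].aggregate(reduce_fn)

df = pd.DataFrame({'x': [0, 0, 1, 1],
                   'y': [0, 0, 0, 1],
                   'z': [1.0, 3.0, 5.0, 7.0]})
agg = aggregate_heatmap(df)
# the duplicate (0, 0) samples are averaged: (1.0 + 3.0) / 2 -> 2.0
```

Because the groupby runs in C rather than in a Python loop, this kind of rewrite typically accounts for speedups of the order reported above.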

@jbednar
Member

jbednar commented Mar 5, 2017

Looks good to me.

@jlstevens
Contributor

I find DF_INTERFACES a bit ugly but otherwise it looks good.

@philippjfr
Member Author

Any naming you would prefer?

@jlstevens
Contributor

Could the interfaces not declare the sort of APIs they support?

@philippjfr
Member Author

philippjfr commented Mar 5, 2017

Could the interfaces not declare the sort of APIs they support?

This is bypassing the interface API, because pandas/dask have optimized implementations for this operation.

@jlstevens
Contributor

To clarify: shouldn't the interfaces declare the third party APIs (dataframe-like, array-like) that the data supports? This is also the API assumed by the interface class itself.

@jlstevens
Contributor

jlstevens commented Mar 5, 2017

Something like:

if 'Dataframe' in reindexed.interface.interface_type:
  ...

@philippjfr
Member Author

shouldn't the interfaces declare the third party APIs (dataframe-like, array-like) that the data supports?

Sure they could, although "dataframe-like" isn't a particularly solid guarantee of how similar and extensive the API is.

@jlstevens
Contributor

I think the same can be argued for the current approach. For instance, maybe only dask and pandas should claim to support the same interface type. I think such a mechanism is cleaner than building a list based on successful/failing imports.
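The mechanism being suggested could be sketched like this (all class and attribute names here are hypothetical illustrations of the proposal, not HoloViews' actual classes):

```python
# Hypothetical sketch: each interface declares which third-party API
# family its backing data exposes, so dispatch can test a declared
# attribute instead of a list built from successful/failing imports.
class Interface:
    datatype = None
    interface_type = None  # e.g. 'dataframe', 'array', 'dictionary'

class PandasInterface(Interface):
    datatype = 'dataframe'
    interface_type = 'dataframe'

class DaskInterface(PandasInterface):
    # dask claims the same dataframe-like API family as pandas
    datatype = 'dask'

def is_dataframe_backed(interface):
    return interface.interface_type == 'dataframe'
```

Under this scheme, only interfaces that explicitly claim the `'dataframe'` API family take the optimized code path, regardless of which optional packages happened to import.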

@jlstevens
Contributor

jlstevens commented Mar 5, 2017

In addition, you might want to always have 'dask' ahead of 'dataframe' in datatypes if the former can be considered a more highly optimized version of the latter. If dask isn't installed, it won't be used.

@philippjfr
Member Author

philippjfr commented Mar 5, 2017

In addition, you might want to always have 'dask' ahead of 'dataframe' in datatypes if the former can be considered a more highly optimized version of the latter.

That's not really a safe assumption: by default dask will simply use a single partition and may therefore actually be slower than pandas, and its lazy nature is probably a bit surprising/confusing for a user who doesn't know anything about it.

@jlstevens
Contributor

jlstevens commented Mar 5, 2017

Ok, sure: that suggestion was orthogonal to the original one about declaring the type of data interface anyhow.

@philippjfr
Member Author

Okay, I did two things in the end: I got rid of DF_INTERFACES and added a copy keyword argument to the Dataset.dframe method, which lets you avoid making copies if the data is already a dataframe.
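The intended semantics of the new keyword could be sketched as follows (a minimal standalone illustration, not the actual `Dataset.dframe` implementation):

```python
import pandas as pd

# Hypothetical sketch of the copy keyword: when the data is already a
# DataFrame, copy=False hands back the underlying object instead of
# duplicating it; other data types are converted, which is inherently
# a fresh object regardless of the flag.
def dframe(data, copy=True):
    if isinstance(data, pd.DataFrame):
        return data.copy() if copy else data
    return pd.DataFrame(data)

df = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})
assert dframe(df, copy=False) is df        # no copy made
assert dframe(df, copy=True) is not df     # independent copy
```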

@@ -516,13 +516,17 @@ def get_dimension_type(self, dim):
         return self.interface.dimension_type(self, dim_obj)


-    def dframe(self, dimensions=None):
+    def dframe(self, dimensions=None, copy=True):
Contributor

I'm not sure the copy argument makes much sense if the element isn't already using a dataframe-based interface - for other interfaces, don't you always have to create a new dataframe, which would be the same as copy being fixed to True?

Member Author

That's true, it's more like avoid_copy, but I think providing a consistent API to get hold of a dataframe with the minimal amount of overhead is useful.

Member Author

That said, I'd also be fine having a utility for it instead.

@@ -134,7 +139,13 @@ def _aggregate_dataset(self, obj, xcoords, ycoords):
         dtype = 'dataframe' if pd else 'dictionary'
         dense_data = Dataset(data, kdims=obj.kdims, vdims=obj.vdims, datatype=[dtype])
         concat_data = obj.interface.concatenate([dense_data, obj], datatype=[dtype])
-        agg = concat_data.reindex([xdim, ydim], vdims).aggregate([xdim, ydim], reduce_fn)
+        reindexed = concat_data.reindex([xdim, ydim], vdims)
+        if pd:
Contributor

Why not use reindexed.interface.dframe(dimensions=None, copy=False) instead of exposing the copy keyword argument at the element level? For copy=False to work, you are already assuming a dataframe type interface is being used...

Contributor

I suppose the other thing you could do is complain if copy=False is passed to the dframe method of any interface that isn't based on dataframes.

Member Author

For copy=False to work, you are already assuming a dataframe type interface is being used...

Because then I need conditional branches for the "is already a dataframe" and "convert to dataframe" paths again. I guess I agree copy is confusing, because you might assume that mutating the dataframe would affect the original element if you don't make a copy, when the real point of it is to avoid making pointless copies.
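The aliasing hazard being discussed here is easy to demonstrate in plain pandas (a standalone illustration, independent of the HoloViews API):

```python
import pandas as pd

df = pd.DataFrame({'z': [1.0, 2.0]})

alias = df            # what a copy-avoiding dframe would hand back
snapshot = df.copy()  # what copy=True would hand back

# Mutating the alias writes through to the original DataFrame,
# while the true copy is unaffected.
alias['z'] = 0.0
```

This is exactly why returning the underlying dataframe without a copy can surprise a caller who then assigns to it, as in the usage described below.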

Contributor

Would there be any harm with the dataframe interfaces just avoiding pointless copies automatically? Then it doesn't have to be something the user needs to ever think about...

Member Author

In my usage of dframe I often create it and then assign to it, so that would be a bit of a pain.

@jlstevens
Contributor

I feel the new approach using a utility is much nicer, thanks!

Tests are passing now. Merging.

@jlstevens jlstevens merged commit 5d90c72 into master Mar 5, 2017
@philippjfr philippjfr deleted the heatmap_agg_speedup branch April 11, 2017 12:30

This pull request has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 25, 2024