Optimization for heatmap aggregation with pandas #1174
Looks good to me.
I find |
Any naming you would prefer?
Could the interfaces not declare the sort of APIs they support?
This is bypassing the interface API, because pandas/dask have optimized implementations for this operation.
To clarify: shouldn't the interfaces declare the third-party APIs (dataframe-like, array-like) that the data supports? This is also the API assumed by the interface class itself.
Something like: if 'Dataframe' in reindexed.interface.interface_type: ...
Sure they could, although "dataframe-like" isn't a particularly solid guarantee of how similar and extensive the API is.
I think the same can be argued for the current approach. For instance, maybe only dask and pandas should claim to support the same interface type. I think such a mechanism is cleaner than building a list based on successful/failing imports.
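To make the suggestion above concrete, here is a minimal sketch of interfaces declaring the kind of third-party API they support. Every class name here is a hypothetical stand-in, not the actual HoloViews code; only the `interface_type` attribute name is taken from the comment above, and its values are assumptions.

```python
class SketchInterface:
    # Assumed attribute: the third-party API style the wrapped data supports.
    interface_type = "generic"


class SketchPandasInterface(SketchInterface):
    interface_type = "dataframe"


class SketchDaskInterface(SketchInterface):
    interface_type = "dataframe"


class SketchArrayInterface(SketchInterface):
    interface_type = "array"


def supports_dataframe_api(interface):
    """Return True if the interface declares a dataframe-like API."""
    return interface.interface_type == "dataframe"


# The optimized branch is chosen from the declaration,
# not from whether an optional import succeeded.
fast_path = [
    cls.__name__
    for cls in (SketchPandasInterface, SketchDaskInterface, SketchArrayInterface)
    if supports_dataframe_api(cls)
]
```

With this shape, only the interfaces that explicitly claim the dataframe type end up on the fast path, which is the point made above about pandas and dask.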
In addition, you might want to always have |
That's not really a safe assumption: by default dask will simply use a single partition and may therefore actually be slower than pandas, and their lazy nature is probably a bit surprising/confusing for a user who doesn't know anything about them.
Ok, sure: that suggestion was orthogonal to the original one about declaring the type of data interface anyhow.
Okay, I did two things in the end, I got rid of |
holoviews/core/data/__init__.py
Outdated
@@ -516,13 +516,17 @@ def get_dimension_type(self, dim):
         return self.interface.dimension_type(self, dim_obj)

-    def dframe(self, dimensions=None):
+    def dframe(self, dimensions=None, copy=True):
I'm not sure the copy argument makes much sense if the element isn't already using a dataframe-based interface. For other interfaces, don't you always have to create a new dataframe, which would be the same as copy being fixed to True?
That's true, it's more like avoid_copy, but I think providing a consistent API to get hold of a dataframe with the minimal amount of overhead is useful.
That said, I'd also be fine having a utility for it instead.
@@ -134,7 +139,13 @@ def _aggregate_dataset(self, obj, xcoords, ycoords):
         dtype = 'dataframe' if pd else 'dictionary'
         dense_data = Dataset(data, kdims=obj.kdims, vdims=obj.vdims, datatype=[dtype])
         concat_data = obj.interface.concatenate([dense_data, obj], datatype=[dtype])
-        agg = concat_data.reindex([xdim, ydim], vdims).aggregate([xdim, ydim], reduce_fn)
+        reindexed = concat_data.reindex([xdim, ydim], vdims)
+        if pd:
Why not use reindexed.interface.dframe(dimensions=None, copy=False) instead of exposing the copy keyword argument at the element level? For copy=False to work, you are already assuming a dataframe type interface is being used...
I suppose the other thing you could do is complain if copy=False is passed to the dframe method of any interface that isn't based on dataframes.
"For copy=False to work, you are already assuming a dataframe type interface is being used..."
Because then I need conditional branches for the "is already dataframe" and "convert to dataframe" paths again. I guess I agree copy is confusing, because you might assume you can mutate the dataframe and have an effect on the original element if you don't make a copy, when the real point of it is to avoid making pointless copies.
Would there be any harm with the dataframe interfaces just avoiding pointless copies automatically? Then it doesn't have to be something the user needs to ever think about...
In my usage of dframe I often create it and then assign to it, so that would be a bit of a pain.
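The trade-off discussed in this thread can be illustrated with plain pandas. The helper below is a hypothetical sketch (the name `as_dframe` and its signature are assumptions, not the utility that was eventually merged): it returns the existing dataframe when `copy=False` and the data is already a dataframe, and otherwise builds a new one.

```python
import pandas as pd


def as_dframe(data, copy=True):
    """Hypothetical helper: get a DataFrame with as little copying as allowed."""
    if isinstance(data, pd.DataFrame):
        # Already a dataframe: copy only when the caller asks for it.
        return data.copy() if copy else data
    # Any other container always needs a new dataframe, so copy is irrelevant.
    return pd.DataFrame(data)


original = pd.DataFrame({"x": [1, 2], "z": [0.5, 1.5]})

# Default behaviour: assigning to the result leaves the original untouched.
safe = as_dframe(original)
safe["extra"] = 0

# copy=False: the very same object comes back, so assignment leaks through.
shared = as_dframe(original, copy=False)
shared["flag"] = True

# Non-dataframe input: a new dataframe is created either way.
from_dict = as_dframe({"x": [1, 2]}, copy=False)
```

This is why skipping the copy automatically would be awkward for callers who assign to the result: after the `copy=False` call, `original` has gained the `flag` column.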
I feel the new approach using a utility is much nicer, thanks! Tests are passing now. Merging.
A 5-10x speedup for HeatMap aggregation when using a pandas/dask interface.
Before:
1 loop, best of 3: 8.61 s per loop
After:
1 loop, best of 3: 800 ms per loop
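For readers unfamiliar with the operation, the following is a rough sketch of the kind of aggregation involved, written directly against pandas. It is not the code from this PR: the column names, the sample data, and the use of `mean` as the reduce function are all assumptions. It pads the sparse data with a dense grid of NaN values and then lets pandas' `groupby` do the reduction.

```python
import itertools

import numpy as np
import pandas as pd

# Hypothetical sparse observations keyed by (x, y), with one value column z.
sparse = pd.DataFrame({
    "x": ["a", "a", "b"],
    "y": [0, 0, 1],
    "z": [1.0, 3.0, 5.0],
})

# Dense grid of every (x, y) combination, filled with NaN, so that
# cells without observations still appear in the aggregated result.
xcoords, ycoords = ["a", "b"], [0, 1]
dense = pd.DataFrame(list(itertools.product(xcoords, ycoords)), columns=["x", "y"])
dense["z"] = np.nan

# Concatenate and reduce with groupby; mean skips NaN by default, so a
# cell holding only the NaN placeholder stays NaN.
combined = pd.concat([dense, sparse], ignore_index=True)
agg = combined.groupby(["x", "y"])["z"].mean().reset_index()
```

The result has one row per grid cell, with observed cells reduced to a single value and empty cells left as NaN.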