Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize handling of wide datasets and Pandas indexes #1350

Merged
merged 15 commits into from
Jul 23, 2024

Conversation

maximlt
Copy link
Member

@maximlt maximlt commented Jun 4, 2024

By avoiding calling redim on each element of an NdOverlay which meant that every element had its own copy of the data


Edit: This PR bundles a few optimizations:

  • 1e61065: as described above, .redim is no longer called the elements of an NdOverlay so they can share the same data. This is demonstrated with the example below, the data wrapped by each element of the NdOverlay is now the same object.
import hvplot.pandas  # noqa
from bokeh.sampledata.degrees import data as deg

plot = deg.hvplot.line(x='Year', y=['Art and Performance', 'Business', 'Biology', 'Education', 'Computer Science'])
plot
image
plot['Business'].data is plot['Biology'].data  # True now :)
  • 326ee64: HoloViews added in Implement support for retaining Pandas index holoviews#6061 support for retaining Pandas indexes, which means that hvPlot no longer needs to call reset_index in a few cases, avoiding creating internal copies of the data. This change was made in 326ee64, by skipping some reset_index calls when the type of the dataset is a Pandas DataFrame (so far the Pandas Interface in HoloViews is the only one to have this support, so not Dask, not Geopandas, etc. which still require the reset_index calls). The preceding commits add various tests to make sure all the reset_index calls are at least called once in the unit tests. This commit adbddf7 updates the instantiation of a Dataset object with an explicit list of kdims instead letting HoloViews infer them, as HoloViews when given a Pandas DataFrame doesn't include its indexes in the list of automatically inferred kdims.

  • abb485a: Only redim and/or relabel a HoloViews element when needed, as suggested in Optimize handling of wide datasets and Pandas indexes  #1350 (comment) since apparently these operations return a clone even when not given any argument.

I hope these optimizations will help/fix this issue #501.

edit: closes #1292

@maximlt maximlt marked this pull request as draft June 4, 2024 15:15
Copy link

codecov bot commented Jun 4, 2024

Codecov Report

Attention: Patch coverage is 96.40288% with 5 lines in your changes missing coverage. Please review.

Project coverage is 88.87%. Comparing base (6c96c7e) to head (7df0911).
Report is 10 commits behind head on main.

Files Patch % Lines
hvplot/converter.py 91.83% 4 Missing ⚠️
hvplot/util.py 94.11% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1350      +/-   ##
==========================================
+ Coverage   87.39%   88.87%   +1.47%     
==========================================
  Files          50       51       +1     
  Lines        7490     7479      -11     
==========================================
+ Hits         6546     6647     +101     
+ Misses        944      832     -112     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@droumis
Copy link
Member

droumis commented Jun 5, 2024

Relevant to many things, but especially #1160

@droumis
Copy link
Member

droumis commented Jun 5, 2024

issue noted in meeting: "When printing the ndoverlay, the vdim is not a value label.. it's the label of the last element in the ndoverlay"

@philippjfr
Copy link
Member

With holoviz/holoviews#6262 this should now behave well.

@maximlt
Copy link
Member Author

maximlt commented Jun 10, 2024

When printing the ndoverlay, the vdim is not a value label.. it's the label of the last element in the ndoverlay

I confirm this is fixed, i.e. printing out from the code below:

import numpy as np
import pandas as pd
import holoviews as hv
hv.extension('bokeh')

df = pd.DataFrame(np.random.random((5, 4)), columns=list('ABCD'))

value_label = 'value'
group_label = 'Variable'

charts = []
for column in df.columns:
    kdims = ['index']
    vdims = hv.Dimension(column, label=value_label)
    chart = hv.Curve(df, kdims, vdims)
    charts.append((column, chart))

out = hv.NdOverlay(charts, group_label, sort=False)
out

now displays:

:NdOverlay   [Variable]
   :Curve   [index]   (value)

@maximlt
Copy link
Member Author

maximlt commented Jun 10, 2024

The test suite ran fine (https://github.com/holoviz/hvplot/actions/runs/9448341262) with the latest dev version of HoloViews that includes holoviz/holoviews#6262.

@maximlt maximlt marked this pull request as ready for review June 18, 2024 09:32
chart = element(data, kdims, vdims).redim(**{c: self.value_label})
ydim = hv.Dimension(c, label=self.value_label)
kdims, vdims = self._get_dimensions([x], [ydim])
chart = element(data, kdims, vdims)
charts.append((c, chart.relabel(**self._relabel).redim(**self._redim)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe let's also skip relabel and redim if self._relabel and self._redim are empty respectively.

@droumis
Copy link
Member

droumis commented Jul 9, 2024

ready to merge?

@maximlt maximlt marked this pull request as draft July 16, 2024 15:16
@maximlt maximlt marked this pull request as ready for review July 19, 2024 21:07
@maximlt
Copy link
Member Author

maximlt commented Jul 22, 2024

OP updated with a description of the changes made #1350 (comment). This is ready for review.

@maximlt maximlt changed the title Optimize handling of wide datasets Optimize handling of wide datasets and Pandas indexes Jul 22, 2024
@philippjfr philippjfr merged commit feffee5 into main Jul 23, 2024
9 checks passed
@philippjfr philippjfr deleted the optimize_wide_data branch July 23, 2024 10:14
@philippjfr
Copy link
Member

Great work, all looked good to me and thanks for the new tests! Let's test this as much as possible, would love to see a new alpha release for that purpose.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

No longer internally call .reset_index()
3 participants