speed up the repr for big MultiIndex objects #4846

keewis · 2021-01-26T21:59:56Z

I'm not able to check if this actually works for bigger arrays, but with xr.DataArray(pd.Series(range(25_000_000), index=idx)) I get a significant speed-up of about 180x for repr.

Closes "Poor performance of repr of large arrays, particularly jupyter repr" (#4789)
Tests added
Passes pre-commit run --all-files
User visible changes (including notable bug fixes) are documented in whats-new.rst
New functions/methods are listed in api.rst

keewis · 2021-01-26T22:46:32Z

xarray/core/formatting.py

+    if col_width < len(coord):
+        n_values = col_width // 4
+        indices = list(range(0, n_values)) + list(range(-n_values, 0))
+        subset = coord[indices]
+    else:
+        subset = coord


this could probably use some optimization: how big does the MultiIndex have to be so indexing+get_level_variable is faster than just get_level_variable?

Yeah, though it's so fast already relative to how often repr is called...

Could also defer to pandas, which seems to do this (though in a different orientation)

More than fine to leave as a TODO imo
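
A rough way to probe that crossover, for anyone picking up the TODO, is to time level extraction on the full index against indexing first. This is only a sketch using pandas directly: `level_values_full`, `level_values_subset` and `compare` are hypothetical helpers, and `MultiIndex.get_level_values` stands in for xarray's `get_level_variable`, so absolute numbers will differ from what the repr code sees.

```python
import timeit

import pandas as pd


def level_values_full(index, level):
    # Extract one level from the whole MultiIndex.
    return index.get_level_values(level)


def level_values_subset(index, level, n_values):
    # Index the first and last n_values entries, then extract the level.
    indices = list(range(0, n_values)) + list(range(-n_values, 0))
    return index[indices].get_level_values(level)


def compare(size, n_values=10, repeats=20):
    # Build a MultiIndex with 2 * size entries and time both strategies.
    index = pd.MultiIndex.from_product([range(size), range(2)], names=("a", "b"))
    full = timeit.timeit(lambda: level_values_full(index, "a"), number=repeats)
    subset = timeit.timeit(
        lambda: level_values_subset(index, "a", n_values), number=repeats
    )
    return full, subset


if __name__ == "__main__":
    for size in (10, 100, 10_000, 1_000_000):
        full, subset = compare(size)
        print(f"size={size}: full={full:.6f}s subset={subset:.6f}s")
```

Running it over a few sizes should show roughly where indexing first starts to pay off on a given machine.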

I assumed that for everything below 100 elements get_level_variable is faster than indexing first, which also means that I don't have to worry about the case where the index does not have enough elements to be truncated.

Could also defer to pandas

how would I do that?

how would I do that?

I had meant that they have a repr for MultiIndex which is fast, so could we use theirs somehow, despite the different orientation? On reflection, our code is simple, and I agree with your impulse.
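
The simple approach agreed on in this thread can be sketched on a plain list. This is an illustration, not the code in xarray/core/formatting.py: `head_tail_subset`, `n_values` and the `min_length=100` default are hypothetical names, with the 100-element cut-off taken from the comment above.

```python
def head_tail_subset(values, n_values, min_length=100):
    """Return the first and last n_values items, or everything if values is short."""
    # Short inputs are returned whole: the thread assumes indexing is not worth
    # it below about 100 elements, and head and tail could otherwise overlap.
    if len(values) <= min_length or len(values) <= 2 * n_values:
        return list(values)
    # Same index construction as in the diff: a head range plus a negative tail range.
    indices = list(range(0, n_values)) + list(range(-n_values, 0))
    return [values[i] for i in indices]


print(head_tail_subset(list(range(1000)), 3))  # [0, 1, 2, 997, 998, 999]
print(head_tail_subset(list(range(5)), 3))     # [0, 1, 2, 3, 4]
```

Checking the length up front is what removes the need to handle an index that is too short to truncate.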

max-sixty · 2021-01-26T22:48:59Z

This is great, thanks!

We could add an ASV if you like, but also fine to leave for another day / PR

keewis · 2021-01-26T23:38:09Z

We could add an ASV

done, I think?
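
For readers unfamiliar with asv, a benchmark for this case might look roughly like the following. It is a sketch, not the contents of the benchmark file added in this PR: the class name, the `size` attribute and the level names are assumptions. asv runs `setup` before timing each `time_*` method.

```python
import pandas as pd
import xarray as xr


class ReprMultiIndexSketch:
    # Hypothetical knob: number of values per level (size * size entries in total).
    size = 1_000

    def setup(self):
        # Build a DataArray backed by a large two-level MultiIndex.
        index = pd.MultiIndex.from_product(
            [range(self.size), range(self.size)], names=("level_0", "level_1")
        )
        series = pd.Series(range(self.size * self.size), index=index)
        self.da = xr.DataArray(series)

    def time_repr(self):
        # Only the cost of building the text repr is timed.
        repr(self.da)
```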

max-sixty

Super!

asv_bench/benchmarks/repr.py

Co-authored-by: Maximilian Roos <5635139+max-sixty@users.noreply.github.com>

dcherian · 2021-01-29T23:06:13Z

Thanks @keewis and @max-sixty

keewis added 2 commits January 26, 2021 22:54

print the repr of a multiindex using only a subset of the coordinate values

9e3ebf2

don't index if we have less items than available width

1138487

max-sixty approved these changes Jan 26, 2021

keewis added 3 commits January 27, 2021 00:13

don't try to shorten arrays which are way too short

5e719b1

col_width seems to be the maximum number of elements, not characters

f061dc8

add a asv benchmark

e04152f

max-sixty approved these changes Jan 27, 2021

Apply suggestions from code review

8f20f1f

Co-authored-by: Maximilian Roos <5635139+max-sixty@users.noreply.github.com>

dcherian merged commit 39048f9 into pydata:master Jan 29, 2021

keewis deleted the speed-up-repr branch February 2, 2021 02:01
