speed up the repr for big MultiIndex objects #4846

Merged
merged 6 commits into pydata:master on Jan 29, 2021

keewis
Collaborator

@keewis keewis commented Jan 26, 2021

I'm not able to check whether this actually works for bigger arrays, but with `xr.DataArray(pd.Series(range(25_000_000), index=idx))` I get a speed-up of about 180x for `repr`.

Comment on lines 303 to 308
    if col_width < len(coord):
        n_values = col_width // 4
        indices = list(range(0, n_values)) + list(range(-n_values, 0))
        subset = coord[indices]
    else:
        subset = coord
Collaborator Author

This could probably use some optimization: how big does the MultiIndex have to be for indexing + `get_level_variable` to be faster than `get_level_variable` alone?
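The head/tail selection in the snippet under review can be sketched in plain Python as follows. This is only an illustration: `head_tail_indices` and `truncate` are hypothetical helper names, and a plain list stands in for the coordinate. Because `2 * (col_width // 4)` is at most `col_width / 2`, which is smaller than `len(coord)` whenever the branch is taken, the head and tail positions cannot overlap.

```python
def head_tail_indices(length, col_width):
    """Positions of the first and last `col_width // 4` elements.

    Returns None when the sequence is not longer than `col_width`,
    mirroring the `else` branch of the snippet above.
    """
    if col_width >= length:
        return None
    n_values = col_width // 4
    # Negative positions address the tail, as in the reviewed code.
    return list(range(0, n_values)) + list(range(-n_values, 0))


def truncate(values, col_width):
    """Hypothetical helper: keep only the head and tail of `values`."""
    indices = head_tail_indices(len(values), col_width)
    if indices is None:
        return list(values)
    return [values[i] for i in indices]


coord = list(range(1000))  # stand-in for a large index (assumption)
subset = truncate(coord, col_width=80)
print(len(subset))              # 40
print(subset[:3], subset[-3:])  # [0, 1, 2] [997, 998, 999]
```

Only the selected positions are then passed on to the per-level extraction, which is where the saving for very large indexes would come from.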

Collaborator

Yeah, though it's so fast already relative to how often repr is called...

Could also defer to pandas, which seems to do this (though in a different orientation)

More than fine to leave as a TODO imo

Collaborator Author

@keewis keewis Jan 26, 2021

I assumed that for everything below 100 elements get_level_variable is faster than indexing first, which also means that I don't have to worry about the case where the index does not have enough elements to be truncated.
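A minimal sketch of that length guard, assuming the cutoff is applied as `length > 100` (the exact boundary and all names here are assumptions, not the merged code). `level_values` is a stand-in for the per-level extraction, whose cost is assumed to grow with the number of elements.

```python
THRESHOLD = 100  # cutoff from the comment above; the exact boundary is an assumption


def select_for_repr(pairs, col_width, threshold=THRESHOLD):
    """Return the elements whose levels need to be extracted for the repr."""
    n = len(pairs)
    if n > threshold and col_width < n:
        n_values = col_width // 4
        # `pairs[n - n_values:]` rather than `pairs[-n_values:]`, so that
        # n_values == 0 yields an empty tail instead of the whole list.
        return pairs[:n_values] + pairs[n - n_values:]
    return pairs


def level_values(pairs, level):
    # Stand-in for the per-level extraction: one pass over all given elements.
    return [pair[level] for pair in pairs]


small = [(i, i % 3) for i in range(50)]
large = [(i, i % 3) for i in range(100_000)]

# Short index: left untouched even though col_width < len(small).
print(len(level_values(select_for_repr(small, 20), 0)))  # 50
# Long index: only head and tail are processed.
print(len(level_values(select_for_repr(large, 80), 0)))  # 40
```

With the guard in place, any index that reaches the truncation branch has more than 100 elements, so for typical column widths there are always enough elements to fill both head and tail.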

> Could also defer to pandas

how would I do that?

Collaborator

> how would I do that?

I meant that pandas has a fast repr for MultiIndex, so could we use theirs somehow, despite the different orientation? On reflection, our code is simple, and I agree with your impulse.

@max-sixty
Collaborator

This is great, thanks!

We could add an ASV if you like, but also fine to leave for another day / PR

@keewis
Collaborator Author

keewis commented Jan 26, 2021

> We could add an ASV

done, I think?
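For readers unfamiliar with ASV (airspeed velocity), a benchmark for this case might look roughly like the sketch below. It is not the benchmark added in this PR: the class name, index size, and level names are assumptions. ASV runs `setup` before timing and times every method whose name starts with `time_`.

```python
class ReprMultiIndex:
    """Hypothetical ASV-style benchmark for the repr of a MultiIndex-backed array."""

    def setup(self):
        # Imports live in setup so this sketch can be defined without
        # pandas/xarray installed; real benchmark modules usually import at the top.
        import pandas as pd
        import xarray as xr

        n = 1_000  # assumed size: 1_000 * 1_000 = 1_000_000 index entries
        index = pd.MultiIndex.from_product(
            [range(n), range(n)], names=("level_0", "level_1")
        )
        series = pd.Series(range(n * n), index=index)
        self.da = xr.DataArray(series)

    def time_repr(self):
        # Plain-text repr, the case discussed in this PR.
        repr(self.da)

    def time_repr_html(self):
        # HTML repr, as used by Jupyter.
        self.da._repr_html_()
```

Keeping the construction in `setup` means only the repr calls themselves are timed.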

Collaborator

@max-sixty max-sixty left a comment

Super!

asv_bench/benchmarks/repr.py: two review threads (outdated, resolved)
Co-authored-by: Maximilian Roos <5635139+max-sixty@users.noreply.github.com>
@dcherian dcherian merged commit 39048f9 into pydata:master Jan 29, 2021
@dcherian
Contributor

Thanks @keewis and @max-sixty

dcherian added a commit to dcherian/xarray that referenced this pull request Jan 29, 2021
* upstream/master:
  speed up the repr for big MultiIndex objects (pydata#4846)
  dim -> coord in DataArray.integrate (pydata#3993)
  WIP: backend interface, now it uses subclassing  (pydata#4836)
  weighted: small improvements (pydata#4818)
  Update related-projects.rst (pydata#4844)
  iris update doc url (pydata#4845)
  Faster unstacking (pydata#4746)
  Allow swap_dims to take kwargs (pydata#4841)
  Move skip ci instructions to contributing guide (pydata#4829)
  fix issues in drop_sel and drop_isel (pydata#4828)
  Bugfix in list_engine (pydata#4811)
  Add drop_isel (pydata#4819)
  Fix RST.
  Remove the references to `_file_obj` outside low level code paths, change to `_close` (pydata#4809)
@keewis keewis deleted the speed-up-repr branch February 2, 2021 02:01
Development

Successfully merging this pull request may close these issues.

Poor performance of repr of large arrays, particularly jupyter repr
3 participants