Poor performance of repr of large arrays, particularly jupyter repr #4789

Closed
max-sixty opened this issue Jan 11, 2021 · 5 comments · Fixed by #4846

@max-sixty
Collaborator

What happened:

The _repr_html_ method of large arrays seems very slow: 4.78s for an array of 100 million values. The general repr also seems fairly slow, at 1.87s. Here's a quick example. I haven't yet investigated how much this depends on there being a MultiIndex.

What you expected to happen:

We should really focus on having good repr performance, given how essential it is to any REPL workflow.

Minimal Complete Verifiable Example:

In [10]: import xarray as xr
    ...: import numpy as np
    ...: import pandas as pd

In [11]: idx = pd.MultiIndex.from_product([range(10_000), range(10_000)])

In [12]: df = pd.DataFrame(range(100_000_000), index=idx)

In [13]: da = xr.DataArray(df)

In [14]: da
Out[14]:
<xarray.DataArray (dim_0: 100000000, dim_1: 1)>
array([[       0],
       [       1],
       [       2],
       ...,
       [99999997],
       [99999998],
       [99999999]])
Coordinates:
  * dim_0          (dim_0) MultiIndex
  - dim_0_level_0  (dim_0) int64 0 0 0 0 0 0 0 ... 9999 9999 9999 9999 9999 9999
  - dim_0_level_1  (dim_0) int64 0 1 2 3 4 5 6 ... 9994 9995 9996 9997 9998 9999
  * dim_1          (dim_1) int64 0


In [26]: %timeit repr(da)
1.87 s ± 7.33 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [27]: %timeit da._repr_html_()
4.78 s ± 1.8 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

Environment:

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.8.7 (default, Dec 30 2020, 10:13:08)
[Clang 12.0.0 (clang-1200.0.32.28)]
python-bits: 64
OS: Darwin
OS-release: 19.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: None
libnetcdf: None

xarray: 0.16.3.dev48+gbf0fe2ca
pandas: 1.1.3
numpy: 1.19.2
scipy: 1.5.3
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.5.0
cftime: 1.2.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2.30.0
distributed: None
matplotlib: 3.3.2
cartopy: None
seaborn: 0.11.0
numbagg: installed
pint: 0.16.1
setuptools: 51.1.1
pip: 20.3.3
conda: None
pytest: 6.1.1
IPython: 7.19.0
sphinx: None

@rabernat
Contributor

I uncovered this issue with Dask's SVG in its _repr_html_ method: dask/dask#6670. The fix made a big difference in repr size. Possibly related?

@max-sixty
Collaborator Author

One quick observation is that it's related to the MultiIndex: if we swap out the index for idx = pd.Index(range(100_000_000)), the time drops from 1.8s to 812 µs.

@max-sixty
Collaborator Author

The rabbit hole went deeper than I expected. I need to sign off now, but leaving what I have in case someone else has some insight.

Essentially, we call get_level_variable on the coord in formatting.py, which calls into pandas' get_level_values. This is really slow on large MultiIndexes! I think it's recreating the whole index. I got as deep as algos.take_1d.
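
To illustrate why this scales with the length of the index, here is a minimal sketch using only public pandas attributes (levels, codes, take). It uses a much smaller index than the one in the report, and it is an approximation of what the internal take_1d call does, not the pandas implementation itself.

```python
import pandas as pd

# Small stand-in for the 10_000 x 10_000 index in the report.
idx = pd.MultiIndex.from_product([range(100), range(100)])

# A MultiIndex stores each level's unique values once, plus one integer
# code per row that points into that level.
level0 = idx.levels[0]  # 100 unique values
codes0 = idx.codes[0]   # 10_000 codes, one per row

# get_level_values expands the level to the full length of the index...
full = idx.get_level_values(0)

# ...which, with no missing values, matches a take over the codes.
manual = level0.take(codes0)

print(len(level0), len(full), full.equals(manual))
```

The expanded result has one entry per row, so the cost grows with the full index length even though the repr only ever prints a handful of values.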

I think we can probably do something smarter to only call this on the first & last items in the MultiIndex.
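
A rough sketch of that idea, in plain pandas: slice the MultiIndex down to its first and last few rows before asking for the level values. The helper name level_edges and the default n=5 are hypothetical and only for illustration; they are not part of xarray or pandas.

```python
import numpy as np
import pandas as pd


def level_edges(index, level, n=5):
    """Return the first and last n values of one MultiIndex level.

    Hypothetical helper: subsets the index first, so only 2 * n rows are
    expanded instead of the whole index.
    """
    size = len(index)
    if size <= 2 * n:
        return index.get_level_values(level)
    positions = np.concatenate([np.arange(n), np.arange(size - n, size)])
    return index[positions].get_level_values(level)


idx = pd.MultiIndex.from_product([range(1_000), range(1_000)])
print(list(level_edges(idx, 0, n=3)))
print(list(level_edges(idx, 1, n=3)))
```

Since a repr only shows the head and tail of each level, the subset should carry everything the summary line needs, assuming n is large enough to fill the available width.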

For reference, here's the output of line_profiler, a good profiler for figuring this sort of thing out:

%lprun  -f formatting._summarize_coord_levels -f IndexVariable.get_level_variable -f pd.MultiIndex.get_level_values -f pd.MultiIndex._get_level_values coords_repr(da.coords)



Total time: 1.91029 s
File: /Users/maximilian/workspace/xarray/xarray/core/formatting.py
Function: _summarize_coord_levels at line 302

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   302                                           def _summarize_coord_levels(coord, col_width, marker="-"):
   303         2    1910185.0 955092.5    100.0      return "\n".join(
   304                                                   summarize_variable(
   305                                                       lname, coord.get_level_variable(lname), col_width, marker=marker
   306                                                   )
   307         1        102.0    102.0      0.0          for lname in coord.level_names
   308                                               )

Total time: 1.81777 s
File: /Users/maximilian/workspace/xarray/xarray/core/variable.py
Function: get_level_variable at line 2687

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  2687                                               def get_level_variable(self, level):
  2688                                                   """Return a new IndexVariable from a given MultiIndex level."""
  2689         2        303.0    151.5      0.0          if self.level_names is None:
  2690                                                       raise ValueError("IndexVariable %r has no MultiIndex" % self.name)
  2691         2        216.0    108.0      0.0          index = self.to_index()
  2692         2    1817254.0 908627.0    100.0          return type(self)(self.dims, index.get_level_values(level))

Total time: 1.81709 s
File: /usr/local/lib/python3.9/site-packages/pandas/core/indexes/multi.py
Function: _get_level_values at line 1617

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1617                                               def _get_level_values(self, level, unique=False):
  1618                                                   """
  1619                                                   Return vector of label values for requested level,
  1620                                                   equal to the length of the index
  1621                                           
  1622                                                   **this is an internal method**
  1623                                           
  1624                                                   Parameters
  1625                                                   ----------
  1626                                                   level : int level
  1627                                                   unique : bool, default False
  1628                                                       if True, drop duplicated values
  1629                                           
  1630                                                   Returns
  1631                                                   -------
  1632                                                   values : ndarray
  1633                                                   """
  1634         2         47.0     23.5      0.0          lev = self.levels[level]
  1635         2          5.0      2.5      0.0          level_codes = self.codes[level]
  1636         2          2.0      1.0      0.0          name = self._names[level]
  1637         2          1.0      0.5      0.0          if unique:
  1638                                                       level_codes = algos.unique(level_codes)
  1639         2    1816971.0 908485.5    100.0          filled = algos.take_1d(lev._values, level_codes, fill_value=lev._na_value)
  1640         2         60.0     30.0      0.0          return lev._shallow_copy(filled, name=name)

Total time: 1.81712 s
File: /usr/local/lib/python3.9/site-packages/pandas/core/indexes/multi.py
Function: get_level_values at line 1642

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1642                                               def get_level_values(self, level):
  1643                                                   """
  1644                                                   Return vector of label values for requested level.
  1645                                           
  1646                                                   Length of returned vector is equal to the length of the index.
  1647                                           
  1648                                                   Parameters
  1649                                                   ----------
  1650                                                   level : int or str
  1651                                                       ``level`` is either the integer position of the level in the
  1652                                                       MultiIndex, or the name of the level.
  1653                                           
  1654                                                   Returns
  1655                                                   -------
  1656                                                   values : Index
  1657                                                       Values is a level of this MultiIndex converted to
  1658                                                       a single :class:`Index` (or subclass thereof).
  1659                                           
  1660                                                   Examples
  1661                                                   --------
  1662                                                   Create a MultiIndex:
  1663                                           
  1664                                                   >>> mi = pd.MultiIndex.from_arrays((list('abc'), list('def')))
  1665                                                   >>> mi.names = ['level_1', 'level_2']
  1666                                           
  1667                                                   Get level values by supplying level as either integer or name:
  1668                                           
  1669                                                   >>> mi.get_level_values(0)
  1670                                                   Index(['a', 'b', 'c'], dtype='object', name='level_1')
  1671                                                   >>> mi.get_level_values('level_2')
  1672                                                   Index(['d', 'e', 'f'], dtype='object', name='level_2')
  1673                                                   """
  1674         2         11.0      5.5      0.0          level = self._get_level_number(level)
  1675         2    1817107.0 908553.5    100.0          values = self._get_level_values(level)
  1676         2          2.0      1.0      0.0          return values

@keewis
Collaborator

keewis commented Jan 25, 2021

that seems to be the main issue. With

diff --git a/xarray/core/formatting.py b/xarray/core/formatting.py
index 282620e3..f825ed85 100644
--- a/xarray/core/formatting.py
+++ b/xarray/core/formatting.py
@@ -300,9 +300,11 @@ def _summarize_coord_multiindex(coord, col_width, marker):
 
 
 def _summarize_coord_levels(coord, col_width, marker="-"):
+    indices = list(range(10)) + list(range(-10, 0))
+    subset = coord[indices]
     return "\n".join(
         summarize_variable(
-            lname, coord.get_level_variable(lname), col_width, marker=marker
+            lname, subset.get_level_variable(lname), col_width, marker=marker
         )
         for lname in coord.level_names
     )

I get a speed-up of about 180x (for xr.DataArray(pd.Series(25_000_000, index=idx)); I'm not sure whether the speed-up is as significant for bigger arrays). We should probably make the shape of indices depend on col_width, though.
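
One way the width-dependent selection might look is sketched below. The function edge_positions, the max_width parameter, and the bound of two characters per printed element are all assumptions made for illustration; the patch above uses a fixed 10 + 10 items, and nothing here is taken from the xarray code base.

```python
import numpy as np
import pandas as pd


def edge_positions(size, max_width):
    """Positions of the elements that could possibly fit in max_width chars.

    Assumption for this sketch: every printed element needs at least two
    characters (one character plus a separating space), so no more than
    max_width // 2 elements can ever be shown.
    """
    max_items = max(max_width // 2, 1)
    if size <= max_items:
        return np.arange(size)
    n_front = (max_items + 1) // 2  # front gets the extra item if odd
    n_back = max_items // 2
    return np.concatenate([np.arange(n_front), np.arange(size - n_back, size)])


idx = pd.MultiIndex.from_product([range(1_000), range(1_000)])
subset = idx[edge_positions(len(idx), max_width=80)]
print(len(subset))
```

With this bound the subset size depends only on the display width, not on the length of the index, which is the property the fixed-size patch already relies on.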

@max-sixty
Collaborator Author

Yes great, I think that would be a great cut-through solution!
