Poor performance of repr of large arrays, particularly jupyter repr #4789

max-sixty · 2021-01-11T00:28:24Z

What happened:

The _repr_html_ method of large arrays seems very slow: 4.78 s in the case of a 100-million-value array. The general repr also seems fairly slow at 1.87 s. Here's a quick example. I haven't yet investigated how dependent it is on there being a MultiIndex.

What you expected to happen:

We should really focus on having good repr performance, given how essential it is to any REPL workflow.

Minimal Complete Verifiable Example:

In [10]: import xarray as xr
    ...: import numpy as np
    ...: import pandas as pd

In [11]: idx = pd.MultiIndex.from_product([range(10_000), range(10_000)])

In [12]: df = pd.DataFrame(range(100_000_000), index=idx)

In [13]: da = xr.DataArray(df)

In [14]: da
Out[14]:
<xarray.DataArray (dim_0: 100000000, dim_1: 1)>
array([[       0],
       [       1],
       [       2],
       ...,
       [99999997],
       [99999998],
       [99999999]])
Coordinates:
  * dim_0          (dim_0) MultiIndex
  - dim_0_level_0  (dim_0) int64 0 0 0 0 0 0 0 ... 9999 9999 9999 9999 9999 9999
  - dim_0_level_1  (dim_0) int64 0 1 2 3 4 5 6 ... 9994 9995 9996 9997 9998 9999
  * dim_1          (dim_1) int64 0


In [26]: %timeit repr(da)
1.87 s ± 7.33 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [27]: %timeit da._repr_html_()
4.78 s ± 1.8 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

Environment:

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.8.7 (default, Dec 30 2020, 10:13:08)
[Clang 12.0.0 (clang-1200.0.32.28)]
python-bits: 64
OS: Darwin
OS-release: 19.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: None
libnetcdf: None

xarray: 0.16.3.dev48+gbf0fe2ca
pandas: 1.1.3
numpy: 1.19.2
scipy: 1.5.3
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.5.0
cftime: 1.2.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.3.2
dask: 2.30.0
distributed: None
matplotlib: 3.3.2
cartopy: None
seaborn: 0.11.0
numbagg: installed
pint: 0.16.1
setuptools: 51.1.1
pip: 20.3.3
conda: None
pytest: 6.1.1
IPython: 7.19.0
sphinx: None


rabernat · 2021-01-12T03:36:26Z

I uncovered this issue with Dask's SVG in its _repr_html_ method: dask/dask#6670. The fix made a big difference in repr size. Possibly related?

max-sixty · 2021-01-25T02:46:40Z

One quick observation is that it's related to the MultiIndex: if we swap out the index for idx = pd.Index(range(100_000_000)), the time drops from 1.8 s to 812 µs.

max-sixty · 2021-01-25T03:36:23Z

The rabbit hole went deeper than I expected. I need to sign off now, but leaving what I have in case someone else has some insight.

Essentially, we call get_level_variable on the coord in formatting.py, which in turn calls get_level_values in pandas. This is really slow on large MultiIndexes! I think it's recreating the whole index. I got as deep as algos.take_1d.
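To illustrate why this call scales with the size of the index: a MultiIndex stores each level as a small array of unique values plus one integer code per row, and get_level_values expands those codes into an array as long as the whole index. The following is a minimal sketch on a scaled-down index (the sizes are picked for illustration and are not the ones from the report):

```python
import numpy as np
import pandas as pd

# Scaled-down version of the index in the example (1e6 rows instead of 1e8).
mi = pd.MultiIndex.from_product([range(1_000), range(1_000)])

level0 = mi.levels[0]  # unique values of the first level: 1_000 entries
codes0 = mi.codes[0]   # one integer code per row: 1_000_000 entries

# get_level_values materialises a full-length Index for the level.
full = mi.get_level_values(0)

# The same result by hand: look up every code in the array of unique values.
manual = level0.to_numpy()[codes0]

print(len(level0), len(codes0), len(full))
```

The lookup touches every row, so its cost grows with the length of the index even though the repr only displays a handful of values.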

I think we can probably do something smarter to only call this on the first & last items in the MultiIndex.
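As a rough sketch of that idea in plain pandas (not xarray's actual code): select the first and last few positions of the MultiIndex first, and only then ask for the level values. The count `n` is a hypothetical choice for illustration.

```python
import numpy as np
import pandas as pd

mi = pd.MultiIndex.from_product([range(1_000), range(1_000)])

n = 5  # hypothetical number of items to keep at each end
positions = np.concatenate([np.arange(n), np.arange(-n, 0)])

# Positional indexing returns a much smaller MultiIndex with 2 * n rows.
subset = mi[positions]

# Level values are now computed for 2 * n rows instead of len(mi) rows.
head_tail = subset.get_level_values(1)

# Slow path, computed here only to check that the two approaches agree.
full = mi.get_level_values(1)
expected = list(full[:n]) + list(full[-n:])

print(list(head_tail))
```

Since the repr elides the middle of long coordinates anyway, the values taken from the subset should be enough to build the same summary line.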

For reference, here's the output of line_profiler, a good profiler for figuring this sort of thing out:

%lprun  -f formatting._summarize_coord_levels -f IndexVariable.get_level_variable -f pd.MultiIndex.get_level_values -f pd.MultiIndex._get_level_values coords_repr(da.coords)



Total time: 1.91029 s
File: /Users/maximilian/workspace/xarray/xarray/core/formatting.py
Function: _summarize_coord_levels at line 302

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   302                                           def _summarize_coord_levels(coord, col_width, marker="-"):
   303         2    1910185.0 955092.5    100.0      return "\n".join(
   304                                                   summarize_variable(
   305                                                       lname, coord.get_level_variable(lname), col_width, marker=marker
   306                                                   )
   307         1        102.0    102.0      0.0          for lname in coord.level_names
   308                                               )

Total time: 1.81777 s
File: /Users/maximilian/workspace/xarray/xarray/core/variable.py
Function: get_level_variable at line 2687

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  2687                                               def get_level_variable(self, level):
  2688                                                   """Return a new IndexVariable from a given MultiIndex level."""
  2689         2        303.0    151.5      0.0          if self.level_names is None:
  2690                                                       raise ValueError("IndexVariable %r has no MultiIndex" % self.name)
  2691         2        216.0    108.0      0.0          index = self.to_index()
  2692         2    1817254.0 908627.0    100.0          return type(self)(self.dims, index.get_level_values(level))

Total time: 1.81709 s
File: /usr/local/lib/python3.9/site-packages/pandas/core/indexes/multi.py
Function: _get_level_values at line 1617

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1617                                               def _get_level_values(self, level, unique=False):
  1618                                                   """
  1619                                                   Return vector of label values for requested level,
  1620                                                   equal to the length of the index
  1621                                           
  1622                                                   **this is an internal method**
  1623                                           
  1624                                                   Parameters
  1625                                                   ----------
  1626                                                   level : int level
  1627                                                   unique : bool, default False
  1628                                                       if True, drop duplicated values
  1629                                           
  1630                                                   Returns
  1631                                                   -------
  1632                                                   values : ndarray
  1633                                                   """
  1634         2         47.0     23.5      0.0          lev = self.levels[level]
  1635         2          5.0      2.5      0.0          level_codes = self.codes[level]
  1636         2          2.0      1.0      0.0          name = self._names[level]
  1637         2          1.0      0.5      0.0          if unique:
  1638                                                       level_codes = algos.unique(level_codes)
  1639         2    1816971.0 908485.5    100.0          filled = algos.take_1d(lev._values, level_codes, fill_value=lev._na_value)
  1640         2         60.0     30.0      0.0          return lev._shallow_copy(filled, name=name)

Total time: 1.81712 s
File: /usr/local/lib/python3.9/site-packages/pandas/core/indexes/multi.py
Function: get_level_values at line 1642

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1642                                               def get_level_values(self, level):
  1643                                                   """
  1644                                                   Return vector of label values for requested level.
  1645                                           
  1646                                                   Length of returned vector is equal to the length of the index.
  1647                                           
  1648                                                   Parameters
  1649                                                   ----------
  1650                                                   level : int or str
  1651                                                       ``level`` is either the integer position of the level in the
  1652                                                       MultiIndex, or the name of the level.
  1653                                           
  1654                                                   Returns
  1655                                                   -------
  1656                                                   values : Index
  1657                                                       Values is a level of this MultiIndex converted to
  1658                                                       a single :class:`Index` (or subclass thereof).
  1659                                           
  1660                                                   Examples
  1661                                                   --------
  1662                                                   Create a MultiIndex:
  1663                                           
  1664                                                   >>> mi = pd.MultiIndex.from_arrays((list('abc'), list('def')))
  1665                                                   >>> mi.names = ['level_1', 'level_2']
  1666                                           
  1667                                                   Get level values by supplying level as either integer or name:
  1668                                           
  1669                                                   >>> mi.get_level_values(0)
  1670                                                   Index(['a', 'b', 'c'], dtype='object', name='level_1')
  1671                                                   >>> mi.get_level_values('level_2')
  1672                                                   Index(['d', 'e', 'f'], dtype='object', name='level_2')
  1673                                                   """
  1674         2         11.0      5.5      0.0          level = self._get_level_number(level)
  1675         2    1817107.0 908553.5    100.0          values = self._get_level_values(level)
  1676         2          2.0      1.0      0.0          return values

keewis · 2021-01-25T17:33:19Z

That seems to be the main issue. With

diff --git a/xarray/core/formatting.py b/xarray/core/formatting.py
index 282620e3..f825ed85 100644
--- a/xarray/core/formatting.py
+++ b/xarray/core/formatting.py
@@ -300,9 +300,11 @@ def _summarize_coord_multiindex(coord, col_width, marker):
 
 
 def _summarize_coord_levels(coord, col_width, marker="-"):
+    indices = list(range(10)) + list(range(-10, 0))
+    subset = coord[indices]
     return "\n".join(
         summarize_variable(
-            lname, coord.get_level_variable(lname), col_width, marker=marker
+            lname, subset.get_level_variable(lname), col_width, marker=marker
         )
         for lname in coord.level_names
     )

I get a speed-up of about 180x (for xr.DataArray(pd.Series(25_000_000, index=idx)); I'm not sure if the speed-up is as significant for bigger arrays). We should probably make the shape of indices depend on col_width, though.
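One way the index selection could be parameterised is sketched below. `head_tail_positions` and its `n_values` argument are hypothetical names for illustration and are not part of xarray; how `n_values` should be derived from col_width is left open here. The helper also guards against short coordinates, where a fixed range(10) would presumably index out of bounds.

```python
def head_tail_positions(length, n_values):
    """Return positions of the first and last `n_values` items.

    Hypothetical helper for illustration; not part of xarray.
    """
    if length <= 2 * n_values:
        # Short index: use every position once instead of going out of bounds
        # or selecting the same item twice.
        return list(range(length))
    return list(range(n_values)) + list(range(length - n_values, length))


print(head_tail_positions(100, 3))
print(head_tail_positions(5, 10))
```

The returned list could then be used in place of the hard-coded indices in the patch above, e.g. coord[head_tail_positions(len(coord), n_values)].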

max-sixty · 2021-01-25T23:50:56Z

Yes great, I think that would be a great cut-through solution!

dcherian added the topic-html-repr label Jan 25, 2021

keewis added the topic-performance label Jan 25, 2021

keewis mentioned this issue Jan 26, 2021 in "speed up the repr for big MultiIndex objects" #4846 (merged)

dcherian closed this as completed in #4846 Jan 29, 2021
