PERF: groupby performance regression in 1.2.x #41596

Closed
tritemio opened this issue May 20, 2021 · 6 comments · Fixed by #43634
Labels
Groupby, Performance, Regression, Strings

@tritemio

  • [x] I have checked that this issue has not already been reported.

  • [x] I have confirmed this bug exists on the latest version of pandas.

  • [ ] (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import numpy as np
import pandas as pd
pd.__version__

cols = list('abcdefghjkl')
df = pd.DataFrame(np.random.randint(0, 100, size=(1_000_000, len(cols))), columns=cols)
df_str = df.astype(str)
df_string = df.astype('string')

%timeit df_str.groupby('a')[cols[1:]].agg('last')

%timeit df_string.groupby('a')[cols[1:]].agg('last')

Problem description

Pandas 1.2.x is much slower (roughly 9x) than 1.1.5 in the groupby aggregation above when the columns are of string dtype. When the columns are of object dtype, performance is comparable across the two pandas versions.

Expected Output

In pandas 1.1.5 this groupby-aggregation is a bit faster with string dtype than with object dtype:

%timeit df_str.groupby('a')[cols[1:]].agg('last')
680 ms ± 3.86 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df_string.groupby('a')[cols[1:]].agg('last')
544 ms ± 3.88 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Conversely, in pandas 1.2.4 the same groupby-aggregation is 7x slower with string dtype than with object dtype:

%timeit df_str.groupby('a')[cols[1:]].agg('last')
700 ms ± 7.54 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df_string.groupby('a')[cols[1:]].agg('last')
4.93 s ± 104 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I would expect comparable performance between pandas 1.1.5 and 1.2.4; instead, there is a large performance regression in 1.2.4 when performing the groupby aggregation with string dtype.
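
Until this is fixed, a possible workaround (a minimal sketch, untested beyond this shape of data, assuming the slow path is specific to StringDtype) is to cast the affected columns to object for the aggregation and restore the string dtype on the much smaller result:

import numpy as np
import pandas as pd

cols = list('abcdefghjkl')
df_string = pd.DataFrame(
    np.random.randint(0, 100, size=(1_000_000, len(cols))), columns=cols
).astype('string')

# Run the groupby on object dtype (the fast path per the timings above),
# then restore StringDtype on the aggregated result (at most 100 rows here).
result = (
    df_string.astype(object)
    .groupby('a')[cols[1:]]
    .agg('last')
    .astype('string')
)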

Output of pd.show_versions()

INSTALLED VERSIONS

commit : b5958ee
python : 3.8.10.final.0
python-bits : 64
OS : Linux
OS-release : 5.11.0-17-generic
Version : #18-Ubuntu SMP Thu May 6 20:10:11 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.1.5
numpy : 1.20.3
pytz : 2021.1
dateutil : 2.8.1
pip : 21.1.1
setuptools : 49.6.0.post20210108
Cython : 0.29.23
pytest : 6.2.4
hypothesis : None
sphinx : 4.0.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.6.3
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : 7.23.1
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.4.2
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 4.0.0
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.6.3
sqlalchemy : 1.3.23
tables : None
tabulate : 0.8.9
xarray : None
xlrd : None
xlwt : None
numba : 0.53.1

@tritemio added Bug and Needs Triage labels May 20, 2021
@mzeitlin11 (Member)

Thanks for looking into this @tritemio! I think the issue may be that we're now taking the slow path (non-cython) because the logic in _ea_wrap_cython_operation raises an error, sending us to the slower fallback. Something like last would work for object-type string data, so some dispatch logic for StringDtype could be added, though there may be subtleties where that would cause other issues. The fallback pattern is sketched below.
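
A minimal sketch of that fallback pattern (illustrative only; ea_wrap_cython_operation and group_last are hypothetical stand-ins, not the actual pandas internals):

import numpy as np

def ea_wrap_cython_operation(values, how):
    # Stand-in for pandas' _ea_wrap_cython_operation: no cython kernel
    # is registered for this extension-array dtype, so it raises.
    raise NotImplementedError(f"no cython path for {how!r} on this dtype")

def group_last(values, codes, ngroups):
    try:
        return ea_wrap_cython_operation(values, "last")
    except NotImplementedError:
        # Slow fallback: a pure-Python pass over every row; later rows
        # overwrite earlier ones, so each slot holds its group's last value.
        out = np.empty(ngroups, dtype=object)
        for value, code in zip(values, codes):
            out[code] = value
        return out

That per-row Python loop is what surfaces as _aggregate_series_pure_python in the 1.2.4 profile below.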

@mzeitlin11 added ExtensionArray, Groupby, Performance, Strings, and Regression labels and removed Bug, Needs Triage, and ExtensionArray labels May 21, 2021
@tritemio (Author)

@mzeitlin11 thanks for the answer. Yes, the regression is due to a slow Python path taken for StringDtype in pandas 1.2.x, while a fast path was taken until 1.1.5.

I'm attaching the profiling output of the groupby on the two pandas versions. Maybe it can help track down the function at the origin of the slowdown.

Note that the traces are the same until the call to _cython_agg_blocks, after which they take different paths.

# pandas 1.1.5, StringDType
%prun -l 20 -s cumulative  df_string.groupby('a')[cols[1:]].agg('last')

         7137 function calls (7004 primitive calls) in 0.597 seconds

   Ordered by: cumulative time
   List reduced from 413 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.597    0.597 {built-in method builtins.exec}
        1    0.000    0.000    0.597    0.597 <string>:1(<module>)
        1    0.000    0.000    0.596    0.596 generic.py:937(aggregate)
        1    0.000    0.000    0.596    0.596 base.py:281(_aggregate)
        1    0.000    0.000    0.596    0.596 base.py:251(_try_aggregate_string_function)
        1    0.000    0.000    0.596    0.596 groupby.py:1588(last)
        1    0.000    0.000    0.596    0.596 groupby.py:987(_agg_general)
        1    0.000    0.000    0.595    0.595 generic.py:1018(_cython_agg_general)
        1    0.000    0.000    0.595    0.595 generic.py:1026(_cython_agg_blocks)
       10    0.062    0.006    0.593    0.059 ops.py:588(aggregate)
       10    0.001    0.000    0.531    0.053 ops.py:443(_cython_operation)
       10    0.300    0.030    0.300    0.030 ops.py:598(_aggregate)
        1    0.000    0.000    0.146    0.146 ops.py:302(ngroups)
        1    0.000    0.000    0.146    0.146 ops.py:312(result_index)
        1    0.000    0.000    0.146    0.146 grouper.py:573(result_index)
        2    0.000    0.000    0.146    0.073 grouper.py:579(group_index)
        1    0.000    0.000    0.146    0.146 grouper.py:586(_make_codes)
        1    0.009    0.009    0.146    0.146 algorithms.py:518(factorize)
        1    0.000    0.000    0.135    0.135 base.py:801(factorize)
       10    0.000    0.000    0.081    0.008 string_.py:264(astype)

And in pandas 1.2.4:

# pandas 1.2.4, StringDType
%prun -l 20 -s cumulative  df_string.groupby('a')[cols[1:]].agg('last')

320990 function calls (317555 primitive calls) in 5.012 seconds

   Ordered by: cumulative time
   List reduced from 622 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    111/1    0.007    0.000    5.012    5.012 {built-in method builtins.exec}
        1    0.000    0.000    5.011    5.011 generic.py:931(aggregate)
        1    0.000    0.000    5.011    5.011 aggregation.py:549(aggregate)
        1    0.000    0.000    5.011    5.011 base.py:303(_try_aggregate_string_function)
        1    0.000    0.000    5.011    5.011 groupby.py:1705(last)
        1    0.000    0.000    5.011    5.011 groupby.py:1011(_agg_general)
        1    0.000    0.000    5.011    5.011 generic.py:1012(_cython_agg_general)
        1    0.000    0.000    5.010    5.010 generic.py:1020(_cython_agg_blocks)
        2    0.000    0.000    5.010    2.505 managers.py:376(apply)
       10    0.000    0.000    5.009    0.501 blocks.py:372(apply)
       10    0.000    0.000    5.007    0.501 generic.py:1094(blk_func)
       10    0.000    0.000    5.006    0.501 generic.py:1051(py_fallback)
       10    0.000    0.000    5.003    0.500 generic.py:223(aggregate)
       10    0.000    0.000    5.003    0.500 groupby.py:1157(_python_agg_general)
       10    0.155    0.015    4.875    0.488 ops.py:686(agg_series)
       10    0.010    0.001    4.720    0.472 ops.py:735(_aggregate_series_pure_python)
     1010    0.004    0.000    2.963    0.003 ops.py:969(__iter__)
       10    0.000    0.000    2.095    0.209 ops.py:982(_get_sorted_data)
       10    0.000    0.000    2.040    0.204 series.py:791(take)
       11    0.000    0.000    1.942    0.177 _mixins.py:60(take)

For completeness, this is the profiling output with the object str dtype:

# pandas 1.1.5, object str dtype 
%prun -l 20 -s cumulative df_str.groupby('a')[cols[1:]].agg('last')

4428 function calls (4412 primitive calls) in 0.717 seconds

   Ordered by: cumulative time
   List reduced from 356 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.717    0.717 {built-in method builtins.exec}
        1    0.077    0.077    0.717    0.717 <string>:1(<module>)
        1    0.000    0.000    0.638    0.638 generic.py:937(aggregate)
        1    0.000    0.000    0.638    0.638 base.py:281(_aggregate)
        1    0.000    0.000    0.638    0.638 base.py:251(_try_aggregate_string_function)
        1    0.000    0.000    0.638    0.638 groupby.py:1588(last)
        1    0.000    0.000    0.638    0.638 groupby.py:987(_agg_general)
        1    0.000    0.000    0.638    0.638 generic.py:1018(_cython_agg_general)
        1    0.000    0.000    0.636    0.636 generic.py:1026(_cython_agg_blocks)
        1    0.073    0.073    0.506    0.506 ops.py:588(aggregate)
        1    0.000    0.000    0.433    0.433 ops.py:443(_cython_operation)
        1    0.273    0.273    0.273    0.273 ops.py:598(_aggregate)
        4    0.000    0.000    0.131    0.033 algorithms.py:1640(take_nd)
        1    0.000    0.000    0.129    0.129 generic.py:1656(_get_data_to_aggregate)
        1    0.000    0.000    0.129    0.129 base.py:201(_obj_with_exclusions)
        1    0.000    0.000    0.129    0.129 _decorators.py:307(wrapper)
        1    0.000    0.000    0.129    0.129 frame.py:4017(reindex)
        1    0.000    0.000    0.129    0.129 generic.py:4216(reindex)
        1    0.000    0.000    0.129    0.129 frame.py:3871(_reindex_axes)
        1    0.000    0.000    0.129    0.129 frame.py:3908(_reindex_columns)

# pandas 1.2.4, object str dtype
%prun -l 20 -s cumulative df_str.groupby('a')[cols[1:]].agg('last')

         4646 function calls (4618 primitive calls) in 0.711 seconds

   Ordered by: cumulative time
   List reduced from 413 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.711    0.711 {built-in method builtins.exec}
        1    0.072    0.072    0.711    0.711 <string>:1(<module>)
        1    0.000    0.000    0.620    0.620 generic.py:931(aggregate)
        1    0.000    0.000    0.620    0.620 aggregation.py:549(aggregate)
        1    0.000    0.000    0.620    0.620 base.py:303(_try_aggregate_string_function)
        1    0.000    0.000    0.620    0.620 groupby.py:1705(last)
        1    0.000    0.000    0.620    0.620 groupby.py:1011(_agg_general)
        1    0.000    0.000    0.619    0.619 generic.py:1012(_cython_agg_general)
        1    0.000    0.000    0.617    0.617 generic.py:1020(_cython_agg_blocks)
        2    0.000    0.000    0.511    0.256 managers.py:376(apply)
        1    0.000    0.000    0.510    0.510 blocks.py:372(apply)
        1    0.068    0.068    0.510    0.510 generic.py:1094(blk_func)
        1    0.000    0.000    0.442    0.442 ops.py:550(_cython_operation)
        1    0.284    0.284    0.284    0.284 ops.py:664(_aggregate)
        4    0.000    0.000    0.109    0.027 algorithms.py:1661(take_nd)
        1    0.000    0.000    0.107    0.107 generic.py:1603(_get_data_to_aggregate)
        1    0.000    0.000    0.107    0.107 base.py:226(_obj_with_exclusions)
        1    0.000    0.000    0.107    0.107 _decorators.py:310(wrapper)
        1    0.000    0.000    0.107    0.107 frame.py:4157(reindex)
        1    0.000    0.000    0.107    0.107 generic.py:4564(reindex)
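
As an aside, %prun is the IPython front-end for the standard-library cProfile, so the same traces can be collected in a plain script (a sketch, assuming df_string and cols are defined as in the reproduction above):

import cProfile
import pstats

# Profile the aggregation and print the 20 most expensive calls
# ordered by cumulative time, mirroring %prun -l 20 -s cumulative.
cProfile.run("df_string.groupby('a')[cols[1:]].agg('last')", 'groupby.prof')
pstats.Stats('groupby.prof').sort_stats('cumulative').print_stats(20)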


@simonjayhawkins simonjayhawkins added this to the 1.2.5 milestone May 24, 2021
@simonjayhawkins (Member)

Conversely, in pandas 1.2.4 the same groupby-aggregation is 7x slower

[screenshot: benchmark timings]

It appears there is a further slowdown on master for the first case.

@simonjayhawkins (Member)

@pandas-dev/pandas-core moving this to 1.3 (which will probably get moved to 1.3.1), as there are no PRs to fix it and an additional performance regression has been identified for 1.3.

@simonjayhawkins simonjayhawkins modified the milestones: 1.2.5, 1.3 Jun 16, 2021
@simonjayhawkins simonjayhawkins changed the title BUG: groupby performance regression in 1.2.x PERF: groupby performance regression in 1.2.x Jun 25, 2021
@simonjayhawkins simonjayhawkins modified the milestones: 1.3, 1.3.1 Jun 30, 2021
@simonjayhawkins simonjayhawkins modified the milestones: 1.3.1, 1.3.2 Jul 24, 2021
@simonjayhawkins simonjayhawkins modified the milestones: 1.3.2, 1.3.3 Aug 15, 2021
@simonjayhawkins simonjayhawkins modified the milestones: 1.3.3, 1.3.4 Sep 11, 2021
@debnathshoham (Member)

I am not seeing much of a drop in perf between master and 1.1.5 for the object dtype.
Am I missing something @simonjayhawkins?

In [8]: import numpy as np
   ...: import pandas as pd
   ...: pd.__version__
   ...: 
   ...: cols = list('abcdefghjkl')
   ...: df = pd.DataFrame(np.random.randint(0, 100, size=(1_000_000, len(cols))), columns=cols)
   ...: df_str = df.astype(str)
   ...: df_string = df.astype('string')
   ...: 
   ...: %timeit df_str.groupby('a')[cols[1:]].agg('last')
990 ms ± 242 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [9]: pd.__version__
Out[9]: '1.4.0.dev0+722.ge7e7b40722'
In [6]: import numpy as np
   ...: import pandas as pd
   ...: pd.__version__
   ...: 
   ...: cols = list('abcdefghjkl')
   ...: df = pd.DataFrame(np.random.randint(0, 100, size=(1_000_000, len(cols))), columns=cols)
   ...: df_str = df.astype(str)
   ...: df_string = df.astype('string')
   ...: 
   ...: %timeit df_str.groupby('a')[cols[1:]].agg('last')
1.05 s ± 409 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [7]: pd.__version__
Out[7]: '1.1.5'

@simonjayhawkins (Member)

Looks like that got fixed...

%timeit df_str.groupby('a')[cols[1:]].agg('last')
# 769 ms ± 43.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  <-- master (1/6)
# 435 ms ± 19.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) <-- 1.2.4
# 417 ms ± 12.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) <-- 1.1.5

# 316 ms ± 3.96 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) <-- master (21/9)
# 325 ms ± 9.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) <- 1.3.3
# 318 ms ± 5.59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) <- 1.3.2
# 321 ms ± 3.39 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) <- 1.3.1
# 787 ms ± 46.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) <- 1.3.0
