PERF: groupby performance regression in 1.2.x #41596

Closed
tritemio opened this issue May 20, 2021 · 6 comments · Fixed by #43634
Labels
Groupby, Performance, Regression, Strings

@tritemio

  • [x] I have checked that this issue has not already been reported.

  • [x] I have confirmed this bug exists on the latest version of pandas.

  • [ ] (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import numpy as np
import pandas as pd
pd.__version__

cols = list('abcdefghjkl')
df = pd.DataFrame(np.random.randint(0, 100, size=(1_000_000, len(cols))), columns=cols)
df_str = df.astype(str)
df_string = df.astype('string')

%timeit df_str.groupby('a')[cols[1:]].agg('last')

%timeit df_string.groupby('a')[cols[1:]].agg('last')

Problem description

Pandas 1.2.x is much slower (roughly 9x) than 1.1.5 in the groupby aggregation above when the columns are of string dtype. When the columns are of object dtype, performance is comparable across the two pandas versions.

Expected Output

In pandas 1.1.5 this groupby-aggregation is a bit faster with string dtype than with object dtype:

%timeit df_str.groupby('a')[cols[1:]].agg('last')
680 ms ± 3.86 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df_string.groupby('a')[cols[1:]].agg('last')
544 ms ± 3.88 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Conversely, in pandas 1.2.4 the same groupby-aggregation is 7x slower with string dtype than with object dtype:

%timeit df_str.groupby('a')[cols[1:]].agg('last')
700 ms ± 7.54 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df_string.groupby('a')[cols[1:]].agg('last')
4.93 s ± 104 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I would expect comparable performance between pandas 1.1.5 and 1.2.4; instead, there is a large performance regression in 1.2.4 when performing the groupby aggregation with string dtype.
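
Until this is fixed, a possible workaround (a minimal sketch, untested beyond this shape of data, assuming the slow path is specific to StringDtype) is to cast the affected columns to object for the aggregation and restore the string dtype on the much smaller result:

import numpy as np
import pandas as pd

cols = list('abcdefghjkl')
df_string = pd.DataFrame(
    np.random.randint(0, 100, size=(1_000_000, len(cols))), columns=cols
).astype('string')

# Run the groupby on object dtype (the fast path per the timings above),
# then restore StringDtype on the aggregated result (at most 100 rows here).
result = (
    df_string.astype(object)
    .groupby('a')[cols[1:]]
    .agg('last')
    .astype('string')
)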

Output of pd.show_versions()

INSTALLED VERSIONS

commit : b5958ee
python : 3.8.10.final.0
python-bits : 64
OS : Linux
OS-release : 5.11.0-17-generic
Version : #18-Ubuntu SMP Thu May 6 20:10:11 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.1.5
numpy : 1.20.3
pytz : 2021.1
dateutil : 2.8.1
pip : 21.1.1
setuptools : 49.6.0.post20210108
Cython : 0.29.23
pytest : 6.2.4
hypothesis : None
sphinx : 4.0.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.6.3
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : 7.23.1
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.4.2
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 4.0.0
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.6.3
sqlalchemy : 1.3.23
tables : None
tabulate : 0.8.9
xarray : None
xlrd : None
xlwt : None
numba : 0.53.1

@tritemio added Bug and Needs Triage labels May 20, 2021
@mzeitlin11 (Member)

Thanks for looking into this @tritemio! I think the issue may be that we're now taking the slow path (non-cython) because the logic in _ea_wrap_cython_operation raises an error, sending us to the slower fallback. Something like last would work for object-type string data, so some dispatch logic for StringDtype could be added, though there may be subtleties where that would cause other issues. The fallback pattern is sketched below.
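
A minimal sketch of that fallback pattern (illustrative only; ea_wrap_cython_operation and group_last are hypothetical stand-ins, not the actual pandas internals):

import numpy as np

def ea_wrap_cython_operation(values, how):
    # Stand-in for pandas' _ea_wrap_cython_operation: no cython kernel
    # is registered for this extension-array dtype, so it raises.
    raise NotImplementedError(f"no cython path for {how!r} on this dtype")

def group_last(values, codes, ngroups):
    try:
        return ea_wrap_cython_operation(values, "last")
    except NotImplementedError:
        # Slow fallback: a pure-Python pass over every row; later rows
        # overwrite earlier ones, so each slot holds its group's last value.
        out = np.empty(ngroups, dtype=object)
        for value, code in zip(values, codes):
            out[code] = value
        return out

That per-row Python loop is what surfaces as _aggregate_series_pure_python in the 1.2.4 profile below.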

@mzeitlin11 added ExtensionArray, Groupby, Performance, Strings, and Regression labels and removed Bug, Needs Triage, and ExtensionArray labels May 21, 2021
@tritemio (Author)

@mzeitlin11 thanks for the answer. Yes, the regression is due to a slow Python path taken for StringDtype in pandas 1.2.x, while a fast path was taken until 1.1.5.

I'm attaching the profiling output of the groupby on the two pandas versions. Maybe it can help track down the function at the origin of the slowdown.

Note that the traces are the same until the call to _cython_agg_blocks, after which they take different paths.

# pandas 1.1.5, StringDType
%prun -l 20 -s cumulative  df_string.groupby('a')[cols[1:]].agg('last')

         7137 function calls (7004 primitive calls) in 0.597 seconds

   Ordered by: cumulative time
   List reduced from 413 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.597    0.597 {built-in method builtins.exec}
        1    0.000    0.000    0.597    0.597 <string>:1(<module>)
        1    0.000    0.000    0.596    0.596 generic.py:937(aggregate)
        1    0.000    0.000    0.596    0.596 base.py:281(_aggregate)
        1    0.000    0.000    0.596    0.596 base.py:251(_try_aggregate_string_function)
        1    0.000    0.000    0.596    0.596 groupby.py:1588(last)
        1    0.000    0.000    0.596    0.596 groupby.py:987(_agg_general)
        1    0.000    0.000    0.595    0.595 generic.py:1018(_cython_agg_general)
        1    0.000    0.000    0.595    0.595 generic.py:1026(_cython_agg_blocks)
       10    0.062    0.006    0.593    0.059 ops.py:588(aggregate)
       10    0.001    0.000    0.531    0.053 ops.py:443(_cython_operation)
       10    0.300    0.030    0.300    0.030 ops.py:598(_aggregate)
        1    0.000    0.000    0.146    0.146 ops.py:302(ngroups)
        1    0.000    0.000    0.146    0.146 ops.py:312(result_index)
        1    0.000    0.000    0.146    0.146 grouper.py:573(result_index)
        2    0.000    0.000    0.146    0.073 grouper.py:579(group_index)
        1    0.000    0.000    0.146    0.146 grouper.py:586(_make_codes)
        1    0.009    0.009    0.146    0.146 algorithms.py:518(factorize)
        1    0.000    0.000    0.135    0.135 base.py:801(factorize)
       10    0.000    0.000    0.081    0.008 string_.py:264(astype)

And in pandas 1.2.4:

# pandas 1.2.4, StringDType
%prun -l 20 -s cumulative  df_string.groupby('a')[cols[1:]].agg('last')

320990 function calls (317555 primitive calls) in 5.012 seconds

   Ordered by: cumulative time
   List reduced from 622 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    111/1    0.007    0.000    5.012    5.012 {built-in method builtins.exec}
        1    0.000    0.000    5.011    5.011 generic.py:931(aggregate)
        1    0.000    0.000    5.011    5.011 aggregation.py:549(aggregate)
        1    0.000    0.000    5.011    5.011 base.py:303(_try_aggregate_string_function)
        1    0.000    0.000    5.011    5.011 groupby.py:1705(last)
        1    0.000    0.000    5.011    5.011 groupby.py:1011(_agg_general)
        1    0.000    0.000    5.011    5.011 generic.py:1012(_cython_agg_general)
        1    0.000    0.000    5.010    5.010 generic.py:1020(_cython_agg_blocks)
        2    0.000    0.000    5.010    2.505 managers.py:376(apply)
       10    0.000    0.000    5.009    0.501 blocks.py:372(apply)
       10    0.000    0.000    5.007    0.501 generic.py:1094(blk_func)
       10    0.000    0.000    5.006    0.501 generic.py:1051(py_fallback)
       10    0.000    0.000    5.003    0.500 generic.py:223(aggregate)
       10    0.000    0.000    5.003    0.500 groupby.py:1157(_python_agg_general)
       10    0.155    0.015    4.875    0.488 ops.py:686(agg_series)
       10    0.010    0.001    4.720    0.472 ops.py:735(_aggregate_series_pure_python)
     1010    0.004    0.000    2.963    0.003 ops.py:969(__iter__)
       10    0.000    0.000    2.095    0.209 ops.py:982(_get_sorted_data)
       10    0.000    0.000    2.040    0.204 series.py:791(take)
       11    0.000    0.000    1.942    0.177 _mixins.py:60(take)

For completeness, this is the profiling output with the object str dtype:

# pandas 1.1.5, object str dtype 
%prun -l 20 -s cumulative df_str.groupby('a')[cols[1:]].agg('last')

4428 function calls (4412 primitive calls) in 0.717 seconds

   Ordered by: cumulative time
   List reduced from 356 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.717    0.717 {built-in method builtins.exec}
        1    0.077    0.077    0.717    0.717 <string>:1(<module>)
        1    0.000    0.000    0.638    0.638 generic.py:937(aggregate)
        1    0.000    0.000    0.638    0.638 base.py:281(_aggregate)
        1    0.000    0.000    0.638    0.638 base.py:251(_try_aggregate_string_function)
        1    0.000    0.000    0.638    0.638 groupby.py:1588(last)
        1    0.000    0.000    0.638    0.638 groupby.py:987(_agg_general)
        1    0.000    0.000    0.638    0.638 generic.py:1018(_cython_agg_general)
        1    0.000    0.000    0.636    0.636 generic.py:1026(_cython_agg_blocks)
        1    0.073    0.073    0.506    0.506 ops.py:588(aggregate)
        1    0.000    0.000    0.433    0.433 ops.py:443(_cython_operation)
        1    0.273    0.273    0.273    0.273 ops.py:598(_aggregate)
        4    0.000    0.000    0.131    0.033 algorithms.py:1640(take_nd)
        1    0.000    0.000    0.129    0.129 generic.py:1656(_get_data_to_aggregate)
        1    0.000    0.000    0.129    0.129 base.py:201(_obj_with_exclusions)
        1    0.000    0.000    0.129    0.129 _decorators.py:307(wrapper)
        1    0.000    0.000    0.129    0.129 frame.py:4017(reindex)
        1    0.000    0.000    0.129    0.129 generic.py:4216(reindex)
        1    0.000    0.000    0.129    0.129 frame.py:3871(_reindex_axes)
        1    0.000    0.000    0.129    0.129 frame.py:3908(_reindex_columns)

# pandas 1.2.4, object str dtype
%prun -l 20 -s cumulative df_str.groupby('a')[cols[1:]].agg('last')

         4646 function calls (4618 primitive calls) in 0.711 seconds

   Ordered by: cumulative time
   List reduced from 413 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.711    0.711 {built-in method builtins.exec}
        1    0.072    0.072    0.711    0.711 <string>:1(<module>)
        1    0.000    0.000    0.620    0.620 generic.py:931(aggregate)
        1    0.000    0.000    0.620    0.620 aggregation.py:549(aggregate)
        1    0.000    0.000    0.620    0.620 base.py:303(_try_aggregate_string_function)
        1    0.000    0.000    0.620    0.620 groupby.py:1705(last)
        1    0.000    0.000    0.620    0.620 groupby.py:1011(_agg_general)
        1    0.000    0.000    0.619    0.619 generic.py:1012(_cython_agg_general)
        1    0.000    0.000    0.617    0.617 generic.py:1020(_cython_agg_blocks)
        2    0.000    0.000    0.511    0.256 managers.py:376(apply)
        1    0.000    0.000    0.510    0.510 blocks.py:372(apply)
        1    0.068    0.068    0.510    0.510 generic.py:1094(blk_func)
        1    0.000    0.000    0.442    0.442 ops.py:550(_cython_operation)
        1    0.284    0.284    0.284    0.284 ops.py:664(_aggregate)
        4    0.000    0.000    0.109    0.027 algorithms.py:1661(take_nd)
        1    0.000    0.000    0.107    0.107 generic.py:1603(_get_data_to_aggregate)
        1    0.000    0.000    0.107    0.107 base.py:226(_obj_with_exclusions)
        1    0.000    0.000    0.107    0.107 _decorators.py:310(wrapper)
        1    0.000    0.000    0.107    0.107 frame.py:4157(reindex)
        1    0.000    0.000    0.107    0.107 generic.py:4564(reindex)
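
As an aside, %prun is the IPython front-end for the standard-library cProfile, so the same traces can be collected in a plain script (a sketch, assuming df_string and cols are defined as in the reproduction above):

import cProfile
import pstats

# Profile the aggregation and print the 20 most expensive calls
# ordered by cumulative time, mirroring %prun -l 20 -s cumulative.
cProfile.run("df_string.groupby('a')[cols[1:]].agg('last')", 'groupby.prof')
pstats.Stats('groupby.prof').sort_stats('cumulative').print_stats(20)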


@simonjayhawkins simonjayhawkins added this to the 1.2.5 milestone May 24, 2021
@simonjayhawkins (Member)

Conversely, in pandas 1.2.4 the same groupby-aggregation is 7x slower

[screenshot: benchmark timings]

It appears there is a further slowdown on master for the first case.

@simonjayhawkins (Member)

@pandas-dev/pandas-core moving this to 1.3 (which will probably get moved to 1.3.1), as there are no PRs to fix it and an additional performance regression has been identified for 1.3.

@simonjayhawkins simonjayhawkins modified the milestones: 1.2.5, 1.3 Jun 16, 2021
@simonjayhawkins simonjayhawkins changed the title BUG: groupby performance regression in 1.2.x PERF: groupby performance regression in 1.2.x Jun 25, 2021
@simonjayhawkins simonjayhawkins modified the milestones: 1.3, 1.3.1 Jun 30, 2021
@simonjayhawkins simonjayhawkins modified the milestones: 1.3.1, 1.3.2 Jul 24, 2021
@simonjayhawkins simonjayhawkins modified the milestones: 1.3.2, 1.3.3 Aug 15, 2021
@simonjayhawkins simonjayhawkins modified the milestones: 1.3.3, 1.3.4 Sep 11, 2021
@debnathshoham (Member)

I am not seeing much of a drop in perf between master and 1.1.5 for the object dtype.
Am I missing something @simonjayhawkins?

In [8]: import numpy as np
   ...: import pandas as pd
   ...: pd.__version__
   ...: 
   ...: cols = list('abcdefghjkl')
   ...: df = pd.DataFrame(np.random.randint(0, 100, size=(1_000_000, len(cols))), columns=cols)
   ...: df_str = df.astype(str)
   ...: df_string = df.astype('string')
   ...: 
   ...: %timeit df_str.groupby('a')[cols[1:]].agg('last')
990 ms ± 242 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [9]: pd.__version__
Out[9]: '1.4.0.dev0+722.ge7e7b40722'
In [6]: import numpy as np
   ...: import pandas as pd
   ...: pd.__version__
   ...: 
   ...: cols = list('abcdefghjkl')
   ...: df = pd.DataFrame(np.random.randint(0, 100, size=(1_000_000, len(cols))), columns=cols)
   ...: df_str = df.astype(str)
   ...: df_string = df.astype('string')
   ...: 
   ...: %timeit df_str.groupby('a')[cols[1:]].agg('last')
1.05 s ± 409 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [7]: pd.__version__
Out[7]: '1.1.5'

@simonjayhawkins (Member)

Looks like that got fixed...

%timeit df_str.groupby('a')[cols[1:]].agg('last')
# 769 ms ± 43.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  <-- master (1/6)
# 435 ms ± 19.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) <-- 1.2.4
# 417 ms ± 12.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) <-- 1.1.5

# 316 ms ± 3.96 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) <-- master (21/9)
# 325 ms ± 9.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) <- 1.3.3
# 318 ms ± 5.59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) <- 1.3.2
# 321 ms ± 3.39 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) <- 1.3.1
# 787 ms ± 46.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) <- 1.3.0
