PERF: groupby with string dtype #43634

debnathshoham · 2021-09-18T09:24:34Z

closes PERF: groupby performance regression in 1.2.x #41596
tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry

In [1]: import numpy as np
   ...: import pandas as pd
   ...: pd.__version__
   ...:
   ...: cols = list('abcdefghjkl')
   ...: df = pd.DataFrame(np.random.randint(0, 100, size=(100, len(cols))), columns=cols)
   ...: df_str = df.astype(str)
   ...: df_string = df.astype('string')

In [2]: %timeit df_str.groupby('a')[cols[1:]].last()
2.88 ms ± 70.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [3]: %timeit df_string.groupby('a')[cols[1:]].last()
3.27 ms ± 12.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # this PR
54.4 ms ± 244 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)    # master

simonjayhawkins · 2021-09-18T10:45:18Z

Does this also fix the case mentioned in #41596 (comment)? If so, can you also post the timings for df_str.groupby('a')[cols[1:]].last() on master and with this PR?

debnathshoham · 2021-09-18T10:58:24Z

Hi @simonjayhawkins, no, this PR doesn't address the performance slowdown for the object dtype.
It only ensures that the string dtype takes the Cython path (as the object dtype already does).

Also note that I changed the size of the df in the top comment relative to the example in the bug report.
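
As a rough illustration of what taking the Cython path means here: the string-dtype values are unwrapped to an object ndarray, aggregated like any object column, and the result is cast back to the string dtype. The sketch below uses only public pandas API. `last_per_group` is a hypothetical helper written for this example and is not the code in pandas/core/groupby/ops.py.

```python
import numpy as np
import pandas as pd


def last_per_group(keys: pd.Series, values: pd.Series) -> pd.Series:
    """Hypothetical helper: per-group 'last' for a string-dtype Series."""
    # 1. Unwrap the extension array into a plain object ndarray,
    #    using NaN for missing values.
    npvalues = values.to_numpy(dtype=object, na_value=np.nan)
    # 2. Aggregate the object data with the ordinary groupby machinery.
    as_object = pd.Series(npvalues, index=values.index, dtype=object)
    result = as_object.groupby(keys).last()
    # 3. Cast the result back to the original extension dtype.
    return result.astype(values.dtype)


keys = pd.Series(["x", "y", "x", "y"])
vals = pd.Series(["1", "2", "3", pd.NA], dtype="string")
out = last_per_group(keys, vals)
```

Here `out` keeps the string dtype and holds "3" for group x and "2" for group y, because `last` skips missing values.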

simonjayhawkins · 2021-09-18T11:15:51Z

That's okay, but then we can't close that issue. We will need to either open a new issue specifically for that case (the regression between 1.2.5 and 1.3) or change the OP to read "xref" instead of "closes".

Out of curiosity, did you identify which commit caused the slowdown between 1.1.5 and 1.2?

debnathshoham · 2021-09-18T12:05:16Z

Updated the top comment.
Honestly, I didn't try to find the commit. This fix was suggested in a comment in the bug report.

simonjayhawkins · 2021-09-18T12:17:17Z

So why do we need to special-case a specific EA dtype? (Maybe this was done before.) In general, EA handling should be more generic.

debnathshoham · 2021-09-18T12:27:17Z

I think the current implementation of _ea_wrap_cython_operation works that way, no?

I can see a TODO for a more generic approach.

simonjayhawkins · 2021-09-18T12:30:08Z

Yes, it appears that there is intent to make _ea_wrap_cython_operation more generic. I'm just wondering why it worked in 1.1.5.

pandas/core/groupby/ops.py

jreback · 2021-09-20T12:57:24Z

@@ -348,6 +349,9 @@ def _ea_wrap_cython_operation(
        elif isinstance(values.dtype, FloatingDtype):
            # FloatingArray
            npvalues = values.to_numpy(values.dtype.numpy_dtype, na_value=np.nan)
+        elif isinstance(values.dtype, StringDtype) and self.how in ["last", "first"]:


Are there other functions (e.g. .sum) as well, or should all string functions just hit this path? Do we have sufficient asvs to cover this?

debnathshoham · 2021-09-21

Yes, other functions do hit this path.
asv coverage is a little spotty; I was not able to find anything with groupby and StringDtype.

jreback · 2021-09-21

"yes, other functions do hit this path."

OK, which ones? What if you remove the self.how entirely?

jreback · 2021-09-21

I don't really like the self.how here; this is way too specific.
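
One way to approach the "which ones?" question is to run several reductions on both dtypes and confirm they agree, which suggests the conversion step does not need to depend on the method name. The sketch below is only an illustration: the method list is an assumption (the thread does not enumerate the affected functions), and the frame is much smaller than the one in the top comment.

```python
import numpy as np
import pandas as pd

cols = list("abc")
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 10, size=(50, len(cols))), columns=cols)
df_obj = df.astype(str).astype(object)  # object-dtype strings
df_string = df.astype("string")         # StringDtype extension array

# Illustrative list only; not taken from the PR.
methods = ["first", "last", "min", "max"]

results = {}
for how in methods:
    res_obj = getattr(df_obj.groupby("a")[cols[1:]], how)()
    res_string = getattr(df_string.groupby("a")[cols[1:]], how)()
    # Compare the values as plain Python objects so only the data matters,
    # not the dtype of the result or of the group index.
    results[how] = np.array_equal(
        res_string.to_numpy(dtype=object), res_obj.to_numpy(dtype=object)
    )
```

Each entry in `results` records whether the string-dtype result matches the object-dtype result for that method. Wrapping the two `getattr(...)()` calls in `%timeit` gives the per-method timings.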

simonjayhawkins · 2021-09-22T11:28:08Z

Can you merge master to fix the doc build (#43688)?

simonjayhawkins · 2021-09-22T11:30:23Z

"That's okay, but then we can't close that issue. We will need to either open a new issue specifically for that case (the regression between 1.2.5 and 1.3) or change the OP to read "xref" instead of "closes"."

Changed back to "closes"; xref #41596 (comment).

jreback

Can you add asvs that hit this case?

pandas/core/groupby/ops.py

jreback · 2021-09-22T20:39:09Z

groupby.String.time_str_func: setup: wrong number of arguments (for <bound method String.setup of <benchmarks.groupby.String object at 0x7f6c8c634a30>> in groupby.py:626): expected 2, has 1

This is the failure in the asvs.
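
For context, asv passes one argument per entry in `params` to `setup` and to every `time_*` method, so a benchmark with two parameter lists needs `setup(self, dtype, method)`. The error above is consistent with a `setup` that accepts only one of them. Below is a sketch of a benchmark with matching signatures. The class and method names come from the error message; the parameter names, the dtype and method lists, and the frame size are assumptions for illustration and are not taken from the PR's diff.

```python
import numpy as np
import pandas as pd


class String:
    # Two parameter lists, so asv calls setup/time_* with two arguments.
    param_names = ["dtype", "method"]
    params = [
        ["object", "string"],
        ["first", "last", "min", "max"],
    ]

    def setup(self, dtype, method):
        # `method` is unused here but must still be accepted.
        cols = list("abcdefghjkl")
        ints = np.random.randint(0, 100, size=(10_000, len(cols)))
        self.cols = cols
        # Stringify first, then cast to the dtype under test.
        self.df = pd.DataFrame(ints, columns=cols).astype(str).astype(dtype)

    def time_str_func(self, dtype, method):
        self.df.groupby("a")[self.cols[1:]].agg(method)
```

Outside asv, the class can be exercised by hand, for example `b = String(); b.setup("string", "last"); b.time_str_func("string", "last")`.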

jreback · 2021-09-22T20:39:34Z

Also merge master, as the deprecation warnings in the asvs are silenced now.

jreback · 2021-09-23T12:19:16Z

It seems some CI checks are still failing. Also make sure you merge upstream/master, as the patch for some of the deprecation warnings in the asvs was not pulled in.

jreback · 2021-09-23T16:22:32Z

thanks @debnathshoham very nice!

jreback · 2021-09-23T16:22:41Z

@meeseeksdev backport 1.3.x

jreback · 2021-09-23T17:39:19Z

@debnathshoham if you wouldn't mind following the procedure above to backport

debnathshoham added 3 commits September 18, 2021 14:51

PERF: groupby with string dtype

248a3f4

Merge branch 'master' into gh41596

b0073a2

included .GroupBy.first

a0ca3a3

simonjayhawkins added this to the 1.3.4 milestone Sep 18, 2021

simonjayhawkins added the Groupby, Performance (Memory or execution speed performance), Regression (Functionality that used to work in a prior pandas version), and Strings (String extension data type and string data) labels Sep 18, 2021

jreback requested changes Sep 20, 2021

debnathshoham mentioned this pull request Sep 21, 2021

ENH: generic implementation of ea_wrap_cython_operation #43682

Closed

debnathshoham added 2 commits September 21, 2021 22:45

Merge branch 'master' into gh41596

c8225e6

remove self.how for StringDtype

200356c

debnathshoham added 2 commits September 22, 2021 17:42

Merge branch 'master' into gh41596

0e231dd

Merge branch 'gh41596' of https://github.com/debnathshoham/pandas int…

d265089

…o gh41596

jreback requested changes Sep 22, 2021

debnathshoham added 2 commits September 23, 2021 00:36

asv for string groupby

ddf4c7a

all str dtypes

4f428a9

debnathshoham added 2 commits September 23, 2021 12:28

added all params n asv funcs

3d5b2ba

Merge branch 'master' into gh41596

3cdf21b

debnathshoham added 2 commits September 23, 2021 19:53

removed pyarrow; doesn't have minmax

bee0a67

Merge branch 'master' into gh41596

bddddcc

jreback approved these changes Sep 23, 2021

jreback merged commit b3e9ae7 into pandas-dev:master Sep 23, 2021

lumberbot-app bot added the Still Needs Manual Backport label Sep 23, 2021

debnathshoham deleted the gh41596 branch September 23, 2021 16:23

debnathshoham added a commit to debnathshoham/pandas that referenced this pull request Sep 23, 2021

Backport PR pandas-dev#43634: PERF: groupby with string dtype

7103dd9

debnathshoham mentioned this pull request Sep 23, 2021

Backport PR #43634 on branch 1.3.x (PERF: groupby with string dtype)" #43724

Merged

simonjayhawkins pushed a commit that referenced this pull request Sep 24, 2021

Backport PR #43634 on branch 1.3.x (PERF: groupby with string dtype)" (…

7b6a5c4

…#43724)

simonjayhawkins removed the Still Needs Manual Backport label Sep 24, 2021

jorisvandenbossche mentioned this pull request Nov 15, 2021

CI: benchmark build is taking a long time #44450

Open
