Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: groupby with string dtype #43634

Merged
merged 13 commits into from
Sep 23, 2021
Merged

Conversation

debnathshoham
Copy link
Member

@debnathshoham debnathshoham commented Sep 18, 2021

In [1]: import numpy as np
   ...:    ...: import pandas as pd
   ...:    ...: pd.__version__
   ...:    ...:
   ...:    ...: cols = list('abcdefghjkl')
   ...:    ...: df = pd.DataFrame(np.random.randint(0, 100, size=(100, len(cols))), columns=cols)
   ...:    ...: df_str = df.astype(str)
   ...:    ...: df_string = df.astype('string')

In [2]: %timeit df_str.groupby('a')[cols[1:]].last()
2.88 ms ± 70.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [3]: %timeit df_string.groupby('a')[cols[1:]].last()
3.27 ms ± 12.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  #PR
54.4 ms ± 244 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)    #master

@simonjayhawkins simonjayhawkins added this to the 1.3.4 milestone Sep 18, 2021
@simonjayhawkins simonjayhawkins added Groupby Performance Memory or execution speed performance Regression Functionality that used to work in a prior pandas version Strings String extension data type and string data labels Sep 18, 2021
@simonjayhawkins
Copy link
Member

does this also fix the case mentioned in #41596 (comment)? if so can you post the timings for that too for df_str.groupby('a')[cols[1:]].last() on master and with this PR

@debnathshoham
Copy link
Member Author

Hi @simonjayhawkins - no this PR doesn't address the performance slowdown in the object dtype.
It only ensures that string takes the cython-path (similar to the object dtype).

Also note, that I have changed the size of the df in the top comment, wrt the example in the bug report.

@simonjayhawkins
Copy link
Member

That's okay, but we can't close that issue. will need to open a new one specifically for that case, (regression between 1.2.5 and 1.3) or change the OP to read "xref" instead of "closes"

out of curiosity, did you identify which commit caused the slowdown between 1.1.5 and 1.2?

@debnathshoham
Copy link
Member Author

Updated the top comment.
Honestly, I didn't try to find the commit. This fix was suggested in a comment in the bug report.

@simonjayhawkins
Copy link
Member

So why do we need to special case an specific EA dtype? (Maybe this was done before). In general EA handling should be more generic.

@debnathshoham
Copy link
Member Author

I think the current implementation of _ea_wrap_cython_operation works that way, no?

I can see a TODO for a more generic approach.

@simonjayhawkins
Copy link
Member

yes, it appears that there is intent to make _ea_wrap_cython_operation more generic. I'm just wondering why it worked in 1.1.5

pandas/core/groupby/ops.py Show resolved Hide resolved
@@ -348,6 +349,9 @@ def _ea_wrap_cython_operation(
elif isinstance(values.dtype, FloatingDtype):
# FloatingArray
npvalues = values.to_numpy(values.dtype.numpy_dtype, na_value=np.nan)
elif isinstance(values.dtype, StringDtype) and self.how in ["last", "first"]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are there other functions (e.g. .sum) as well (or just have all string functions hit this path)? do we have sufficient asv's to cover this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, other functions do hit this path.
asv coverage is little spotty, i was not able to find anything with groupby and StringDtype

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, other functions do hit this path.

ok which ones? what if you remove the self.how entirely

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't really like the self.how here, this is way too specific.

@simonjayhawkins
Copy link
Member

can you merge master to fix doc build #43688

@simonjayhawkins
Copy link
Member

That's okay, but we can't close that issue. will need to open a new one specifically for that case, (regression between 1.2.5 and 1.3) or change the OP to read "xref" instead of "closes"

changed back to "closes" xref #41596 (comment)

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add asvs that hit this case

pandas/core/groupby/ops.py Show resolved Hide resolved
@jreback
Copy link
Contributor

jreback commented Sep 22, 2021

groupby.String.time_str_func: setup: wrong number of arguments (for <bound method String.setup of <benchmarks.groupby.String object at 0x7f6c8c634a30>> in groupby.py:626): expected 2, has 1

in the asvs

@jreback
Copy link
Contributor

jreback commented Sep 22, 2021

also merge master as the deprecation warnings on asv's are silenced now

@jreback
Copy link
Contributor

jreback commented Sep 23, 2021

seems still some CI / checks are failing, also make sure you merge upstream/master as the patch for some of the deprecation warnings in asv's was not pulled in.

@jreback jreback merged commit b3e9ae7 into pandas-dev:master Sep 23, 2021
@jreback
Copy link
Contributor

jreback commented Sep 23, 2021

thanks @debnathshoham very nice!

@lumberbot-app

This comment has been minimized.

@jreback
Copy link
Contributor

jreback commented Sep 23, 2021

@meeseeksdev backport 1.3.x

@lumberbot-app

This comment has been minimized.

@debnathshoham debnathshoham deleted the gh41596 branch September 23, 2021 16:23
@jreback
Copy link
Contributor

jreback commented Sep 23, 2021

@debnathshoham if you wouldn't mind following the procedure above to backport

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Groupby Performance Memory or execution speed performance Regression Functionality that used to work in a prior pandas version Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PERF: groupby performance regression in 1.2.x
3 participants