Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG/REG: RollingGroupby MultiIndex levels dropped #38737

Merged
merged 6 commits into from
Dec 29, 2020

Conversation

mroeschke
Copy link
Member

This was accidentally caused by #37661 where I thought #37641 was a bug when it was actually the correct behavior.

Therefore, I had to change the behavior of some prior tests.

@mroeschke mroeschke added this to the 1.2.1 milestone Dec 27, 2020
@mroeschke mroeschke added Bug Window rolling, ewma, expanding labels Dec 27, 2020
@jreback
Copy link
Contributor

jreback commented Dec 28, 2020

looks like

pls check that perf is not regressed

@mroeschke
Copy link
Member Author

For the relevant ASV

$ asv continuous -f 1.1 upstream/master HEAD -b rolling.GroupbyLargeGroups

· Creating environments
· Discovering benchmarks
· Running 2 total benchmarks (2 commits * 1 environments * 1 benchmarks)
[  0.00%] · For pandas commit 9f1a41de <master> (round 1/2):
[  0.00%] ·· Building for conda-py3.8-Cython0.29.21-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt........................................................................
[  0.00%] ·· Benchmarking conda-py3.8-Cython0.29.21-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 25.00%] ··· Running (rolling.GroupbyLargeGroups.time_rolling_multiindex_creation--).
[ 25.00%] · For pandas commit 68beb5a5 <bug/rolling_groupby> (round 1/2):
[ 25.00%] ·· Building for conda-py3.8-Cython0.29.21-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..
[ 25.00%] ·· Benchmarking conda-py3.8-Cython0.29.21-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 50.00%] ··· Running (rolling.GroupbyLargeGroups.time_rolling_multiindex_creation--).
[ 50.00%] · For pandas commit 68beb5a5 <bug/rolling_groupby> (round 2/2):
[ 50.00%] ·· Benchmarking conda-py3.8-Cython0.29.21-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 75.00%] ··· rolling.GroupbyLargeGroups.time_rolling_multiindex_creation                                                     30.3±0.6ms
[ 75.00%] · For pandas commit 9f1a41de <master> (round 2/2):
[ 75.00%] ·· Building for conda-py3.8-Cython0.29.21-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..
[ 75.00%] ·· Benchmarking conda-py3.8-Cython0.29.21-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[100.00%] ··· rolling.GroupbyLargeGroups.time_rolling_multiindex_creation                                                       26.5±2ms

BENCHMARKS NOT SIGNIFICANTLY CHANGED.

("val1", "val1", "val1", "val1"),
("val2", "val2", "val2", "val2"),
],
names=["idx1", "idx2", "idx1", "idx2"],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we sure this is the behaviour we want?
It might be the most consistent, but it's also kind of useless to repeat those index levels .. So we should at least have a way to avoid getting that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say the consistency is more maintainable on our end compared to additionally including logic to de-duplicate index levels given some condition.

I would prefer the user explicitly call droplevel.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm

In [134]: df.groupby(level=0).transform('max')                                                                                                          
Out[134]: 
                Max Speed
Animal Type              
Falcon Captive      390.0
       Wild         390.0
Parrot Captive       30.0
       Wild          30.0

so i think we are doing some magic in groupby for this. These are conceptually similar operations.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before indexers were implemented for groupby().rolling(), this was the result:

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '1.0.5'

In [3]: from pandas import *

In [4]:         arrays = [
   ...:             ["Falcon", "Falcon", "Parrot", "Parrot"],
   ...:             ["Captive", "Wild", "Captive", "Wild"],
   ...:         ]
   ...:         index = MultiIndex.from_arrays(arrays, names=("Animal", "Type"))
   ...:         df = DataFrame({"Max Speed": [390.0, 350.0, 30.0, 20.0]}, index=index)
   ...:         result = df.groupby(level=0)["Max Speed"].rolling(2).sum()

In [5]: result
Out[5]:
Animal  Animal  Type
Falcon  Falcon  Captive      NaN
                Wild       740.0
Parrot  Parrot  Captive      NaN
                Wild        50.0
Name: Max Speed, dtype: float64

which I think we should be trying to match. Though I'm not sure if we have solid conventions of the resulting index when using groupby.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep i see that. ok i think that we should revert for 1.2.x and then decide for 1.3 is prob ok. i am leaning towards have the same as groupby here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay once this is merged in I can create another issue to discuss what the index behavior should be for groupby rolling with duplicate index levels.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When comparing to 1.0.5 behaviour, we also had:

In [5]: s.groupby(["idx1", "idx2"], group_keys=False).rolling(1).mean() 
Out[5]: 
idx1  idx2
val1  val1    1.0
      val1    2.0
val2  val2    3.0
dtype: float64

In [9]: pd.__version__   
Out[9]: '1.0.5'

So this PR then deviates from that for this case.

(I know the influence of group_keys=False has been considered a bug, but we could also reconsider that, since the above seems to actually give the desired result?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The group_keys result different I think is due to the implementation change in groupby().rolling()

Before groupby().rolling() under the hood was groupby().apply(lambda x: x.rolling()...) and therefore group_keys impacted the result (since the argument is only applicable for groupby().apply()).

After groupby().rolling() moved away from using apply, group_keys didn't impact the result.

So IMO, group_keys shouldn't have ever really influenced the result since groupby().apply() was never called directly from the user.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add testing for group_keys in any event?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback
Copy link
Contributor

jreback commented Dec 28, 2020

for future issue (these are various return values now, using 1.0.5)

In [140]: df.reset_index().groupby('Animal').rolling(2).sum()                                                                                           
Out[140]: 
          Max Speed
Animal             
Falcon 0        NaN
       1      740.0
Parrot 2        NaN
       3       50.0

In [141]: df.reset_index(level=0).groupby('Animal').rolling(2).sum()                                                                                    
Out[141]: 
                Max Speed
Animal Type              
Falcon Captive        NaN
       Wild         740.0
Parrot Captive        NaN
       Wild          50.0

In [142]: df.reset_index().groupby('Animal', as_index=False).rolling(2).sum()                                                                           
Out[142]: 
     Max Speed
0 0        NaN
  1      740.0
1 2        NaN
  3       50.0

@jreback jreback merged commit a37f1a4 into pandas-dev:master Dec 29, 2020
@jreback
Copy link
Contributor

jreback commented Dec 29, 2020

thanks @mroeschke pls open an issue for 1.3 to discuss what we should do with this.

@jreback
Copy link
Contributor

jreback commented Dec 29, 2020

@meeseeksdev backport 1.2.x

@jorisvandenbossche
Copy link
Member

Before groupby().rolling() under the hood was groupby().apply(lambda x: x.rolling()...) and therefore group_keys impacted the result (since the argument is only applicable for groupby().apply()).

After groupby().rolling() moved away from using apply, group_keys didn't impact the result.

So IMO, group_keys shouldn't have ever really influenced the result since groupby().apply() was never called directly from the user.

You could also interpret that as "because of how it was implemented, group_keys has always worked for .rolling() in practice (so it basically worked for apply() ànd rolling()), so the fact that it no longer works now is a regression"

I don't think we should hurry in adding this fix to 1.2.x. Let's first properly discuss what behaviour we want (because we could end up going back again to the previous behaviour ..)

@mroeschke mroeschke deleted the bug/rolling_groupby branch December 29, 2020 19:24
@jorisvandenbossche
Copy link
Member

Exploring the return value across the recent pandas versions for one of the cases: https://nbviewer.jupyter.org/gist/jorisvandenbossche/500ff1d082c288c378dc1972e24e6b4e

simonjayhawkins pushed a commit that referenced this pull request Dec 31, 2020
…#38784)

Co-authored-by: Matthew Roeschke <emailformattr@gmail.com>
simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this pull request Jan 15, 2021
jreback pushed a commit that referenced this pull request Jan 16, 2021
meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Jan 16, 2021
jreback pushed a commit that referenced this pull request Jan 16, 2021
luckyvs1 pushed a commit to luckyvs1/pandas that referenced this pull request Jan 20, 2021
Co-authored-by: Simon Hawkins <simonjayhawkins@gmail.com>
luckyvs1 pushed a commit to luckyvs1/pandas that referenced this pull request Jan 20, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Window rolling, ewma, expanding
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: MultiIndex RollingGroupby returns only one level of index
4 participants