BUG/REG: RollingGroupby MultiIndex levels dropped #38737

mroeschke · 2020-12-27T23:44:21Z

closes BUG: MultiIndex RollingGroupby returns only one level of index #38523
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

This was accidentally caused by #37661 where I thought #37641 was a bug when it was actually the correct behavior.

Therefore, I had to change the behavior of some prior tests.

jreback · 2020-12-28T01:03:15Z

looks like

pls check that perf is not regressed

mroeschke · 2020-12-28T07:02:12Z

For the relevant ASV

$ asv continuous -f 1.1 upstream/master HEAD -b rolling.GroupbyLargeGroups

· Creating environments
· Discovering benchmarks
· Running 2 total benchmarks (2 commits * 1 environments * 1 benchmarks)
[  0.00%] · For pandas commit 9f1a41de <master> (round 1/2):
[  0.00%] ·· Building for conda-py3.8-Cython0.29.21-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt........................................................................
[  0.00%] ·· Benchmarking conda-py3.8-Cython0.29.21-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 25.00%] ··· Running (rolling.GroupbyLargeGroups.time_rolling_multiindex_creation--).
[ 25.00%] · For pandas commit 68beb5a5 <bug/rolling_groupby> (round 1/2):
[ 25.00%] ·· Building for conda-py3.8-Cython0.29.21-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..
[ 25.00%] ·· Benchmarking conda-py3.8-Cython0.29.21-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 50.00%] ··· Running (rolling.GroupbyLargeGroups.time_rolling_multiindex_creation--).
[ 50.00%] · For pandas commit 68beb5a5 <bug/rolling_groupby> (round 2/2):
[ 50.00%] ·· Benchmarking conda-py3.8-Cython0.29.21-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 75.00%] ··· rolling.GroupbyLargeGroups.time_rolling_multiindex_creation                                                     30.3±0.6ms
[ 75.00%] · For pandas commit 9f1a41de <master> (round 2/2):
[ 75.00%] ·· Building for conda-py3.8-Cython0.29.21-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..
[ 75.00%] ·· Benchmarking conda-py3.8-Cython0.29.21-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[100.00%] ··· rolling.GroupbyLargeGroups.time_rolling_multiindex_creation                                                       26.5±2ms

BENCHMARKS NOT SIGNIFICANTLY CHANGED.
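
For a quick local sanity check outside ASV, something along these lines can time the same kind of operation. This is only a sketch: the frame shape, the column names `A` and `B`, the window size, and the helper `time_groupby_rolling` are assumptions made for illustration, not the definition of the `rolling.GroupbyLargeGroups` benchmark.

```python
import timeit

import numpy as np
import pandas as pd


def time_groupby_rolling(n_rows=20_000, repeat=3):
    """Return the best wall-clock time (seconds) of one groupby().rolling().mean() call."""
    # Hypothetical data: two large groups and a single float column.
    df = pd.DataFrame(
        {"A": np.tile([1, 2], n_rows // 2), "B": np.random.randn(n_rows)}
    )
    timer = timeit.Timer(lambda: df.groupby("A").rolling(3).mean())
    # Take the minimum over several runs to reduce noise.
    return min(timer.repeat(repeat=repeat, number=1))


best = time_groupby_rolling()
print(f"best of 3: {best * 1e3:.1f} ms")
```

Running this on both branches gives only a rough comparison; `asv continuous` remains the reference because it controls the build environment and repeats the measurement.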

jorisvandenbossche · 2020-12-28T10:07:07Z

pandas/tests/window/test_groupby.py

+                    ("val1", "val1", "val1", "val1"),
+                    ("val2", "val2", "val2", "val2"),
+                ],
+                names=["idx1", "idx2", "idx1", "idx2"],


Are we sure this is the behaviour we want?
It might be the most consistent, but it's also kind of useless to repeat those index levels .. So we should at least have a way to avoid getting that?

I would say the consistency is more maintainable on our end compared to additionally including logic to de-duplicate index levels given some condition.

I would prefer the user explicitly call droplevel.
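
A minimal sketch of what that explicit `droplevel` call could look like, using the Series from the test above. The variable names `n_extra` and `deduped` are illustrative, and the number of prepended levels is computed rather than hard-coded because it depends on the pandas version in use.

```python
import pandas as pd

index = pd.MultiIndex.from_arrays(
    [["val1", "val1", "val2"], ["val1", "val1", "val2"]],
    names=("idx1", "idx2"),
)
s = pd.Series([1, 2, 3], index=index)

result = s.groupby(["idx1", "idx2"]).rolling(1).mean()

# Number of leading group-key levels prepended to the original index.
n_extra = result.index.nlevels - s.index.nlevels

# Drop by position: the level names are duplicated, so dropping by name
# would be ambiguous.
deduped = result.droplevel(list(range(n_extra))) if n_extra > 0 else result
print(deduped)
```

Dropping by position keeps the call unambiguous when the group keys and the original index levels share names.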

hmm

In [134]: df.groupby(level=0).transform('max')
Out[134]:
                Max Speed
Animal Type
Falcon Captive      390.0
       Wild         390.0
Parrot Captive       30.0
       Wild          30.0

so i think we are doing some magic in groupby for this. These are conceptually similar operations.
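
To see the two operations side by side, a sketch like the following prints the index level names each one returns. The frame is the Falcon/Parrot example used in this thread. The names printed for the rolling result depend on the pandas version, so none are asserted here.

```python
import pandas as pd

index = pd.MultiIndex.from_arrays(
    [
        ["Falcon", "Falcon", "Parrot", "Parrot"],
        ["Captive", "Wild", "Captive", "Wild"],
    ],
    names=("Animal", "Type"),
)
df = pd.DataFrame({"Max Speed": [390.0, 350.0, 30.0, 20.0]}, index=index)

transformed = df.groupby(level=0).transform("max")
rolled = df.groupby(level=0)["Max Speed"].rolling(2).sum()

# transform returns a result aligned to the input index.
print(transformed.index.names)
# The rolling result may carry extra group-key levels, depending on version.
print(rolled.index.names)
```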

Before indexers were implemented for groupby().rolling(), this was the result:

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '1.0.5'

In [3]: from pandas import *

In [4]: arrays = [
   ...:     ["Falcon", "Falcon", "Parrot", "Parrot"],
   ...:     ["Captive", "Wild", "Captive", "Wild"],
   ...: ]
   ...: index = MultiIndex.from_arrays(arrays, names=("Animal", "Type"))
   ...: df = DataFrame({"Max Speed": [390.0, 350.0, 30.0, 20.0]}, index=index)
   ...: result = df.groupby(level=0)["Max Speed"].rolling(2).sum()

In [5]: result
Out[5]:
Animal  Animal  Type
Falcon  Falcon  Captive      NaN
                Wild       740.0
Parrot  Parrot  Captive      NaN
                Wild        50.0
Name: Max Speed, dtype: float64

which I think we should be trying to match. Though I'm not sure if we have solid conventions of the resulting index when using groupby.

yep i see that. ok, i think we should revert for 1.2.x and then decide for 1.3; that is prob ok. i am leaning towards having the same behavior as groupby here.

Okay once this is merged in I can create another issue to discuss what the index behavior should be for groupby rolling with duplicate index levels.

When comparing to 1.0.5 behaviour, we also had:

In [5]: s.groupby(["idx1", "idx2"], group_keys=False).rolling(1).mean()
Out[5]:
idx1  idx2
val1  val1    1.0
      val1    2.0
val2  val2    3.0
dtype: float64

In [9]: pd.__version__
Out[9]: '1.0.5'

So this PR then deviates from that for this case.

(I know the influence of group_keys=False has been considered a bug, but we could also reconsider that, since the above seems to actually give the desired result?)

The difference in the group_keys result is, I think, due to the implementation change in groupby().rolling().

Before groupby().rolling() under the hood was groupby().apply(lambda x: x.rolling()...) and therefore group_keys impacted the result (since the argument is only applicable for groupby().apply()).

After groupby().rolling() moved away from using apply, group_keys didn't impact the result.

So IMO, group_keys shouldn't have ever really influenced the result since groupby().apply() was never called directly from the user.
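
The relationship between the two code paths described above can be sketched as follows: the apply-based spelling and the direct `groupby().rolling()` call are expected to compute the same values, while the resulting index may differ. The variable names are illustrative, and only the values are compared because the index layout varies across pandas versions.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_arrays(
    [
        ["Falcon", "Falcon", "Parrot", "Parrot"],
        ["Captive", "Wild", "Captive", "Wild"],
    ],
    names=("Animal", "Type"),
)
df = pd.DataFrame({"Max Speed": [390.0, 350.0, 30.0, 20.0]}, index=index)
grouped = df.groupby(level=0)["Max Speed"]

# Apply-based spelling: this path is where group_keys is consulted.
via_apply = grouped.apply(lambda x: x.rolling(2).sum())
# Direct spelling.
direct = grouped.rolling(2).sum()

# Compare values only; NaNs in the same positions count as equal.
same_values = np.allclose(via_apply.to_numpy(), direct.to_numpy(), equal_nan=True)
print(same_values)
print(via_apply.index.names, direct.index.names)
```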

can you add testing for group_keys in any event?

Yes that's parameterized here already: https://github.com/pandas-dev/pandas/pull/38737/files#diff-e338c43cbd06b849f3d6a1b97cd787a48770c616b627f33eb20f67a6fc56b116R559

jreback · 2020-12-28T21:10:42Z

for future issue (these are various return values now, using 1.0.5)

In [140]: df.reset_index().groupby('Animal').rolling(2).sum()                                                                                           
Out[140]: 
          Max Speed
Animal             
Falcon 0        NaN
       1      740.0
Parrot 2        NaN
       3       50.0

In [141]: df.reset_index(level=0).groupby('Animal').rolling(2).sum()                                                                                    
Out[141]: 
                Max Speed
Animal Type              
Falcon Captive        NaN
       Wild         740.0
Parrot Captive        NaN
       Wild          50.0

In [142]: df.reset_index().groupby('Animal', as_index=False).rolling(2).sum()                                                                           
Out[142]: 
     Max Speed
0 0        NaN
  1      740.0
1 2        NaN
  3       50.0

jreback · 2020-12-29T18:56:29Z

thanks @mroeschke pls open an issue for 1.3 to discuss what we should do with this.

jreback · 2020-12-29T18:56:40Z

@meeseeksdev backport 1.2.x

jorisvandenbossche · 2020-12-29T19:00:26Z

Before groupby().rolling() under the hood was groupby().apply(lambda x: x.rolling()...) and therefore group_keys impacted the result (since the argument is only applicable for groupby().apply()).

After groupby().rolling() moved away from using apply, group_keys didn't impact the result.

So IMO, group_keys shouldn't have ever really influenced the result since groupby().apply() was never called directly from the user.

You could also interpret that as "because of how it was implemented, group_keys has always worked for .rolling() in practice (so it basically worked for apply() and rolling()), so the fact that it no longer works now is a regression"

I don't think we should hurry in adding this fix to 1.2.x. Let's first properly discuss what behaviour we want (because we could end up going back again to the previous behaviour ..)

jorisvandenbossche · 2020-12-29T19:35:36Z

Exploring the return value across the recent pandas versions for one of the cases: https://nbviewer.jupyter.org/gist/jorisvandenbossche/500ff1d082c288c378dc1972e24e6b4e

BUG/REG: RollingGroupby MultiIndex levels dropped

873317f

mroeschke added this to the 1.2.1 milestone Dec 27, 2020

mroeschke added Bug Window rolling, ewma, expanding labels Dec 27, 2020

Merge remote-tracking branch 'upstream/master' into bug/rolling_groupby

1c1c96c

Merge remote-tracking branch 'upstream/master' into bug/rolling_groupby

68beb5a

jorisvandenbossche reviewed Dec 28, 2020

Merge remote-tracking branch 'upstream/master' into bug/rolling_groupby

ac8612c

mroeschke and others added 2 commits December 28, 2020 18:33

Merge remote-tracking branch 'upstream/master' into bug/rolling_groupby

b0f3936

Merge branch 'master' into bug/rolling_groupby

de68fa6

jreback merged commit a37f1a4 into pandas-dev:master Dec 29, 2020

meeseeksmachine mentioned this pull request Dec 29, 2020

Backport PR #38737 on branch 1.2.x (BUG/REG: RollingGroupby MultiIndex levels dropped) #38784

Merged

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Dec 29, 2020

Backport PR pandas-dev#38737: BUG/REG: RollingGroupby MultiIndex leve…

d3ad193

…ls dropped

mroeschke deleted the bug/rolling_groupby branch December 29, 2020 19:24

simonjayhawkins pushed a commit that referenced this pull request Dec 31, 2020

Backport PR #38737: BUG/REG: RollingGroupby MultiIndex levels dropped (…

c70668c

…#38784) Co-authored-by: Matthew Roeschke <emailformattr@gmail.com>

venaturum mentioned this pull request Jan 1, 2021

BUG: rolling does not accept MultiIndex name #38877

Closed

3 tasks

simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this pull request Jan 15, 2021

Revert "BUG/REG: RollingGroupby MultiIndex levels dropped (pandas-dev…

924d75f

…#38737)" This reverts commit a37f1a4.

simonjayhawkins mentioned this pull request Jan 15, 2021

Revert "BUG/REG: RollingGroupby MultiIndex levels dropped (#38737)" #39191

Merged

4 tasks

jreback pushed a commit that referenced this pull request Jan 16, 2021

Revert "BUG/REG: RollingGroupby MultiIndex levels dropped (#38737)" (#…

9d13997

…39191) This reverts commit a37f1a4.

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Jan 16, 2021

Backport PR pandas-dev#39191: Revert "BUG/REG: RollingGroupby MultiIn…

34177bc

…dex levels dropped (pandas-dev#38737)"

meeseeksmachine mentioned this pull request Jan 16, 2021

Backport PR #39191 on branch 1.2.x (Revert "BUG/REG: RollingGroupby MultiIndex levels dropped (#38737)") #39198

Merged

jreback pushed a commit that referenced this pull request Jan 16, 2021

Backport PR #39191: Revert "BUG/REG: RollingGroupby MultiIndex levels…

dd353a1

… dropped (#38737)" (#39198)

simonjayhawkins mentioned this pull request Jan 16, 2021

BUG: MultiIndex RollingGroupby returns only one level of index #38523

Closed

3 tasks

luckyvs1 pushed a commit to luckyvs1/pandas that referenced this pull request Jan 20, 2021

BUG/REG: RollingGroupby MultiIndex levels dropped (pandas-dev#38737)

b52294e

Co-authored-by: Simon Hawkins <simonjayhawkins@gmail.com>

luckyvs1 pushed a commit to luckyvs1/pandas that referenced this pull request Jan 20, 2021

Revert "BUG/REG: RollingGroupby MultiIndex levels dropped (pandas-dev…

1778605

…#38737)" (pandas-dev#39191) This reverts commit a37f1a4.

mroeschke mentioned this pull request Mar 31, 2021

BUG: RollingGroupby MultiIndex levels dropped #40701

Merged

5 tasks
