BUG: freq set to NONE when resampling pd.Multiindex (introduced in v1.1.0) #35563

andreas-vester · 2020-08-05T09:08:42Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

Update/Summary

When dealing with a pd.multiindex the frequency df.loc[i].index.freq has not been set. This behavior has been introduced in v1.1.0

idx = pd.Index(range(2), name="A")
dti = pd.date_range("2020-01-01", periods=7, freq="D", name="B")
mi = pd.MultiIndex.from_product([idx, dti])

df = pd.DataFrame(np.random.randn(14, 2), index=mi)

>>> df.loc[0].index

df.loc[0].index matches dti, so it would be nice to get the freq="D" back on that.

using code sample in #35563 (comment) points to #31315

c988567 is the first bad commit
commit c988567
Author: jbrockmendel jbrockmendel@gmail.com
Date: Sun Feb 9 07:01:20 2020 -0800
REF: tighten what we accept in TimedeltaIndex._simple_new (#31315)

Original post

Code Sample, a copy-pastable example

import numpy as np
import pandas as pd

idx_level_0 = np.repeat([1, 2], 5)
dates = np.tile(
    ["2020-01-01", "2020-01-02", "2020-01-04", "2020-01-06", "2020-01-07"], 2
)
values1 = [1, 2, 3, 4, 5]
values2 = [6, 7, 8, 9, 10]

df = pd.DataFrame(
    {"idx_level_0": idx_level_0, "dates": dates, "values": [*values1, *values2]}
)
df["dates"] = pd.to_datetime(df["dates"])
df = df.set_index(["idx_level_0", "dates"], drop=True)

df = df.groupby("idx_level_0").resample("D", level="dates").last()

# The following assertion is working properly in pandas v1.0.5
# It throws an error in pandas v1.1.0
assert df.index.get_level_values(1).freq == "D"

Problem description

When resampling a groupby-object, the frequency will incorrectly be set to None.

Expected Output

The frequency should be set according to the resampled frequency.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : d9fff27
python : 3.8.2.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : de_DE.cp1252
pandas : 1.1.0
numpy : 1.18.4
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 45.3.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.5.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.2.2
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.5.0
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

The text was updated successfully, but these errors were encountered:

TomAugspurger · 2020-08-05T11:22:42Z

I think the output on pandas 1.1.0 (no freq) is correct. This can't have a freq since the values are repeated.

In [14]: df.index.get_level_values(1)
Out[14]:
DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06', '2020-01-07', '2020-01-01',
               '2020-01-02', '2020-01-03', '2020-01-04', '2020-01-05',
               '2020-01-06', '2020-01-07'],
              dtype='datetime64[ns]', name='dates', freq=None)

OTOH, df.index.levels[1].freq is correctly <Day>.

andreas-vester · 2020-08-05T13:36:23Z

Well, you're correct in the sense that the values are repeated if you take a look at the whole secondary index. However, they are not repeated when taking into account the multiindex. The freq is D for every group with respect to the first index level.

Pandas v1.0.5 treated this differently (freq is D). I would prefer to see the previous behavior. This (intended) change in pandas breaks code on my existing projects.

TomAugspurger · 2020-08-05T18:19:47Z

I think the previous behavior is incorrect, since the values don't represent a timeseries with daily frequency. (cc @jbrockmendel for a second opinion).

andreas-vester · 2020-08-05T18:43:31Z

I disagree. Here's my real world example:

Consider index level 0 to be equity tickers, index level 1 are trading dates (days, weeks, months, etc.), and the values are closing prices. That is, you have multiple time series below each other. This is a tidy (long) dataframe. Applying the resample method to each groupby object should yield a specific frequency.

I am actually wondering on what grounds the different behavior has been introduced into v1.1.0?

TomAugspurger · 2020-08-06T18:05:42Z

@Kraxelhuber all your points correctly describe why .index.levels[1].freq is Day. But that doesn't apply to the index returned by .get_level_values() which doesn't have a well-defined frequency (not all the elements are equally spaced by some value).

I am actually wondering on what grounds the different behavior has been introduced into v1.1.0?

There were several PRs fixing cases where freq was incorrectly propagated (most or all by @jbrockmendel IIRC), e.g. #30511.

jbrockmendel · 2020-08-06T23:53:36Z

@TomAugspurger is right about the distinction between get_level_values().freq and levels[1].freq.

That said, I think there is a behavior that can be improved:

idx = pd.Index(range(2), name="A")
dti = pd.date_range("2020-01-01", periods=7, freq="D", name="B")
mi = pd.MultiIndex.from_product([idx, dti])

df = pd.DataFrame(np.random.randn(14, 2), index=mi)

>>> df.loc[0].index

df.loc[0].index matches dti, so it would be nice to get the freq="D" back on that.

simonjayhawkins · 2020-08-07T08:45:16Z

df.loc[0].index matches dti, so it would be nice to get the freq="D" back on that.

can confirm that was the behaviour in 1.0.5

>>> pd.__version__
'1.0.5'
>>>
>>> idx = pd.Index(range(2), name="A")
>>> dti = pd.date_range("2020-01-01", periods=7, freq="D", name="B")
>>> mi = pd.MultiIndex.from_product([idx, dti])
>>>
>>> df = pd.DataFrame(np.random.randn(14, 2), index=mi)
>>>
>>> df.loc[0].index
DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06', '2020-01-07'],
              dtype='datetime64[ns]', name='B', freq='D')
>>>

andreas-vester · 2020-08-07T09:34:18Z

@Kraxelhuber all your points correctly describe why .index.levels[1].freq is Day. But that doesn't apply to the index returned by .get_level_values() which doesn't have a well-defined frequency (not all the elements are equally spaced by some value).

I again reflected on your reasoning regarding .get_level_values(). I agree with you.

Anyway, what I am really after is the fact that every group (in terms of a multiindex) can have its own frequency. Actually, these frequencies may be different or not existent for each individual group. I think this is related to the point made by @jbrockmendel

import numpy as np
import pandas as pd

idx_level_0 = np.repeat([1, 2], 5)
dates = [
    "2020-01-01",
    "2020-01-02",
    "2020-01-03",
    "2020-01-04",
    "2020-01-05",
    "2020-03-01",
    "2020-03-08",
    "2020-03-15",
    "2020-03-22",
    "2020-03-29",
]
values1 = [1, 2, 3, 4, 5]
values2 = [6, 7, 8, 9, 10]

df = pd.DataFrame(
    {"idx_level_0": idx_level_0, "dates": dates, "values": [*values1, *values2]}
)
df["dates"] = pd.to_datetime(df["dates"])
df = df.set_index(["idx_level_0", "dates"], drop=True)

# df.loc[1].index.freq should yield "D"
# df.loc[1].index.freq should yield "W"

TomAugspurger · 2020-08-07T11:42:15Z

Great, I think we're all agreed on the point that df.loc[i].index.freq should be set, and is a regression. @Kraxelhuber can you edit the original post to reflect that, and that it's a regression from 1.0.5?

simonjayhawkins · 2020-08-07T13:05:30Z

using code sample in #35563 (comment) points to #31315

c988567 is the first bad commit
commit c988567
Author: jbrockmendel jbrockmendel@gmail.com
Date: Sun Feb 9 07:01:20 2020 -0800

REF: tighten what we accept in TimedeltaIndex._simple_new (#31315)

simonjayhawkins · 2020-09-07T09:36:31Z

moved off 1.1.2 milestone (scheduled for this week) as no PRs to fix in the pipeline

simonjayhawkins · 2020-10-05T12:56:13Z

moved off 1.1.3 milestone (overdue) as no PRs to fix in the pipeline

simonjayhawkins · 2020-10-29T14:49:58Z

moved off 1.1.4 milestone (scheduled for release tomorrow) as no PRs to fix in the pipeline

simonjayhawkins · 2020-11-26T21:43:43Z

it appears that the frequency is lost is in the multiindex construction...

>>> import pandas as pd
>>> pd.__version__
'1.2.0.dev0+1320.g28634289c8'
>>> idx = pd.Index(range(2), name="A")
>>> dti = pd.date_range("2020-01-01", periods=7, freq="D", name="B")
>>> mi = pd.MultiIndex.from_product([idx, dti])
>>> mi.levels[1]
DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06', '2020-01-07'],
              dtype='datetime64[ns]', name='B', freq=None)

>>> import pandas as pd
>>> pd.__version__
'1.0.5'
>>> idx = pd.Index(range(2), name="A")
>>> dti = pd.date_range("2020-01-01", periods=7, freq="D", name="B")
>>> mi = pd.MultiIndex.from_product([idx, dti])
>>> mi.levels[1]
DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06', '2020-01-07'],
              dtype='datetime64[ns]', name='B', freq='D')

it is actually lost in the _shallow_copy where changes were made in #31315

the values passed come from factorization.

since there have been several refactors to _shallow_copy/_simple_new since, i'm not sure that we want to (or can easily) revert those changes

so it may be easier to ensure that the freq is retained in the factorisation step. @jbrockmendel thoughts?

> c:\users\simon\pandas\pandas\core\indexes\extension.py(221)_shallow_copy()
-> name = self.name if name is lib.no_default else name
(Pdb) a
self = DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06', '2020-01-07'],
              dtype='datetime64[ns]', name='B', freq='D')
values = <DatetimeArray>
['2020-01-01 00:00:00', '2020-01-02 00:00:00', '2020-01-03 00:00:00',
 '2020-01-04 00:00:00', '2020-01-05 00:00:00', '2020-01-06 00:00:00',
 '2020-01-07 00:00:00']
Length: 7, dtype: datetime64[ns]
name = None
(Pdb) values.freq
(Pdb)

TomAugspurger · 2020-11-26T21:46:04Z

Retaining the freq through factorize makes sense to me... IIUC, if the input index had a freq, the output index will have an identical freq, right? Since for the input to have a freq it must be unique and monotonic.

jbrockmendel · 2020-11-26T21:51:52Z

That's the crux of #33830. I think retaining freq makes sense.

simonjayhawkins · 2020-11-26T22:18:13Z

@jbrockmendel do you have time to look into this? I won't be investigating further today.

jbrockmendel · 2020-11-26T22:21:47Z

strong maybe

phofl · 2020-11-28T22:51:47Z

@jbrockmendel I have a question losely related to the ops question:

dti = date_range(start="1/1/2005", end="12/1/2005", freq="M")
dti2 = dti[[1, 3, 5]]
dti3 = dti[slice(1, 5, 2)]

The first one does not retain a freq while the second one does. This seems odd to me, because both are doing exactly the same from a user facing perspective. If we get a listlike indexer, which is essentially equal to a slice, should we retain the freq then? Could implement this relatively easy in the part where we are inferring the freq for slice like indexers

jbrockmendel · 2020-11-29T00:12:45Z

If we get a listlike indexer, which is essentially equal to a slice

The difference is that when we do dti[some_slice] we get a view on the underlying data, whereas dti[listlike] always makes a copy. Not necessarily a major user-facing difference, but it is non-trivially more performant.

You're right it wouldn't be that hard to do a lib.maybe_indices_to_slice call in DatetimeLikeArrayMixin._get_getitem_freq, but the downside is it would hurt performance. I don't care that much, so if you feel strongly about it, go ahead and open an issue.

phofl · 2020-11-29T00:19:27Z

Thx for the explanation. Do not feel strongly about it, just stumbled across when working on #27180 and wondered about the different behavior.

andreas-vester added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 5, 2020

TomAugspurger added Needs Info Clarification about behavior needed to assess issue and removed Needs Triage Issue that has not been reviewed by a pandas team member Bug labels Aug 5, 2020

simonjayhawkins added this to the 1.1.1 milestone Aug 6, 2020

simonjayhawkins added Frequency DateOffsets Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex and removed Needs Info Clarification about behavior needed to assess issue labels Aug 7, 2020

simonjayhawkins added the Regression Functionality that used to work in a prior pandas version label Aug 7, 2020

simonjayhawkins modified the milestones: 1.1.1, 1.1.2 Aug 20, 2020

simonjayhawkins modified the milestones: 1.1.2, 1.1.3 Sep 7, 2020

simonjayhawkins modified the milestones: 1.1.3, 1.1.4 Oct 5, 2020

simonjayhawkins mentioned this issue Oct 21, 2020

REGR: resample apply fails with KeyError/AttributeError for function that works on DataFrame but fails on Series #36951

Closed

simonjayhawkins modified the milestones: 1.1.4, 1.1.5 Oct 29, 2020

This was referenced Nov 27, 2020

API: EA/Index/Series.factorize should have same kwargs as pd.factorize #33838

Closed

API: preserve freq in DTI/TDI.factorize #38120

Merged

jreback closed this as completed in #38120 Nov 30, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: freq set to NONE when resampling pd.Multiindex (introduced in v1.1.0) #35563

BUG: freq set to NONE when resampling pd.Multiindex (introduced in v1.1.0) #35563

andreas-vester commented Aug 5, 2020 •

edited

Loading

INSTALLED VERSIONS

TomAugspurger commented Aug 5, 2020

andreas-vester commented Aug 5, 2020

TomAugspurger commented Aug 5, 2020

andreas-vester commented Aug 5, 2020

TomAugspurger commented Aug 6, 2020

jbrockmendel commented Aug 6, 2020

simonjayhawkins commented Aug 7, 2020

andreas-vester commented Aug 7, 2020

TomAugspurger commented Aug 7, 2020

simonjayhawkins commented Aug 7, 2020

simonjayhawkins commented Sep 7, 2020

simonjayhawkins commented Oct 5, 2020

simonjayhawkins commented Oct 29, 2020

simonjayhawkins commented Nov 26, 2020

TomAugspurger commented Nov 26, 2020

jbrockmendel commented Nov 26, 2020

simonjayhawkins commented Nov 26, 2020

jbrockmendel commented Nov 26, 2020

phofl commented Nov 28, 2020

jbrockmendel commented Nov 29, 2020

phofl commented Nov 29, 2020

BUG: freq set to NONE when resampling pd.Multiindex (introduced in v1.1.0) #35563

BUG: freq set to NONE when resampling pd.Multiindex (introduced in v1.1.0) #35563

Comments

andreas-vester commented Aug 5, 2020 • edited Loading

Update/Summary

Original post

Code Sample, a copy-pastable example

Problem description

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

TomAugspurger commented Aug 5, 2020

andreas-vester commented Aug 5, 2020

TomAugspurger commented Aug 5, 2020

andreas-vester commented Aug 5, 2020

TomAugspurger commented Aug 6, 2020

jbrockmendel commented Aug 6, 2020

simonjayhawkins commented Aug 7, 2020

andreas-vester commented Aug 7, 2020

TomAugspurger commented Aug 7, 2020

simonjayhawkins commented Aug 7, 2020

simonjayhawkins commented Sep 7, 2020

simonjayhawkins commented Oct 5, 2020

simonjayhawkins commented Oct 29, 2020

simonjayhawkins commented Nov 26, 2020

TomAugspurger commented Nov 26, 2020

jbrockmendel commented Nov 26, 2020

simonjayhawkins commented Nov 26, 2020

jbrockmendel commented Nov 26, 2020

phofl commented Nov 28, 2020

jbrockmendel commented Nov 29, 2020

phofl commented Nov 29, 2020

andreas-vester commented Aug 5, 2020 •

edited

Loading

Output of `pd.show_versions()`