BUG: freq set to NONE when resampling pd.Multiindex (introduced in v1.1.0) #35563
Comments
I think the output on pandas 1.1.0 (no freq) is correct. This can't have a freq since the values are repeated.
In [14]: df.index.get_level_values(1)
Out[14]:
DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06', '2020-01-07', '2020-01-01',
               '2020-01-02', '2020-01-03', '2020-01-04', '2020-01-05',
               '2020-01-06', '2020-01-07'],
              dtype='datetime64[ns]', name='dates', freq=None)
OTOH, |
Well, you're correct in the sense that the values are repeated if you take a look at the whole secondary index. However, they are not repeated when taking into account the multiindex. The freq is D for every group with respect to the first index level. Pandas v1.0.5 treated this differently (freq is D). I would prefer to see the previous behavior. This (intended) change in pandas breaks code on my existing projects. |
I think the previous behavior is incorrect, since the values don't represent a timeseries with daily frequency. (cc @jbrockmendel for a second opinion). |
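A minimal sketch of the point about repeated values, using toy data that is not taken from the original report. It assumes only that a DatetimeIndex validates an explicitly passed freq against its values, which is why a stacked level with repeated dates cannot carry freq="D":

```python
import pandas as pd

# Two groups stacked on top of each other, so every date appears twice.
dates = pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"] * 2)
level = pd.DatetimeIndex(dates, name="dates")

print(level.is_unique)  # the dates repeat across groups
print(level.freq)       # no freq can describe the whole level

# Forcing a daily freq onto the repeated values is rejected by pandas.
try:
    pd.DatetimeIndex(dates, freq="D")
    raised = False
except ValueError as err:
    raised = True
    print(err)
```

This only illustrates the whole-level view; it says nothing about what a single group's slice should report.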
I disagree. Here's my real world example: Consider index level 0 to be equity tickers, index level 1 are trading dates (days, weeks, months, etc.), and the values are closing prices. That is, you have multiple time series below each other. This is a tidy (long) dataframe. Applying the resample method to each groupby object should yield a specific frequency. I am actually wondering on what grounds the different behavior has been introduced into v1.1.0? |
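The use case described above could look roughly like the following sketch. The tickers, prices, and the column name `close` are made up for illustration, and the code sample from the original post is not reproduced here. Whether `.index.freq` is set on a per-ticker slice depends on the pandas version, so the sketch does not rely on it and shows `pd.infer_freq` and `asfreq` as possible workarounds:

```python
import pandas as pd

# Hypothetical long-format prices: two tickers, one observation every other day.
dates = pd.to_datetime(["2020-01-01", "2020-01-03", "2020-01-05"] * 2)
df = pd.DataFrame(
    {"ticker": ["A"] * 3 + ["B"] * 3, "close": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]},
    index=pd.DatetimeIndex(dates, name="dates"),
)

# Resample each ticker to daily frequency; the result has a (ticker, dates) MultiIndex.
daily = df.groupby("ticker")["close"].resample("D").mean()

one = daily.loc["A"]
print(one.index.freq)            # version dependent, may be None
print(pd.infer_freq(one.index))  # the values themselves are regularly spaced

# Re-attach an explicit frequency to the single-ticker slice.
fixed = one.asfreq("D")
print(fixed.index.freq)
```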
@Kraxelhuber all your points correctly describe why
There were several PRs fixing cases where |
@TomAugspurger is right about the distinction between That said, I think there is a behavior that can be improved:
|
can confirm that was the behaviour in 1.0.5
|
I again reflected on your reasoning regarding Anyway, what I am really after is the fact that every group (in terms of a multiindex) can have its own frequency. Actually, these frequencies may be different or not existent for each individual group. I think this is related to the point made by @jbrockmendel
import numpy as np
import pandas as pd
idx_level_0 = np.repeat([1, 2], 5)
dates = [
"2020-01-01",
"2020-01-02",
"2020-01-03",
"2020-01-04",
"2020-01-05",
"2020-03-01",
"2020-03-08",
"2020-03-15",
"2020-03-22",
"2020-03-29",
]
values1 = [1, 2, 3, 4, 5]
values2 = [6, 7, 8, 9, 10]
df = pd.DataFrame(
{"idx_level_0": idx_level_0, "dates": dates, "values": [*values1, *values2]}
)
df["dates"] = pd.to_datetime(df["dates"])
df = df.set_index(["idx_level_0", "dates"], drop=True)
# df.loc[1].index.freq should yield "D"
# df.loc[2].index.freq should yield "W" |
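As a rough sketch of the per-group idea, the frequencies can be inferred from each group's dates with `pd.infer_freq`, independent of whether `.index.freq` is populated. The frame below is rebuilt with the same dates as the example above so the block runs on its own; the dict name `inferred` is just an illustrative choice:

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(
    [
        "2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04", "2020-01-05",
        "2020-03-01", "2020-03-08", "2020-03-15", "2020-03-22", "2020-03-29",
    ]
)
index = pd.MultiIndex.from_arrays(
    [np.repeat([1, 2], 5), dates], names=["idx_level_0", "dates"]
)
df = pd.DataFrame({"values": list(range(1, 11))}, index=index)

# Infer one frequency per group from that group's own dates.
inferred = {
    int(key): pd.infer_freq(group.index.get_level_values("dates"))
    for key, group in df.groupby(level="idx_level_0")
}
print(inferred)  # {1: 'D', 2: 'W-SUN'}
```

Note that the weekly group is reported as "W-SUN" rather than plain "W", because the dates fall on Sundays.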
Great, I think we're all agreed on the point that |
using code sample in #35563 (comment) points to #31315
c988567 is the first bad commit
|
moved off 1.1.2 milestone (scheduled for this week) as no PRs to fix in the pipeline |
moved off 1.1.3 milestone (overdue) as no PRs to fix in the pipeline |
moved off 1.1.4 milestone (scheduled for release tomorrow) as no PRs to fix in the pipeline |
it appears that the frequency is lost in the multiindex construction...
it is actually lost in the the since there have been several refactors to _shallow_copy/_simple_new since, i'm not sure that we want to (or can easily) revert those changes so it may be easier to ensure that the freq is retained in the factorisation step. @jbrockmendel thoughts?
|
Retaining the freq through factorize makes sense to me... IIUC, if the input index had a freq, the output index will have an identical |
That's the crux of #33830. I think retaining freq makes sense. |
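For readers following along, a small sketch of what factorizing a regular DatetimeIndex returns. Whether `uniques.freq` is retained depends on the pandas version (that is the open question here), so the sketch only prints it and makes no claim about the value:

```python
import pandas as pd

idx = pd.date_range("2020-01-01", periods=5, freq="D")

# factorize returns integer codes plus the unique values in order of appearance.
codes, uniques = idx.factorize()

print(codes)                # one code per element; all distinct for a unique index
print(uniques.equals(idx))  # same values as the input
print(idx.freq)             # the input carries a daily freq
print(uniques.freq)         # version dependent: retained or None
```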
@jbrockmendel do you have time to look into this? I won't be investigating further today. |
strong maybe |
@jbrockmendel I have a question loosely related to the OP's question:
The first one does not retain a freq while the second one does. This seems odd to me, because both are doing exactly the same from a user facing perspective. If we get a listlike indexer, which is essentially equal to a slice, should we retain the freq then? Could implement this relatively easily in the part where we are inferring the freq for slice like indexers |
The difference is that when we do You're right it wouldn't be that hard to do a |
Thx for the explanation. Do not feel strongly about it, just stumbled across when working on #27180 and wondered about the different behavior. |
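The snippets being compared did not survive in this thread, so the following is only an assumed reconstruction of the kind of comparison discussed: indexing a DatetimeIndex with a slice versus with an equivalent list of positions. The slice case keeps the freq; what the list case reports depends on the pandas version, so it is printed without a stated expectation:

```python
import pandas as pd

idx = pd.date_range("2020-01-01", periods=5, freq="D")

by_slice = idx[0:3]       # slice indexer
by_list = idx[[0, 1, 2]]  # list-like indexer selecting the same positions

print(by_slice.freq)          # freq carried over from the original index
print(by_list.freq)           # version dependent, may be None
print(by_list.inferred_freq)  # the values are still regularly spaced
```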
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.
Update/Summary
When dealing with a pd.MultiIndex, the frequency df.loc[i].index.freq is not set. This behavior was introduced in v1.1.0.
Original post
Code Sample, a copy-pastable example
Problem description
When resampling a groupby object, the frequency will incorrectly be set to None.
Expected Output
The frequency should be set according to the resampled frequency.
Output of pd.show_versions()
INSTALLED VERSIONS
commit : d9fff27
python : 3.8.2.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : de_DE.cp1252
pandas : 1.1.0
numpy : 1.18.4
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 45.3.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.5.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.2.2
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.5.0
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None