Add "semester" as a time/date component to DatetimeIndex #22362

Open
Nemecsek opened this issue Aug 15, 2018 · 5 comments
Labels
Datetime Datetime data dtype Enhancement Frequency DateOffsets

Comments

@Nemecsek

Nemecsek commented Aug 15, 2018

Groupby is missing "semester"

import numpy as np
import pandas as pd

# Daily index that deliberately starts mid-year (May), not on a half-year boundary.
days = pd.date_range(start="2017-05-17", end="2017-11-29", freq="1D")
df = pd.DataFrame({'DTIME': days, 'DATA': np.random.randint(50, high=80, size=len(days))})
df.set_index('DTIME', inplace=True)

grouped = df.groupby(pd.Grouper(freq='2QS'))  # group by 2 quarters, anchored at quarter start
print("Groups date start:")
for dtime, group in grouped:
    print(dtime)
    # print(group)

This returns groups anchored to the first datetime in the dataset's index, not to the calendar semesters that begin on January 1st and July 1st:

Groups date start:
2017-04-01 00:00:00    <=== because the first datetime in the index is in May 2017
2017-10-01 00:00:00

while I would expect:

Groups date start:
2017-01-01 00:00:00   
2017-07-01 00:00:00

This issue is difficult to spot because the behaviour changes with the dataset, when it should be consistent. I didn't notice it with my first dataset, which started in January.

The same problem shows up when grouping by 6MS (six months, start).

A semester frequency is missing from pandas' offset aliases.
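Until such an alias exists, one possible workaround is to build the group key by hand. The following is a minimal sketch, not a pandas feature: `half_start` is a hypothetical name, and it assumes a "semester" means the calendar halves Jan-Jun and Jul-Dec. Each row is mapped to the first day of its calendar half, so the groups no longer depend on where the dataset happens to start.

```python
import numpy as np
import pandas as pd

# Same date range as in the report above; deterministic data for illustration.
days = pd.date_range(start="2017-05-17", end="2017-11-29", freq="D")
df = pd.DataFrame({"DATA": np.arange(len(days))}, index=days)

# Hypothetical helper key: the first day of the calendar half
# (Jan 1 or Jul 1) that each row belongs to.
half_start = pd.DatetimeIndex(
    [pd.Timestamp(year=ts.year, month=1 if ts.month <= 6 else 7, day=1)
     for ts in df.index]
)

grouped = df.groupby(half_start)
group_starts = [key for key, _ in grouped]

print("Groups date start:")
for key in group_starts:
    print(key)
# 2017-01-01 00:00:00
# 2017-07-01 00:00:00
```

The list comprehension is slow on large indexes, but it keeps the intent obvious; the key is a plain array of timestamps, so it also works with `resample`-free pipelines.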

@WillAyd
Member

WillAyd commented Aug 15, 2018

I believe this is intended as there is nothing that requires a "2 quarter period" to start on January and July, respectively.

cc @jbrockmendel

@WillAyd WillAyd added the Datetime Datetime data dtype label Aug 15, 2018
@Nemecsek
Author

@WillAyd, it may be intended, but it is certainly confusing.
The problem arises when the interval is not unitary, i.e. a multiple such as 2 months or 3 quarters. In that case the groups are anchored to the first date of the dataset:

Grouping by "2Q":

2017-04-01 00:00:00    <=== because the first datetime in the index is in May 2017
2017-10-01 00:00:00

This behaviour is dangerous because it changes according to the dataset.
Imagine a financial use case. You have a dataset starting on January 1st and ending on December 31st. The grouping looks OK, but only by chance, because the first date is in January.

As soon as your first date is in May, the 2Q grouping no longer starts on January 1st.

That is what happened to me: I was using "2Q" as a synonym for semester, and simply changing the demo dataset made the groups go haywire...

Adding a dedicated alias for semester, such as Z, would allow it to be used as a unitary interval and avoid this behaviour.

@jbrockmendel
Member

I think something like this would be useful. @Nemecsek does the use case depend on "semester" corresponding to e.g. Sep-Dec/Jan-May, or would it be sufficient to have something like Half that corresponds to Jan-Jun/July-Dec? (the latter would be easier to implement)

Aside from that, the design question that comes to mind is whether we should modify Quarter offsets to be customizable, or implement new Half/Semester/Season offset classes.

PRs welcome. You'll want to look at pandas.tseries.offsets.

@jbrockmendel jbrockmendel added the Frequency DateOffsets label Aug 17, 2018
@Nemecsek
Author

@jbrockmendel, I would implement "Half" (Jan-Jun/Jul-Dec) to stay consistent with the existing logic.
Don't forget the start variant, HS ("Half start"): I noticed that in pandas 0.23.4 WS ("week start") is not a valid alias and only W is.

@giuliobeseghi

I agree with @Nemecsek. In addition:

this_month = pd.Timestamp('2019-9')
sem = pd.offsets.QuarterBegin(n=2, startingMonth=1)

print(this_month - sem)

gives
Timestamp('2019-04-01 00:00:00')
and not
Timestamp('2019-07-01 00:00:00')

This is expected (by the definition of QuarterBegin). Hopefully the "semester" offset will return the second option.
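For comparison, the arithmetic hoped for here can be sketched by hand. This is only an illustration under stated assumptions, not an existing pandas offset: `previous_half_begin` is a hypothetical helper, halves are assumed to start on January 1st and July 1st, and the input is assumed to be a timezone-naive timestamp. It returns the latest half start strictly before the given timestamp, which is what subtracting one "semester begin" would be expected to do.

```python
import pandas as pd

def previous_half_begin(ts):
    """Hypothetical helper: latest Jan 1 / Jul 1 strictly before ts."""
    # Start of the calendar half that contains ts.
    start = pd.Timestamp(year=ts.year, month=1 if ts.month <= 6 else 7, day=1)
    if ts > start:
        # ts is inside the half, so roll back to its start.
        return start
    # ts sits exactly on a boundary, so step back one full half.
    return start - pd.DateOffset(months=6)

this_month = pd.Timestamp("2019-09-01")
print(previous_half_begin(this_month))  # 2019-07-01 00:00:00
```

A real offset class would also need to handle `n > 1`, forward steps, and a configurable starting month, which this sketch leaves out.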
