Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: fix origin epoch when freq is Day and harmonize epoch between timezones #34474

Merged
merged 2 commits into from
Jun 1, 2020

Conversation

hasB4K
Copy link
Member

@hasB4K hasB4K commented May 30, 2020

Follow-up of #31809.

The purpose of this PR is to fix the current behavior on master:

import pandas as pd
import numpy as np

start, end = "2000-08-02 23:30:00+0500", "2000-12-02 00:30:00+0500"
rng = pd.date_range(start, end, freq="7min")
ts = pd.Series(np.random.randn(len(rng)), index=rng)

result_1 = ts.resample("D", origin="epoch").count()
result_2 = ts.resample("24H", origin="epoch").count()

print(f"result_1:\n\n{result_1}\n")
print(f"result_2:\n\n{result_2}\n")

Outputs on master:

result_1:

2000-08-02 00:00:00+05:00      5
2000-08-03 00:00:00+05:00    205
2000-08-04 00:00:00+05:00    206
2000-08-05 00:00:00+05:00    206
2000-08-06 00:00:00+05:00    206
                            ... 
2000-11-28 00:00:00+05:00    206
2000-11-29 00:00:00+05:00    206
2000-11-30 00:00:00+05:00    205
2000-12-01 00:00:00+05:00    206
2000-12-02 00:00:00+05:00      5
Freq: D, Length: 123, dtype: int64

result_2:

2000-08-02 05:00:00+05:00     48
2000-08-03 05:00:00+05:00    205
2000-08-04 05:00:00+05:00    206
2000-08-05 05:00:00+05:00    206
2000-08-06 05:00:00+05:00    205
                            ... 
2000-11-27 05:00:00+05:00    206
2000-11-28 05:00:00+05:00    206
2000-11-29 05:00:00+05:00    206
2000-11-30 05:00:00+05:00    205
2000-12-01 05:00:00+05:00    168
Freq: 24H, Length: 122, dtype: int64

Expected Outputs (with this PR):

result_1:

2000-08-02 00:00:00+05:00      5
2000-08-03 00:00:00+05:00    205
2000-08-04 00:00:00+05:00    206
2000-08-05 00:00:00+05:00    206
2000-08-06 00:00:00+05:00    206
                            ... 
2000-11-28 00:00:00+05:00    206
2000-11-29 00:00:00+05:00    206
2000-11-30 00:00:00+05:00    205
2000-12-01 00:00:00+05:00    206
2000-12-02 00:00:00+05:00      5
Freq: D, Length: 123, dtype: int64

result_2:

2000-08-02 00:00:00+05:00      5
2000-08-03 00:00:00+05:00    205
2000-08-04 00:00:00+05:00    206
2000-08-05 00:00:00+05:00    206
2000-08-06 00:00:00+05:00    206
                            ... 
2000-11-28 00:00:00+05:00    206
2000-11-29 00:00:00+05:00    206
2000-11-30 00:00:00+05:00    205
2000-12-01 00:00:00+05:00    206
2000-12-02 00:00:00+05:00      5
Freq: 24H, Length: 123, dtype: int64

I thought of two possible solutions while fixing that:

  1. Consider 'epoch' as a UNIX epoch: pd.Timestamp(0, tz=index_tz)
  2. Consider that 'epoch' is just an helper to fix the origin: pd.Timestamp("1970-01-01", tz=index_tz)

To have similar results between timezones, I thought the solution 2 was a better choice. The solution 2 could also help us to set the behavior origin=epoch as the default one in a future version since the results would be quite similar with origin=start_day (that is not quite the case with a DST timezone, I can provide further details if needed on why this is the case).

(The solution 1 is correct in theory but is giving results that could be hard to explain to the end user.)


  • tests added / passed
  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry (follow-up PR, not necessary I think)

@hasB4K hasB4K added API - Consistency Internal Consistency of API/Behavior Resample resample method labels May 30, 2020
@hasB4K hasB4K requested review from WillAyd and jreback May 30, 2020 00:45
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm @mroeschke if you’d have a quick look

@jreback jreback added this to the 1.1 milestone Jun 1, 2020
pandas/core/resample.py Outdated Show resolved Hide resolved
@hasB4K
Copy link
Member Author

hasB4K commented Jun 1, 2020

It should be good now @jreback 😉. I had an issue with the CI, but #34509 has been merged since.

if isinstance(freq, Tick):
index_tz = first.tz
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this actually be close to where you set the origin (e.g. in the TimeGrouper)?

Copy link
Member Author

@hasB4K hasB4K Jun 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have access to the variables first or end to infer the index timezone in TimeGrouper.__init__ so I don't think I can do what you suggest. 😕

We used to infer the index_tz exactly on the same place before #31809 as you can see here on the tag v.1.0.4:

tz = first.tz

In my opinion that make sense since the API of groupby (shared with the resample API) can accept any kind of grouper, so the bins (that needs the information of first and end) can only be calculated when we apply the grouper against a Series or a DataFrame.

import pandas as pd
import numpy as np

start, end = "2000-08-02 23:30:00+0500", "2000-12-02 00:30:00+0500"
rng = pd.date_range(start, end, freq="7min")
ts_1 = pd.Series(np.random.randn(len(rng)), index=rng)
ts_2 = pd.Series(np.random.randn(len(rng)), index=rng)

# => The example line that create this impossibility:
grouper = pd.Grouper(freq="D", origin="epoch") 
result_1 = ts_1.groupby(grouper).mean()
result_2 = ts_2.groupby(grouper).mean()

@jreback jreback merged commit 3970153 into pandas-dev:master Jun 1, 2020
@jreback
Copy link
Contributor

jreback commented Jun 1, 2020

thanks @hasB4K very nice

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API - Consistency Internal Consistency of API/Behavior Resample resample method
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants