BUG: fix origin epoch when freq is Day and harmonize epoch between timezones #34474

Merged: 2 commits, Jun 1, 2020
12 changes: 8 additions & 4 deletions pandas/core/resample.py
@@ -1693,11 +1693,15 @@ def _get_timestamp_range_edges(
     -------
     A tuple of length 2, containing the adjusted pd.Timestamp objects.
     """
-    index_tz = first.tz
-    if isinstance(origin, Timestamp) and (origin.tz is None) != (index_tz is None):
-        raise ValueError("The origin must have the same timezone as the index.")
-
     if isinstance(freq, Tick):
+        index_tz = first.tz
Contributor

Shouldn't this actually be close to where you set the origin (e.g. in the TimeGrouper)?

Member Author

@hasB4K, Jun 1, 2020

I don't have access to the variables first or last in TimeGrouper.__init__ to infer the index timezone, so I don't think I can do what you suggest. 😕

We used to infer index_tz in exactly the same place before #31809, as you can see here at the tag v1.0.4:

tz = first.tz

In my opinion that makes sense: the groupby API (shared with the resample API) accepts any kind of grouper, so the bins (which need first and last) can only be calculated when the grouper is applied to a Series or a DataFrame.

import pandas as pd
import numpy as np

start, end = "2000-08-02 23:30:00+0500", "2000-12-02 00:30:00+0500"
rng = pd.date_range(start, end, freq="7min")
ts_1 = pd.Series(np.random.randn(len(rng)), index=rng)
ts_2 = pd.Series(np.random.randn(len(rng)), index=rng)

# => The line that makes this impossible: the grouper is built without any index
grouper = pd.Grouper(freq="D", origin="epoch")
result_1 = ts_1.groupby(grouper).mean()
result_2 = ts_2.groupby(grouper).mean()
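
The deferred resolution described in this comment can be sketched without pandas. The class below is a hypothetical stand-in (SketchGrouper and resolve_origin are illustrative names, not pandas API): it stores only the settings at construction time and resolves "epoch" once the first index value, and therefore the timezone, is known. It uses standard-library datetime objects instead of pd.Timestamp, so it is an approximation of the idea rather than the actual implementation.

```python
from datetime import datetime, timedelta, timezone


class SketchGrouper:
    """Hypothetical stand-in for a grouper: it stores settings only, no index."""

    def __init__(self, freq, origin="epoch"):
        self.freq = freq
        self.origin = origin  # cannot be resolved yet: no timezone is known here

    def resolve_origin(self, first):
        """Resolve the origin once the first index value (and its tz) is known."""
        index_tz = first.tzinfo
        if isinstance(self.origin, datetime):
            # Naive/aware mismatch: same rule as the ValueError in the diff.
            if (self.origin.tzinfo is None) != (index_tz is None):
                raise ValueError(
                    "The origin must have the same timezone as the index."
                )
            return self.origin
        if self.origin == "epoch":
            # Express the epoch in the timezone of the index.
            return datetime(1970, 1, 1, tzinfo=index_tz)
        raise ValueError(f"unsupported origin: {self.origin!r}")


# One grouper, reused on an aware and a naive index start.
grouper = SketchGrouper(freq=timedelta(days=1))
aware_first = datetime(2000, 8, 2, 23, 30, tzinfo=timezone(timedelta(hours=5)))
naive_first = datetime(2000, 8, 2, 23, 30)

print(grouper.resolve_origin(aware_first))  # 1970-01-01 00:00:00+05:00
print(grouper.resolve_origin(naive_first))  # 1970-01-01 00:00:00
```

The same grouper object yields a different concrete origin for each index, which is why the resolution cannot happen in the constructor.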

+        if isinstance(origin, Timestamp) and (origin.tz is None) != (index_tz is None):
+            raise ValueError("The origin must have the same timezone as the index.")
+        elif origin == "epoch":
+            # set the epoch based on the timezone to have similar bins results when
+            # resampling on the same kind of indexes on different timezones
+            origin = Timestamp("1970-01-01", tz=index_tz)
+
         if isinstance(freq, Day):
             # _adjust_dates_anchored assumes 'D' means 24H, but first/last
             # might contain a DST transition (23H, 24H, or 25H).
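
To see why the hunk above localizes the epoch to the index timezone, here is a minimal standard-library sketch. It assumes that a fixed-frequency bin is found by flooring the distance from the origin to a multiple of the frequency; bin_start is a hypothetical helper written for this illustration, not a pandas function.

```python
from datetime import datetime, timedelta, timezone


def bin_start(ts, freq, origin):
    """Floor ts onto a grid of step freq anchored at origin."""
    n_steps = (ts - origin) // freq  # timedelta // timedelta floors to an int
    return origin + n_steps * freq


tz_plus5 = timezone(timedelta(hours=5))
ts = datetime(2000, 10, 1, 23, 30, tzinfo=tz_plus5)
one_day = timedelta(hours=24)

# Epoch expressed in the index timezone, as in the patched origin == "epoch" branch.
local_epoch = datetime(1970, 1, 1, tzinfo=tz_plus5)
# Epoch fixed in UTC, for comparison.
utc_epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)

local_bin = bin_start(ts, one_day, local_epoch)
utc_bin = bin_start(ts, one_day, utc_epoch).astimezone(tz_plus5)

print(local_bin)  # 2000-10-01 00:00:00+05:00
print(utc_bin)    # 2000-10-01 05:00:00+05:00
```

With the localized epoch the 24-hour bin starts at local midnight, which is also where a calendar-day bin starts for a fixed offset. With a UTC epoch the bin edge lands at 05:00 local time.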
28 changes: 28 additions & 0 deletions pandas/tests/resample/test_datetime_index.py
@@ -846,6 +846,34 @@ def test_resample_origin_with_tz():
         ts.resample("5min", origin="12/31/1999 23:57:00+03:00").mean()


+def test_resample_origin_epoch_with_tz_day_vs_24h():
+    # GH 34474
+    start, end = "2000-10-01 23:30:00+0500", "2000-12-02 00:30:00+0500"
+    rng = pd.date_range(start, end, freq="7min")
+    random_values = np.random.randn(len(rng))
+    ts_1 = pd.Series(random_values, index=rng)
+
+    result_1 = ts_1.resample("D", origin="epoch").mean()
+    result_2 = ts_1.resample("24H", origin="epoch").mean()
+    tm.assert_series_equal(result_1, result_2)
+
+    # check that we have the same behavior with epoch even if we are not timezone aware
+    ts_no_tz = ts_1.tz_localize(None)
+    result_3 = ts_no_tz.resample("D", origin="epoch").mean()
+    result_4 = ts_no_tz.resample("24H", origin="epoch").mean()
+    tm.assert_series_equal(result_1, result_3.tz_localize(rng.tz), check_freq=False)
+    tm.assert_series_equal(result_1, result_4.tz_localize(rng.tz), check_freq=False)
+
+    # check that we have the similar results with two different timezones (+2H and +5H)
+    start, end = "2000-10-01 23:30:00+0200", "2000-12-02 00:30:00+0200"
+    rng = pd.date_range(start, end, freq="7min")
+    ts_2 = pd.Series(random_values, index=rng)
+    result_5 = ts_2.resample("D", origin="epoch").mean()
+    result_6 = ts_2.resample("24H", origin="epoch").mean()
+    tm.assert_series_equal(result_1.tz_localize(None), result_5.tz_localize(None))
+    tm.assert_series_equal(result_1.tz_localize(None), result_6.tz_localize(None))
+
+
 def test_resample_origin_with_day_freq_on_dst():
     # GH 31809
     tz = "America/Chicago"
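
The last part of test_resample_origin_epoch_with_tz_day_vs_24h above checks that two indexes with the same wall-clock times but different offsets (+02:00 and +05:00) produce the same bins once the timezone is dropped. A rough standard-library analogue is sketched below. It counts points per bin instead of averaging random values, and daily_bin_counts is a hypothetical helper, so it illustrates the property only under the assumption that bins are floored from an epoch expressed in the index timezone.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def daily_bin_counts(start, n_points, step, tz):
    """Count points per 24-hour bin, with bins anchored at 1970-01-01 in tz."""
    origin = datetime(1970, 1, 1, tzinfo=tz)
    day = timedelta(hours=24)
    counts = Counter()
    for i in range(n_points):
        ts = start.replace(tzinfo=tz) + i * step
        bin_start = origin + ((ts - origin) // day) * day
        # Drop the timezone so bins from different offsets can be compared.
        counts[bin_start.replace(tzinfo=None)] += 1
    return counts


start = datetime(2000, 10, 1, 23, 30)  # naive wall-clock start, as in the test
step = timedelta(minutes=7)
plus2 = daily_bin_counts(start, 1000, step, timezone(timedelta(hours=2)))
plus5 = daily_bin_counts(start, 1000, step, timezone(timedelta(hours=5)))

print(plus2 == plus5)  # True
```

Because the epoch moves with the offset, the bin edges stay at local midnight in both cases, so the per-bin contents match.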