Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Temporal matcher can't handle timezone information in DataFrame indices #150

Closed
awst-baum opened this issue Aug 28, 2018 · 22 comments
Closed
Assignees

Comments

@awst-baum
Copy link
Collaborator

awst-baum commented Aug 28, 2018

The test that reproduces the issue:

def test_timezone_handling():
    data = np.arange(5.0)
    data[3] = np.nan

    match_df = pd.DataFrame({"data": data}, index=pd.date_range(datetime(2007, 1, 1, 0),
                                                            "2007-01-05", freq="D", tz="UTC"))
    timezone = pytz.timezone("UTC")
    ref_df = pd.DataFrame({"matched_data": np.arange(5)},
                            index=[timezone.localize(datetime(2007, 1, 1, 9)),
                                timezone.localize(datetime(2007, 1, 2, 9)),
                                timezone.localize(datetime(2007, 1, 3, 9)),
                                timezone.localize(datetime(2007, 1, 4, 9)),
                                timezone.localize(datetime(2007, 1, 5, 9))])
    matched = tmatching.matching(ref_df, match_df)

    nptest.assert_allclose(np.array([0, 1, 2, 4]), matched.matched_data)
    assert len(matched) == 4

The exception that occurs:

tests/test_temporal_matching.py:61: in test_timezone_handling
    matched = tmatching.matching(ref_df, match_df)
pytesmo/temporal_matching.py:157: in matching
    matched_data = matched_data.join(match)
.../python2.7/site-packages/pandas/core/frame.py:5316: in join
    rsuffix=rsuffix, sort=sort)
.../python2.7/site-packages/pandas/core/frame.py:5331: in _join_compat
    suffixes=(lsuffix, rsuffix), sort=sort)
.../python2.7/site-packages/pandas/core/reshape/merge.py:58: in merge
    return op.get_result()
.../python2.7/site-packages/pandas/core/reshape/merge.py:582: in get_result
    join_index, left_indexer, right_indexer = self._get_join_info()
.../python2.7/site-packages/pandas/core/reshape/merge.py:736: in _get_join_info
    sort=self.sort)
.../python2.7/site-packages/pandas/core/indexes/datetimes.py:1075: in join
    this, other = self._maybe_utc_convert(other)
.../python2.7/site-packages/pandas/core/indexes/datetimes.py:1084: in _maybe_utc_convert
    raise TypeError('Cannot join tz-naive with tz-aware '
E   TypeError: Cannot join tz-naive with tz-aware DatetimeIndex

The root of the problem: pytesmo.temporal_matching.df_match, line 85:

'ref_index': reference.index.values

If you set a breakpoint to line 86, you can observe this:

>>> print(arg.index.dtype)
datetime64[ns, UTC]
>>> print(arg_matched['ref_index'].dtype)
datetime64[ns]
>>> print(reference.index.dtype)
datetime64[ns, UTC]

Any recommendations on how to fix this? Manually copying over the timezone information or taking care not to delete it in the first place?

See also: https://github.com/awst-austria/pytesmo/blob/timezone_issue/tests/test_temporal_matching.py

@awst-baum
Copy link
Collaborator Author

Hm, looks like there's another point where timezone information is discarded: pytesmo.temporal_matching.df_match, line 115:

arg_matched = arg_matched.ix[min_dists]

@cpaulik
Copy link
Collaborator

cpaulik commented Aug 28, 2018

I've only taken a brief look but I think that both cases should work. Could it be that we are looking at pandas problems?

@awst-baum
Copy link
Collaborator Author

It looks like problem comes from pandas index manipulation, so maybe. I'll see if I can do some tests with different pandas versions.

@awst-baum
Copy link
Collaborator Author

awst-baum commented Aug 29, 2018

I just wrote and ran a script that installs pandas versions 0.19.2, 0.20.1, 0.21.1, 0.22.0, 0.23.0, 0.23.1, 0.23.2, and 0.23.4 (with conda) and then runs the unit test above.
It works with none of them on my machine. Hence, I've created a pull request for the unit test, if only to see if the test fails on Travis.

Edit: It does. I'm so glad I'm not seeing imaginary problems...

@cpaulik
Copy link
Collaborator

cpaulik commented Sep 3, 2018

This seems to be pandas-dev/pandas#15750

Also easily reproduced in your test by looking at the ref_df:

>> ref_df.index
DatetimeIndex(['2007-01-01 09:00:00+00:00', '2007-01-02 09:00:00+00:00',
               '2007-01-03 09:00:00+00:00', '2007-01-04 09:00:00+00:00',
               '2007-01-05 09:00:00+00:00'],
              dtype='datetime64[ns, UTC]', freq=None)
>> ref_df.index.values
array(['2007-01-01T09:00:00.000000000', '2007-01-02T09:00:00.000000000',
       '2007-01-03T09:00:00.000000000', '2007-01-04T09:00:00.000000000',
       '2007-01-05T09:00:00.000000000'], dtype='datetime64[ns]')

The question for me is do we need timezone support?

If yes then we should fix it in our temporal matching routine by either using UTC internally or using julian dates internally.

@cpaulik
Copy link
Collaborator

cpaulik commented Sep 3, 2018

And also pandas-dev/pandas#14052

@awst-baum
Copy link
Collaborator Author

The question for me is do we need timezone support?

As I understood Tracy, it's needed for TU's current work - I hope she can explain it here herself.

Whatever the matcher uses internally, we need to make sure that the data returned has the same format as the data put into the matcher.

@cpaulik
Copy link
Collaborator

cpaulik commented Sep 3, 2018

@tracyscanlon Could you tell us about the need for timezone support? What about summer time and all the other ugly things that come with it? If we really need all of that then we should also check the library that pandas uses for all of this to see if everything we need is supported.

@awst-baum
Copy link
Collaborator Author

OK, discussion in our review meeting revealed that we currently don't expect data to be in non-UTC format.

What currently happens in my code using pytesmo is: if data has tz info, convert to UTC and then drop tz info.
I could move that code into pytesmo if that's deemed acceptable.

@cpaulik
Copy link
Collaborator

cpaulik commented Sep 21, 2018

I think converting to UTC, raising a Warning stating the fact that this was done and then dropping the tz info is a good approach.

@awst-baum
Copy link
Collaborator Author

Since I just ran into it with the AscatNc class: what do we do if read/read_ts don't return a DataFrame but an object that has-a DataFrame, like AscatTimeSeries?

@cpaulik
Copy link
Collaborator

cpaulik commented Nov 22, 2018

Then it does not fit our interface and we either write a simple adapter class or we change AscatNc.

@cpaulik
Copy link
Collaborator

cpaulik commented Nov 22, 2018

We can of course also just check the type and let our interface be a little smarter. But we should not let that get out of hand

@sebhahn
Copy link
Member

sebhahn commented Nov 22, 2018

Maybe we can do it the other way around and move the metadata currently provided by AscatNc into the pandas.DataFrame?

@cpaulik
Copy link
Collaborator

cpaulik commented Nov 22, 2018

Not sure this is so good if pandas does not have support for it. See pandas-dev/pandas#2485

We can extend the interface to support objects that have a data or df attribute if needed.

@awst-baum
Copy link
Collaborator Author

So validation_framework/data_manager.py:254 does this:

if type(data_df) is TS or issubclass(type(data_df), TS):

Does this mean that pytesmo doesn't use anything from the TS attributes but .data? And does it mean I can do the same reduction to .data in my adapter class(es) without breaking anything?

@cpaulik
Copy link
Collaborator

cpaulik commented Nov 22, 2018

Actually. It seem we already support it. So you should not need to do anything? What was your original error?

Maybe we should also open a separate issue, since this issue is about the timezones

@awst-baum
Copy link
Collaborator Author

Well, solving the timezone issue requires touching the dataframe somewhere in pytesmo. My question is: is it safe to either modify the TS or discard anything but the .data property. So it's more about how to implement the timezone solution.

@cpaulik
Copy link
Collaborator

cpaulik commented Nov 22, 2018

Since we handle TS object correctly either solution should work for changing the timezone before the validation framework gets the data.

A solution for the timezone issue could also be implemented in the data manager after we have extracted the pandas.DataFrame from our supported input formats.

@sebhahn
Copy link
Member

sebhahn commented Apr 10, 2019

@cpaulik I've merged PR #151, but probably the test should be re-written that it is OK this particular exception occurs and if the time zone handling is implemented it fails?

@cpaulik
Copy link
Collaborator

cpaulik commented Apr 16, 2019

After re-reading this thread, I don't think we need to do anything at the moment. Merging #151 broke our CI so I would revert this if nobody disagrees.

I don't see the point of having a test for a feature that we do not support at the moment.

@sebhahn
Copy link
Member

sebhahn commented Apr 17, 2019

I've reverted the merge d5ab76e

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants