BUG: fix read_csv to parse timezone correctly #22380

Merged
merged 2 commits into from
Aug 22, 2018

Conversation

swyoon
Contributor

@swyoon swyoon commented Aug 16, 2018

Use box=True for to_datetime(), and adjust downstream processing to the change.
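
For context, a minimal sketch of the behavior this PR is after, assuming a pandas release that already includes the fix. The sample CSV, its column names, and its values are made up for illustration and are not taken from the PR's test.

```python
from datetime import timedelta
from io import StringIO

import pandas as pd

# Hypothetical sample data: every timestamp carries the same +09:00 offset.
csv_text = """dt,val
2018-01-04 09:01:00+09:00,23350
2018-01-04 09:02:00+09:00,23400
"""

df = pd.read_csv(StringIO(csv_text), parse_dates=["dt"])

# With the fix, the parsed column keeps its UTC offset rather than
# coming back timezone-naive.
first = df["dt"].iloc[0]
print(df["dt"].dtype)
print(first, first.utcoffset())
```

Checking `utcoffset()` rather than the exact dtype string keeps the sketch independent of how a given pandas version represents fixed offsets.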

dayfirst=dayfirst,
errors='ignore',
infer_datetime_format=infer_datetime_format
)
if not isinstance(converted, DatetimeIndex):
Contributor

what breaks if you just change box-> True?

Contributor Author

box=True turns a non-datetime array into an Index, which causes downstream errors: the downstream code expects an ndarray rather than an Index.

@jreback jreback added IO CSV read_csv, to_csv Timezones Timezone data dtype labels Aug 16, 2018
@jreback
Contributor

jreback commented Aug 16, 2018

cc @gfyoung @mroeschke

@swyoon
Contributor Author

swyoon commented Aug 16, 2018

pandas.io.parsers._infer_types breaks. Specifically, pandas._libs.parsers.sanitize_objects only accepts ndarray as the input. Otherwise, it raises an exception.

@jreback
Contributor

jreback commented Aug 16, 2018

Right, so just put an np.asarray right before sanitize (this doesn't copy if it's already an ndarray) and see what happens.
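
A small sketch of the no-copy behavior mentioned above, using made-up sample values. It only shows what np.asarray does with an ndarray versus an Index; it is not the code from the PR.

```python
import numpy as np
import pandas as pd

arr = np.array(["a", "b", None], dtype=object)
idx = pd.Index(["a", "b", None], dtype=object)

# An ndarray passes through np.asarray untouched: same object, no copy.
same = np.asarray(arr)
print(same is arr)

# An Index is not an ndarray, but np.asarray converts it to one.
converted = np.asarray(idx)
print(isinstance(idx, np.ndarray), type(converted).__name__)
```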

@codecov

codecov bot commented Aug 16, 2018

Codecov Report

Merging #22380 into master will not change coverage.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master   #22380   +/-   ##
=======================================
  Coverage   92.05%   92.05%           
=======================================
  Files         169      169           
  Lines       50733    50733           
=======================================
  Hits        46702    46702           
  Misses       4031     4031
Flag        Coverage Δ
#multiple   90.46% <100%> (ø) ⬆️
#single     42.24% <75%> (ø) ⬆️

Impacted Files         Coverage Δ
pandas/io/parsers.py   95.48% <100%> (ø) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4f11d1a...339f836. Read the comment docs.

@swyoon
Contributor Author

swyoon commented Aug 16, 2018

If we cast the data into an np.ndarray, it will lose the timezone information. An ndarray can't contain timezone information (link). That's why I had to bypass it.
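
To illustrate the point, a minimal sketch assuming a recent pandas. It uses `.values` to expose the underlying datetime64 data, which is an illustrative choice and not the code path touched by this PR.

```python
import numpy as np
import pandas as pd

# A timezone-aware index with a fixed +09:00 offset.
dti = pd.to_datetime(
    ["2018-01-04 09:01:00+09:00", "2018-01-04 09:02:00+09:00"]
)

# .values exposes the underlying datetime64 data: UTC instants, no tz attached.
raw = dti.values
print(dti.tz)
print(raw.dtype, raw[0])
```

The 09:01 local time at +09:00 shows up as 00:01 in the raw array, and nothing in the array records that an offset was ever present.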

@mroeschke
Member

So this is a case I didn't account for when I recently fixed to_datetime parsings of offsets.

In [1]: data = ['2018-01-04 09:01:00+09:00', '2018-01-04 09:02:00+09:00']

# This output should probably behave like In[5]
In [3]: pd.to_datetime(data, box=False)
Out[3]:
array(['2018-01-04T00:01:00.000000000', '2018-01-04T00:02:00.000000000'],
      dtype='datetime64[ns]')

In [4]: data = ['2018-01-04 09:01:00+09:00', '2018-01-04 09:02:00+08:00']

In [5]: pd.to_datetime(data, box=False)
Out[5]:
array([datetime.datetime(2018, 1, 4, 9, 1, tzinfo=tzoffset(None, 32400)),
       datetime.datetime(2018, 1, 4, 9, 2, tzinfo=tzoffset(None, 28800))],
      dtype=object)

I don't have the patch off the top of my head, but what would happen if the to_datetime call instead returned an array of datetime objects? If it returns the expected result, the fix would probably be in to_datetime instead of the parsing code.
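
As a rough sketch of what "an array of datetime objects" looks like for the mixed-offset data above, the snippet below uses the standard library's `datetime.fromisoformat` as a stand-in for to_datetime, so it does not depend on the box keyword. It illustrates the shape of the result only and is not a proposed patch.

```python
from datetime import datetime, timedelta

import numpy as np

data = ["2018-01-04 09:01:00+09:00", "2018-01-04 09:02:00+08:00"]

# Parse each string on its own so every element keeps its own offset.
parsed = np.array([datetime.fromisoformat(s) for s in data], dtype=object)
offsets = [d.utcoffset() for d in parsed]
print(parsed.dtype, offsets)
```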

@swyoon
Contributor Author

swyoon commented Aug 17, 2018

@mroeschke Thanks for the idea! I don't want to modify to_datetime. Since box=True means to_datetime should return an Index, I think to_datetime should remain as is. I just want to update the CSV parsing process.

I think I have come up with a solution, inspired by @mroeschke .

In [1]: data = ['2018-01-04 09:01:00+09:00', '2018-01-04 09:02:00+09:00']
In [2]: dti = pd.to_datetime(data, box=True)
In [3]: np.array(dti.tolist())
Out[3]:
array([Timestamp('2018-01-04 09:01:00+0900', tz='pytz.FixedOffset(540)'),
       Timestamp('2018-01-04 09:02:00+0900', tz='pytz.FixedOffset(540)')],
      dtype=object)

The whole problem was caused by numpy.ndarray with datetime64 dtype not being able to contain timezone information. However, numpy.ndarray with object dtype can contain pandas Timestamps. This makes the PR much simpler.
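
The same idea as a self-contained sketch for a recent pandas, where to_datetime is called without the box keyword. The variable names are illustrative only.

```python
from datetime import timedelta

import numpy as np
import pandas as pd

data = ["2018-01-04 09:01:00+09:00", "2018-01-04 09:02:00+09:00"]
dti = pd.to_datetime(data)  # tz-aware DatetimeIndex

# Object array: each element is a full Timestamp, offset included.
as_objects = np.array(dti.tolist(), dtype=object)
print(as_objects.dtype, as_objects[0].utcoffset())
```

Passing `dtype=object` explicitly is a defensive choice so the result does not depend on NumPy's dtype inference.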

@jreback
Contributor

jreback commented Aug 17, 2018

@swyoon object arrays of Timestamps are quite non-performant;
this is why we use DTI (DatetimeIndex) in the first place

@swyoon swyoon force-pushed the gh-22256 branch 5 times, most recently from b0430cb to 45c5432 Compare August 19, 2018 05:17
@swyoon
Contributor Author

swyoon commented Aug 19, 2018

@jreback I have applied np.asarray as you requested. It works well :) All checks are green.

@@ -674,3 +674,19 @@ def test_parse_date_float(self, data, expected, parse_dates):
        # (i.e. float precision should remain unchanged).
        result = self.read_csv(StringIO(data), parse_dates=parse_dates)
        tm.assert_frame_equal(result, expected)

    def test_parse_timezone(self):
        import pytz
Member

Move import to top of file.

@gfyoung
Member

gfyoung commented Aug 19, 2018

@swyoon : Can you make the changes in #22422 (as a separate commit)? They should work with your PR.

@swyoon swyoon force-pushed the gh-22256 branch 3 times, most recently from a3e4f7b to 52aa9b4 Compare August 20, 2018 04:00
@swyoon
Contributor Author

swyoon commented Aug 20, 2018

@gfyoung Added a commit to fix #22422. It worked fine.
Travis seems to fail occasionally, and it doesn't seem to be related to my PR. Could you take a look?

@gfyoung
Member

gfyoung commented Aug 20, 2018

Yeah...Travis has been acting up a bit. Restarted the failing builds, as they didn't look related to this PR.

@swyoon
Contributor Author

swyoon commented Aug 20, 2018

@gfyoung Thanks, but it failed again (failed job). Could you run it one more time, please?

@swyoon
Contributor Author

swyoon commented Aug 20, 2018

@gfyoung Sorry, but Travis still doesn't work.

Contributor

@jreback jreback left a comment

small comment, lgtm. ping on green.

@@ -674,3 +666,18 @@ def test_parse_date_float(self, data, expected, parse_dates):
        # (i.e. float precision should remain unchanged).
        result = self.read_csv(StringIO(data), parse_dates=parse_dates)
        tm.assert_frame_equal(result, expected)

    def test_parse_timezone(self):
        data = """dt,val
Contributor

Can you add the issue number as a comment?

@jreback jreback added this to the 0.24.0 milestone Aug 20, 2018
@jreback
Contributor

jreback commented Aug 20, 2018

I think we also might have another open issue about this (aside from the referenced one), if you'd do a search and see.

@swyoon swyoon force-pushed the gh-22256 branch 2 times, most recently from 2f7e9c3 to 91ef33e Compare August 20, 2018 11:10
- use box=True for to_datetime(), and adjust downstream processing to
the change.
- resolve pandas-dev#22256
@swyoon
Contributor Author

swyoon commented Aug 21, 2018

@jreback @gfyoung Finally, it's green. This build took so long. I will check out open issues that might be relevant to this PR.

@swyoon
Contributor Author

swyoon commented Aug 21, 2018

@jreback @gfyoung could you please take a look?

Member

@gfyoung gfyoung left a comment

LGTM!

cc @jreback

@swyoon
Contributor Author

swyoon commented Aug 22, 2018

@jreback is there anything that needs to be done before merging?

@jreback jreback merged commit 45c0b5f into pandas-dev:master Aug 22, 2018
@jreback
Contributor

jreback commented Aug 22, 2018

thanks @swyoon

@swyoon
Contributor Author

swyoon commented Aug 22, 2018

@jreback @gfyoung @mroeschke Thank you all

Labels
IO CSV read_csv, to_csv Timezones Timezone data dtype
Development

Successfully merging this pull request may close these issues.

BUG or DOC: pd.read_csv with parse_dates does not recognize timezone
4 participants