BUG: fix read_csv to parse timezone correctly #22380

swyoon · 2018-08-16T03:26:31Z

Use box=True for to_datetime(), and adjust downstream processing to the change.

- closes BUG or DOC: pd.read_csv with parse_dates does not recognize timezone #22256
- tests added / passed
- passes `git diff upstream/master -u -- "*.py" | flake8 --diff`
- whatsnew entry
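
For context, here is a minimal sketch of the behavior this PR is after, assuming a pandas release that already includes the fix. The CSV contents and column names below are made up for illustration; they are not the data from the linked issue or from the test added in this PR.

```python
from datetime import timedelta
from io import StringIO

import pandas as pd

# Hypothetical CSV: every timestamp carries the same +09:00 UTC offset.
csv_text = (
    "dt,val\n"
    "2018-01-04 09:01:00+09:00,1\n"
    "2018-01-04 09:02:00+09:00,2\n"
)

df = pd.read_csv(StringIO(csv_text), parse_dates=["dt"])

# With the fix in place, the parsed values keep their UTC offset
# instead of being silently converted to naive UTC timestamps.
first = df["dt"].iloc[0]
print(df["dt"].dtype)
print(first.utcoffset() == timedelta(hours=9))
```

The exact dtype string printed may differ between pandas versions, so the sketch only checks the offset of the first parsed value.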

jreback · 2018-08-16T10:23:55Z

pandas/io/parsers.py

                    dayfirst=dayfirst,
                    errors='ignore',
                    infer_datetime_format=infer_datetime_format
                )
+                if not isinstance(converted, DatetimeIndex):


what breaks if you just change box-> True?

box=True will turn a non-datetime array into an Index, resulting in downstream errors. The downstream code expects an ndarray rather than an Index.

jreback · 2018-08-16T10:24:20Z

cc @gfyoung @mroeschke

swyoon · 2018-08-16T12:08:26Z

pandas.io.parsers._infer_types breaks. Specifically, pandas._libs.parsers.sanitize_objects only accepts ndarray as the input. Otherwise, it raises an exception.

jreback · 2018-08-16T12:47:07Z

right, so just put an
np.asarray right before sanitize (this doesn't copy if it's already an ndarray)

and see what happens
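
A small sketch of the no-copy claim above, using plain arrays rather than the parser internals (the variable names are illustrative only):

```python
import numpy as np
import pandas as pd

arr = np.array(["a", "b"], dtype=object)

# For an input that is already an ndarray, np.asarray returns
# the very same object, so no copy is made.
same = np.asarray(arr)
print(same is arr)

# For an Index, np.asarray produces an ndarray, which is the type
# the downstream code in this thread is said to require.
from_index = np.asarray(pd.Index(["a", "b"]))
print(type(from_index).__name__)
```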

codecov · 2018-08-16T14:04:19Z

Codecov Report

Merging #22380 into master will not change coverage.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master   #22380   +/-   ##
=======================================
  Coverage   92.05%   92.05%           
=======================================
  Files         169      169           
  Lines       50733    50733           
=======================================
  Hits        46702    46702           
  Misses       4031     4031

Flag	Coverage Δ
#multiple	`90.46% <100%> (ø)`	⬆️
#single	`42.24% <75%> (ø)`	⬆️

Impacted Files	Coverage Δ
pandas/io/parsers.py	`95.48% <100%> (ø)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4f11d1a...339f836. Read the comment docs.

swyoon · 2018-08-16T15:27:41Z

If we cast the data into np.ndarray, it will lose the timezone information, because an ndarray can't contain timezone information (link). That's why I had to bypass it.
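
To illustrate the point, a hedged sketch assuming a recent pandas release. The `box` argument discussed in this thread is left out here, on the assumption that current releases no longer accept it; `.values` is used to get at the underlying datetime64 ndarray.

```python
import numpy as np
import pandas as pd

data = ["2018-01-04 09:01:00+09:00", "2018-01-04 09:02:00+09:00"]

# Parsing strings that share one offset gives a tz-aware DatetimeIndex.
dti = pd.to_datetime(data)
print(dti.tz is not None)

# The underlying datetime64 ndarray has no slot for a timezone:
# the values are stored as UTC instants and the +09:00 offset is gone.
values = dti.values
print(values.dtype.kind)
print(values[0] == np.datetime64("2018-01-04T00:01:00"))
```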

mroeschke · 2018-08-16T17:10:25Z

So this is a case I didn't account for when I recently fixed to_datetime's parsing of offsets.

In [1]: data = ['2018-01-04 09:01:00+09:00', '2018-01-04 09:02:00+09:00']

# This output should probably behave like In[5]
In [3]: pd.to_datetime(data, box=False)
Out[3]:
array(['2018-01-04T00:01:00.000000000', '2018-01-04T00:02:00.000000000'],
      dtype='datetime64[ns]')

In [4]: data = ['2018-01-04 09:01:00+09:00', '2018-01-04 09:02:00+08:00']

In [5]: pd.to_datetime(data, box=False)
Out[5]:
array([datetime.datetime(2018, 1, 4, 9, 1, tzinfo=tzoffset(None, 32400)),
       datetime.datetime(2018, 1, 4, 9, 2, tzinfo=tzoffset(None, 28800))],
      dtype=object)

I don't have the patch off the top of my head, but what would happen if the to_datetime call instead returned an array of datetime objects? If it returns the expected result, the fix would probably be in to_datetime instead of the parsing code.
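
As a standard-library-only illustration of the "array of datetime objects" idea (this is not how to_datetime is implemented, just a sketch of the data model): timezone-aware `datetime` objects each carry their own offset, so a mixed-offset list keeps everything that a single datetime64 array would discard.

```python
from datetime import datetime, timedelta, timezone

# Same strings as the mixed-offset example above.
data = ["2018-01-04 09:01:00+09:00", "2018-01-04 09:02:00+08:00"]

# Each parsed object keeps the offset it was written with.
parsed = [datetime.fromisoformat(s) for s in data]
offsets = [p.utcoffset() for p in parsed]
print(offsets)

# Normalizing to UTC keeps the instants but drops the original offsets,
# which is the information loss discussed in this thread.
as_utc = [p.astimezone(timezone.utc) for p in parsed]
print(as_utc)
```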

swyoon · 2018-08-17T01:36:27Z

@mroeschke Thanks for the idea! I don't want to modify to_datetime. Since box=True means the return value of to_datetime should be an Index, I think to_datetime should remain as is. I just want to update the CSV parsing process.

I think I have come up with a solution, inspired by @mroeschke.

In [1]: data = ['2018-01-04 09:01:00+09:00', '2018-01-04 09:02:00+09:00']
In [2]: dti = pd.to_datetime(data, box=True)
In [3]: np.array(dti.tolist())
Out[3]:
array([Timestamp('2018-01-04 09:01:00+0900', tz='pytz.FixedOffset(540)'),
       Timestamp('2018-01-04 09:02:00+0900', tz='pytz.FixedOffset(540)')],
      dtype=object)

The whole problem was caused by numpy.ndarray with datetime64 dtype not being able to contain timezone information. However, numpy.ndarray with object dtype can contain pandas Timestamps. This makes the PR much simpler.
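
The same idea as a self-contained sketch, assuming a recent pandas release (so `box` is omitted). Passing `dtype=object` explicitly is an addition here to make the intended dtype unambiguous; the snippet above relies on NumPy's inference instead.

```python
from datetime import timedelta

import numpy as np
import pandas as pd

data = ["2018-01-04 09:01:00+09:00", "2018-01-04 09:02:00+09:00"]
dti = pd.to_datetime(data)

# An object-dtype ndarray stores the Timestamp objects themselves,
# so each element keeps its own timezone information.
boxed = np.array(dti.tolist(), dtype=object)
print(boxed.dtype)
print(boxed[0].utcoffset() == timedelta(hours=9))
```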

jreback · 2018-08-17T01:46:40Z

@swyoon object arrays of Timestamps are quite non performant
this is why we use DTI in the first place

swyoon · 2018-08-19T09:29:24Z

@jreback I have applied np.asarray as you requested. It works well :) All checks are green.

gfyoung · 2018-08-19T09:34:02Z

pandas/tests/io/parser/parse_dates.py

@@ -674,3 +674,19 @@ def test_parse_date_float(self, data, expected, parse_dates):
        # (i.e. float precision should remain unchanged).
        result = self.read_csv(StringIO(data), parse_dates=parse_dates)
        tm.assert_frame_equal(result, expected)
+
+    def test_parse_timezone(self):
+        import pytz


Move import to top of file.

gfyoung · 2018-08-19T19:51:06Z

@swyoon : Can you make the changes in #22422 (as a separate commit)? They should work with your PR.

swyoon · 2018-08-20T04:41:23Z

@gfyoung Added a commit to fix #22422. It worked fine.
Travis seems to fail occasionally, and it doesn't seem to be related to my PR. Could you take a look?

gfyoung · 2018-08-20T05:13:41Z

Yeah...Travis has been acting up a bit. Restarted the failing builds, as they didn't look related to this PR.

swyoon · 2018-08-20T05:31:13Z

@gfyoung Thanks, but it failed again... (failed job) Could you run it one more time, please?

swyoon · 2018-08-20T08:30:57Z

@gfyoung Sorry, but Travis still doesn't work.

jreback

small comment, lgtm. ping on green.

jreback · 2018-08-20T10:23:49Z

pandas/tests/io/parser/parse_dates.py

@@ -674,3 +666,18 @@ def test_parse_date_float(self, data, expected, parse_dates):
        # (i.e. float precision should remain unchanged).
        result = self.read_csv(StringIO(data), parse_dates=parse_dates)
        tm.assert_frame_equal(result, expected)
+
+    def test_parse_timezone(self):
+        data = """dt,val


can you add the issue number as a comment

jreback · 2018-08-20T10:25:34Z

I think we also might have another open issue about this (aside from the referenced one), if you'd do a search and see.

swyoon · 2018-08-21T07:27:56Z

@jreback @gfyoung Finally, it's green. This build took so long. I will check out open issues that might be relevant to this PR.

swyoon · 2018-08-21T14:33:46Z

@jreback @gfyoung could you please take a look?

gfyoung

LGTM!

cc @jreback

swyoon · 2018-08-22T10:20:30Z

@jreback is there anything that needs to be done before merging?

jreback · 2018-08-22T10:27:19Z

thanks @swyoon

swyoon · 2018-08-22T10:28:13Z

@jreback @gfyoung @mroeschke Thank you all

jreback requested changes Aug 16, 2018

jreback added the IO CSV (read_csv, to_csv) and Timezones (Timezone data dtype) labels Aug 16, 2018

swyoon force-pushed the gh-22256 branch from aae7d6e to a1825b1 Compare August 17, 2018 01:26

swyoon force-pushed the gh-22256 branch 5 times, most recently from b0430cb to 45c5432 Compare August 19, 2018 05:17

gfyoung reviewed Aug 19, 2018

swyoon force-pushed the gh-22256 branch 2 times, most recently from 8ef5b1d to e4fc116 Compare August 19, 2018 14:30

gfyoung mentioned this pull request Aug 19, 2018

CLN: Remove try-except in parse_dates test #22422

Closed

swyoon force-pushed the gh-22256 branch 3 times, most recently from a3e4f7b to 52aa9b4 Compare August 20, 2018 04:00

swyoon force-pushed the gh-22256 branch from 52aa9b4 to bb12d32 Compare August 20, 2018 10:00

jreback requested changes Aug 20, 2018

jreback added this to the 0.24.0 milestone Aug 20, 2018

swyoon force-pushed the gh-22256 branch 2 times, most recently from 2f7e9c3 to 91ef33e Compare August 20, 2018 11:10

swyoon added 2 commits August 21, 2018 08:44

BUG: fix read_csv to parse timezone correctly

3a9c093

- use box=True for to_datetime(), and adjust downstream processing to the change. - resolve pandas-dev#22256

CLN: Remove try-except in parse_dates test

339f836

swyoon force-pushed the gh-22256 branch from 91ef33e to 339f836 Compare August 20, 2018 23:44

mroeschke mentioned this pull request Aug 21, 2018

BUG: to_datetime drops UTC offset when parsing datetime strings and box=False #22446

Closed

gfyoung approved these changes Aug 21, 2018

jreback approved these changes Aug 22, 2018

jreback merged commit 45c0b5f into pandas-dev:master Aug 22, 2018

swyoon deleted the gh-22256 branch August 22, 2018 10:28

mroeschke mentioned this pull request Aug 24, 2018

CLN: Simplify read_csv tz offset parsing #22494

Merged

Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018

BUG: fix read_csv to parse timezone correctly (pandas-dev#22380)

8071730

TomAugspurger mentioned this pull request Jan 28, 2019

0.23.4 changed read_csv parsing for a mixed-timezone datetimes #24987

Closed
