[Python] PyArrow cast Unable to cast strings without Zone Offset #41268

Closed
WillAyd opened this issue Apr 17, 2024 · 2 comments
Labels
Component: Python Type: usage Issue is a user question

Comments

@WillAyd
Contributor

WillAyd commented Apr 17, 2024

Describe the bug, including details regarding any error messages, version, and platform.

>>> import datetime
>>> import pyarrow as pa
>>> pa.array(["2024-01-01 05:00:00"]).cast(pa.timestamp("s"))
<pyarrow.lib.TimestampArray object at 0x7c6c794463e0>
[
  2024-01-01 05:00:00
]

>>> pa.array([datetime.datetime(2024, 1, 1, 5, 0, 0)]).cast(pa.timestamp("s"))
<pyarrow.lib.TimestampArray object at 0x7c6c79444b80>
[
  2024-01-01 05:00:00
]

>>> pa.array([datetime.datetime(2024, 1, 1, 5, 0, 0)]).cast(pa.timestamp("s", "UTC"))
<pyarrow.lib.TimestampArray object at 0x7c6c7956d720>
[
  2024-01-01 05:00:00Z
]

>>> pa.array(["2024-01-01 05:00:00"]).cast(pa.timestamp("s", "UTC"))
ArrowInvalid: Failed to parse string: '2024-01-01 05:00:00' as a scalar of type timestamp[s, tz=UTC]: expected a zone offset. If these timestamps are in local time, cast to timestamp without timezone, then call assume_timezone.

I'm having a hard time figuring out the best way to construct timezone aware arrays from strings instead of Python datetime objects. Based off the pattern set by the first 3 examples above it seems like a bug that the fourth does not work?

Component(s)

Python

@amoeba amoeba changed the title PyArrow cast Unable to cast strings without Zone Offset [Python] PyArrow cast Unable to cast strings without Zone Offset Apr 17, 2024
@jorisvandenbossche jorisvandenbossche added Type: usage Issue is a user question and removed Type: bug labels Apr 18, 2024
@jorisvandenbossche
Member

Casting a string to a timestamp is essentially parsing the string (strptime), and we currently don't allow parsing a non-tz-aware string to a tz-aware timestamp. To do that you would need to guess whether the string is in local wall time or in UTC, i.e. whether it is a tz localize or a tz convert operation, in pandas terms.

The other examples you give are parsing a non-tz-aware string to a non-tz-aware timestamp (no ambiguity, this works fine) and casting a non-tz-aware timestamp to a tz-aware timestamp. This last case is also potentially ambiguous, but the cast here is a very simple zero-copy cast that just changes the metadata of the timestamp type (to add a timezone), and thus treats the input as UTC (and not as local wall time, for which there is a specific kernel, pc.assume_timezone).

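So parsing a non-tz-aware string to a tz-aware timestamp can always be done in two steps: first parse / cast to a non-tz-aware timestamp, then convert to a tz-aware timestamp, either with a plain cast (the values are interpreted as UTC) or with pc.assume_timezone (the values are interpreted as local wall time):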

>>> import pyarrow as pa
>>> import pyarrow.compute as pc
>>> pa.array(["2024-01-01 05:00:00"]).cast(pa.timestamp("s")).cast(pa.timestamp("s", "Europe/Brussels"))
<pyarrow.lib.TimestampArray object at 0x7f065c331960>
[
  2024-01-01 05:00:00Z
]
>>> pc.assume_timezone(pa.array(["2024-01-01 05:00:00"]).cast(pa.timestamp("s")), "Europe/Brussels")
<pyarrow.lib.TimestampArray object at 0x7f065c2d26e0>
[
  2024-01-01 04:00:00Z
]

@WillAyd
Contributor Author

WillAyd commented Apr 18, 2024

Makes sense - thanks for the explanation
