Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Representing a timezone value in Substrait #190

Closed
rtpsw opened this issue May 10, 2022 · 28 comments
Closed

Representing a timezone value in Substrait #190

rtpsw opened this issue May 10, 2022 · 28 comments

Comments

@rtpsw
Copy link

rtpsw commented May 10, 2022

How should a timezone value be represented within a Substrait plan? A use case I have in mind for a timezone value in Substrait is casting a string-typed field, which could have values like 2/2/2022 2:20:22, to a timestamp-typed field; this conversion requires a timezone value to be meaningful. I'd propose that changing the Substrait spec of TimezoneTZ to allow a tz attribute. Currently, the timezone interpretation of a TimezoneTZ is implicit, and at least in Arrow it defaults to UTC.

Note that this is a follow up on the discussion here.

@rtpsw
Copy link
Author

rtpsw commented May 10, 2022

cc @icexelloss

@jacques-n
Copy link
Contributor

I don't think the plan should actually carry a timezone concept. It seems like this is really a function. For, something like:

as_date(<value:string>, <timezone:string>)

OR

`as_date(value:string, timeszone:enum[GMT,...])

While a particular language may treat timezone as data type, I think we can handle via either an enum or string in Substrait. Are there any cases where that is insufficient?

@jacques-n
Copy link
Contributor

jacques-n commented May 11, 2022

For reference to your other questions @cpcloud, the spec doesn't specify the specific physical representation because that is encoding specific. It simply specifies that the value is an instance on the timeline. The protobuf physical representation does call out the representation:

Timestamp in units of microseconds since the UNIX epoch. reference link

@westonpace
Copy link
Member

I also prefer @jacques-n suggestion. I'm not a big fan of what Arrow does here and it is the only system that does something like this.

@rtpsw
Copy link
Author

rtpsw commented May 11, 2022

I don't think the plan should actually carry a timezone concept. It seems like this is really a function. For, something like:

as_date(<value:string>, <timezone:string>)

OR

`as_date(value:string, timeszone:enum[GMT,...])

While a particular language may treat timezone as data type, I think we can handle via either an enum or string in Substrait. Are there any cases where that is insufficient?

I don't know what the right approach is, but I suspect this approach is not complete. It seems to push the issue to Substrait consumers, which would need to define a mapping for as_date. My understanding is that if the as_date function is not standardized then a Substrait plan including it would not be consumer-agnostic, and if the as_date function is standardized then the specification of timezone values must also be.

@jvanstraten
Copy link
Contributor

IMO the most correct solution would be for a cast from a string to a timestamp_tz to assume UTC if the timezone is not specified in the string, or convert to UTC if it is. A cast from a string to a timestamp should ignore any provided timestamp, I guess. For your use case I would argue that it should be a cast to timestamp, followed by a function with signature (timestamp, string) -> timestamp_tz that does timezone conversion, and/or one with the signature (required-enum, timestamp) -> timestamp_tz for when the timezone is statically known. The inverse operations should then also exist. For numeric offsets and when DST shenanigans aren't necessary or are already resolved you could also just add/subtract an interval representing the difference between the timezones.

However, I generally dislike trying to abstract all this stuff into a generic cast operation that no one has yet dared to specify. What if I cast "two days ago" to a timestamp, or "twenty-three" to an integer? It should probably tell me to pound sand, but it's not like "1/2/2022 2:20:22" is unambiguous without a format string either. I'd suppose you'd be talking about February 1st, but it could also be January 2nd, and that you mean 02:20:22 rather than 14:20:22 as there is no AM/PM indicator. So, if I were to extend your suggestion to this, shouldn't each timestamp also include a format string? The main issue here is that timestamps are a mess in general and that dumping this conversion on a cast operation that isn't clearly specified would be result in a situation where you only think you have consumer-agnostic plans, which is way worse than a plan that uses some extension that isn't well-known. Therefore, IMO, everything that isn't abundantly obvious and/or completely nailed down in the Substrait spec should be an extension, timestamp/string conversions included (beyond ISO format or whatever anyway).

Note that there's nothing stating that a consumer is only allowed to understand its own extensions and/or whatever is in the Substrait repo. The same applies for producers. After all, if a producer only understands its own extensions, and a consumer only understands its own extensions, why have extensions at all? I don't know how those function extension standards will proliferate (maybe the set of well-known functions will grow rapidly, maybe someone will step up to make a converter between various extensions, maybe everyone will just have to try to implement everything else for whatever they want to communicate with), but complex stuff beyond the common core has to be pushed down to consumers and producers. If Substrait were to instead try to be a superset of the capabilities of every framework out there it would simply be infeasible for any mere mortal to specify or use. What is and isn't included in that "common core" is then up for discussion, but for this particular case, I would argue that requiring 1/2/2022 2:20:22 UTC to semantically differ from 1/2/2022 3:20:22 UTC+1 is niche and/or only needed because of implementation details, and therefore instead warrants the use of an extension type.

Currently, the timezone interpretation of a TimezoneTZ is implicit, and at least in Arrow it defaults to UTC.

The spec at least implies that it is supposed to be UTC-based, as it states that the Arrow analog for timestamp_tz is timestamp<micro;utc>. That's what I implemented in the validator, anyway. But the spec should be made more explicit, at least for timestamp_tz literals.

@jacques-n
Copy link
Contributor

jacques-n commented May 11, 2022

The spec at least implies that it is supposed to be UTC-based

It's a minor point but I think it is important to make. The spec only declares the logical meaning: a point on the timeline. This is intentional as there are many ways one could represent a point on the timeline. The encoding definitions are where the serialization is specified and it is: "Timestamp in units of microseconds since the UNIX epoch".

as_date function is not standardized then a Substrait plan including it would not be consumer-agnostic

As @jvanstraten mentioned, extensions are designed for producers and consumers to be able shared even if they aren't specified in core Substrait. I'd also note that this seems like a common enough pattern that we should add some functions in Substrait core to support some of these use cases.

For reference, I believe Oracle and Snowflake call this TO_DATE or similar.

@westonpace
Copy link
Member

I don't know what the right approach is, but I suspect this approach is not complete.

It's not really complete. The data author (avoiding the word producer as it is overloaded here) has no way of indicating what timezone a timestamp column relates to. This means the data reader will always know exactly what instant something occurred but wouldn't know if that event occurred in the evening or the morning. However, what we have is sufficient for a large majority of use cases. Most data authors do not record the timezone alongside a timestamp in the actual data files (e.g. parquet files) and forcing users to pick or infer something is misleading.

The problem with timezone in the timestamp data type

Putting the timezone in the data type (in Arrow) has been very confusing for users. For example, when converting from a timezone to a string this timezone is used. However, when a user is going from a timezone to a string the most common desire is to use the timezone of the data reader and NOT the data author. So using the timezone in the data type is wrong in this case. So in Arrow we have to cast to the timezone of the reader and then convert to string and this is just an awkward flow that is highly unintuitive and error-prone.

Alternative

If we want to support a complete specification of both timestamp and the timezone the timestamp was recorded in then I would argue highly for a data type that included the timezone as part of the value. In other words, each row in the array could have its own timezone. This can be approximated today with a struct column or with two separate columns.

@icexelloss
Copy link
Contributor

icexelloss commented May 11, 2022

The original request of adding tz info to timestamp_tz comes from the requirement to express the following query

Filter rows in a table to 10 AM New York time, while the original timestamp is in UTC

From a discussion from Arrow mailing list, we come up with following Arrow code:

  cp::Expression predicate = cp::call(
                                      "equal",
                                      {cp::call("cast",
                                                {cp::call("cast",
                                                          {cp::field_ref("time")},
                                                          cp::CastOptions::Safe(arrow::timestamp(arrow::TimeUnit::NANO, timezone)))},
                                                cp::CastOptions::Safe(arrow::time64(arrow::TimeUnit::NANO))),
                                       cp::literal(std::make_shared<arrow::Time64Scalar>(time, arrow::time64(arrow::TimeUnit::NANO)))}
                                      );

which basically:
(1) cast the timestamp from UTC to America/NewYork
(2) cast the timestamp to time64 to extract the hour/minute/second information in New York timezone
(3) compare with a time value provided (10 AM)

From the discussion, it appears to me the Arrow query might not be the correct way of doing it? I am not sure how the original query can be represented in Substrait and Arrow so I am curious to hear what are alternatives.

@westonpace thoughts?

(The request for adding tz to timestamp_tz is to perform the cast in step (1))

@westonpace
Copy link
Member

We're missing temporal functions in Substrait at the moment (maybe Rok will be able to lend his expertise, I can ask). The Substrait equivalent should be something like:

# Given a string column interpret the strings as UTC (must be specified if the strings don't
# already have a timezone specifier in them like 2022-05-11T17:00:12Z or
# 2022-05-11T17:00:12+00:00)
time_column = as_time("time", "UTC")
# Time literals in Substrait are always 64 bit integers from the epoch
time_literal = to_utc_timestamp("2022-05-11T10:00:00-04:00")
predicate = time_column < time_literal

This would just use basic temporal comparison (e.g. column < literal) which should be faster than extracting the individual pieces.

If you still wanted to extract the individual pieces they would be:

hour = get_hour(timestamp_column, "New York")

Whatever the Substrait plan ends up doing it will be Arrow's responsibility to figure out the equivalent Arrow plan.

@cpcloud
Copy link
Contributor

cpcloud commented May 11, 2022

a data type that included the timezone as part of the value

The tradeoff in complexity that supporting this adds is not worth it IMO:

  1. How would we spell the type of a column that has values with multiple possible time zones?
  2. Engines now have to operate per-value, or build complex code paths to operate on batches of elements that are known to be contiguously time-zoned.

This doesn't seem worth the effort at all to me.

@icexelloss
Copy link
Contributor

icexelloss commented May 11, 2022

This would just use basic temporal comparison (e.g. column < literal) which should be faster than extracting the individual pieces.

I am not sure how to use basic temporal comparison to express "Filter rows to 10 AM New York time" without extracting individual pieces. Can you give an example?

hour = get_hour(timestamp_column, "New York")

This is a one possibility. Although I felt it's slightly harder to use.

i.e.

get_hour(table['time'], 'America/New_York') == 10 and get_minute(table['time']) == 0 and get_seconds(table['time'] == 10)

comparing to

table['time'].cast(Timestamp(tz='America/New_York')).cast('time64') == datetime.time(10, 0, 0)

(which is similar to the pandas API that does this)

@westonpace
Copy link
Member

I'm not recommending we do that work right now. I'm just pointing out that the current data type does not capture all imaginable use cases and that is both ok and a very intentional decision.

@westonpace
Copy link
Member

westonpace commented May 11, 2022

I am not sure how to use basic temporal comparison to express "Filter rows to 10 AM New York time" without extracting individual pieces. Can you give an example?

In Arrow, you can use the less, less_equal, greater, and greater_equal functions:

import pyarrow as pa
import pyarrow.compute as pc

times_as_strings = ['2022-05-11T17:00:12+00:00', '2022-04-11T17:00:12+00:00', '2022-03-11T17:00:12+00:00']
times_as_timestamps = pc.strptime(times_as_strings, format='%Y-%m-%dT%H:%M:%S+00:00', unit='s')
# 1648771200 is 2022-04-01T00:00:00Z
end_point_literal = pa.scalar(1648771200, pa.timestamp('s'))

print(pc.less(times_as_timestamps, end_point_literal))
# [
#   false,
#   false,
#   true
# ]
print(pc.filter(times_as_timestamps, pc.less(times_as_timestamps, end_point_literal)))
# [
#   2022-03-11 17:00:12
# ]

@westonpace
Copy link
Member

Ah, rereading your comments I think I misunderstood your question. You are looking for all events that happened earlier than 10AM New York time regardless of the day? In that case temporal extraction is the correct answer. Still, you should be able to do this in Substrait:

get_hour(times, "New York")

@jvanstraten
Copy link
Contributor

The encoding definitions are where the serialization is specified and it is: "Timestamp in units of microseconds since the UNIX epoch".

Oops, my bad. I thought that the UNIX epoch was essentially a timestamp, i.e. with implicit timezone, but it's explicitly defined as UTC. That in turn makes the comment for timestamp confusing though, since it's defined in the same way. It wouldn't hurt to be more explicit here, but there is a "add lots of comments to the proto files" note on my mental todo list, anyway.

Rereading the spec, the supported range for timestamp_tz in the spec also explicitly specifies UTC; I missed that.

Rereading this thread too now I'm getting really confused about what the actual question was.

How should a timezone value be represented within a Substrait plan?

A timezone is either an hour/minute delta between -13:00 and +13:00 or a region string that includes DST shenanigans and thus only relates to said hour/minute delta by way of politics and the current date. So, it'd depend on which one you're after. For the former the software dev in me would use an i16 signifying minutes, but an interval_day would be the closest matching logical type currently defined by Substrait. For the latter it'd have to be some kind of string type. Ultimately, it'd be up to you.

A use case I have in mind for a timezone value in Substrait is casting a string-typed field, which could have values like 2/2/2022 2:20:22, to a timestamp-typed field;

Okay, so I see a function with two arguments, namely the timestamp to be converted and a format string, unless you'd really want to make the latter implicit and (ab)use the cast operation for it...

this conversion requires a timezone value to be meaningful.

Wait, no. This is only true if you're converting string representations of timestamps without timezones to a timestamp_tz. This is what timestamp is for. Converting said strings to a timestamp is not ambiguous, because both lack the timezone field; timestamp exactly represents how people normally think of and write timestamps. It's the conversion between timestamp and timestamp_tz that ultimately requires that a timezone is specified somehow, knowledge of local laws w.r.t. DST, etc. The timezone to timestamp_tz conversion in fact also needs to know whether DST is active or not, otherwise it's ambiguous in the hour after the clock is wound back.

I'd propose that changing the Substrait spec of TimezoneTZ to allow a tz attribute.

This is where I'm really getting lost. TimezoneTZ doesn't exist, so I think everyone's been assuming that you meant to say timestamp_tz. But I don't understand how that would relate to your question at all. Converting from string to timestamp_tz you'd need a timezone at the input, not at the output.

Currently, the timezone interpretation of a TimezoneTZ is implicit, and at least in Arrow it defaults to UTC.

It's an interesting question whether a timestamp + timezone pair would be compliant with Substrait's timestamp_tz definition; I'm actually not sure why it wouldn't be. I'd say it'd be consumer-specific behavior to semantically use values with a timestamp other than UTC, but if implementations wouldn't be allowed to store Substrait types by means of some physical type that is a superset thereof there'd be a lot more problems (especially with interval types). But I'm not sure how this relates to the rest of your question, at least as I interpreted it.

Can you point out where I'm going wrong in understanding your question? And is this thread starting to get really off-topic or am I just completely lost here?

@rtpsw
Copy link
Author

rtpsw commented May 12, 2022

@jvanstraten, let's see if I can clarify. I suspect a good part of the confusion is due to my Substrait-producer-to-consumer (Ibis->Substrait->Arrow) point of view, as opposed to a pure Substrait one.

So, it'd depend on which one you're after.

I use the region string interpretation.

I see a function with two arguments, namely the timestamp to be converted and a format string

Agreed. My starting point for thinking about this was Arrow's cast function. I naturally went to using an existing function and didn't consider defining an extension. Since Arrow's cast function takes a type argument, in this case arrow::timestamp(arrow::TimeUnit::NANO, timezone), and is parsed this way in Arrow-Substrait too, I wanted a way to encode this type in the Substrait plan, and this led me to the question here. In a local branch, I was able to get things to work by adding a tz member to TimestampTZ (in load-diff of proto/substrait/type.proto) and encoding it in the Substrait translation.

Wait, no.

You're right. I wrote conversion to a "timestamp" field when I meant "timestamp + timezone".

This is where I'm really getting lost. TimezoneTZ doesn't exist

I am referring to Substrait protobuf's TimestampTZ (in load-diff of proto/substrait/type.proto).

@jvanstraten
Copy link
Contributor

My starting point for thinking about this was Arrow's cast function. I naturally went to using an existing function and didn't consider defining an extension.

Okay, then part of the confusion here arises from the fact that Arrow's Substrait consumer wrongly treats casts as functions with the name cast as casts (that function name isn't even valid, and function overloads can't be distinguished by return type), rather than using Substrait's cast expression (though in fairness I'm not 100% sure this expression already existed back then). Also be aware that all functions in Substrait are fundamentally extensions, it's just that some of those extensions are defined by Substrait itself.

In general, having worked on the initial PR for the Arrow Substrait consumer: it's not only incomplete, there's a lot wrong with it, both due to not wanting to implement all the corner cases right off the bat and due misinterpretations of the Substrait spec. Please don't treat it as if it's set in stone. I only stopped working on it because essentially I got distracted trying to understand Substrait properly; it's now been months and I still don't feel like I've grokked it completely...

I am referring to Substrait protobuf's TimestampTZ (in load-diff of proto/substrait/type.proto).

So this is about adding a field to the type of timestamp_tz, rather than to values of this type. That is, a type parameter; something like timestamp_tz<"UTC">, making it what Substrait defines to be a compound type. A lot of the preceding discussion was assuming it'd be part of the values, I think.

I don't personally think we should make this change, especially for something as finicky as timezones. Half the Substrait spec would then have to be about which timezone identifiers are legal and how they should behave.

What I do think should be added to Substrait is a way to specify compound extension types, so Arrow and Ibis can properly define an extension for this. Without compound extension types, this would require defining a separate type (or type variation) for all timezones, which would get very annoying very fast. IIRC there were also types like this that have uncountably infinite parameterizations, which absolutely need parameterizable extension types (I don't remember which, but I'm probably thinking of unions). This is a big can of worms that deserves its own issue, though.

I propose we try to limit the discussion in this thread to the original question, and open separate issues for the other things that have come up, if needed.

@westonpace
Copy link
Member

Okay, then part of the confusion here arises from the fact that Arrow's Substrait consumer wrongly treats casts as functions with the name cast as casts (that function name isn't even valid, and function overloads can't be distinguished by return type), rather than using Substrait's cast expression (though in fairness I'm not 100% sure this expression already existed back then). Also be aware that all functions in Substrait are fundamentally extensions, it's just that some of those extensions are defined by Substrait itself.

Good catch. I've created https://issues.apache.org/jira/browse/ARROW-16550

I propose we try to limit the discussion in this thread to the original question, and open separate issues for the other things that have come up, if needed.

So the original question was to add a timezone to the type parameter. So let's look at a specific example.

Problem statement: Given a column of instants (TIMESTAMP_TZ), get the hour of the day in the America/New York time zone.

The "parameterized type" approach is:

get_hour(cast(field_ref("timestamps"), TIMESTAMP<"America/New York">))

The "parameterized function" approach is:

get_hour(field_ref("timestamps"), "America/New York")

So, in my understanding, this issue boils down to a choice between the above two approaches. I am a fan of "parameterized function" because I think it is more intuitive and, in my experiences with Arrow (which uses the parameterized type approach), it has been really hard for people to understand the purpose of the type parameter.

@icexelloss
Copy link
Contributor

icexelloss commented May 12, 2022

So, in my understanding, this issue boils down to a choice between the above two approaches.

I think your understanding is spot on. And it's probably a longer discussion on which one is better and they are not that different. Pandas/Python does the former so most Python users would be more familiar with that. I have my personal preference to be the former but that's because I am more used to Pandas API, and with latter you would need to add "timezone" param to potentially many functions, instead of just having that on the Timestamp type. Also the former is something that Arrow already supports Today and I am bit surprised to hear from Weston that this is not the way to go (not sure if that has consensus from Arrow side already). But I am not going to argue hard for it, both are workable IMO. But if the second is the way to go, I would suggest making this function get_time instead of get_hour

get_time(field_ref("timestamps"), "America/New York") == time(10, 0, 0)

@rtpsw
Copy link
Author

rtpsw commented May 12, 2022

@jvanstraten, thanks for this explanation. I still have a way to go myself to grok Ibis and Substrait.

Okay, then part of the confusion here arises from the fact that Arrow's Substrait consumer wrongly treats casts as functions with the name cast as casts (...), rather than using Substrait's cast expression

My understanding is that Ibis precedes Arrow in this handling of cast, because its ibis.expr.types.generic.AnyValue.cast function returns an instance of ibis.expr.operations.ValueOp, often by calling to_expr() on an instance of ibis.expr.operations.Cast, which is derived from ValueOp. This ValueOp gets translated in Ibis-Substrait to a ScalarFunction rather than a Cast expression, which forces Arrow to handle it this way too.

(that function name isn't even valid, and function overloads can't be distinguished by return type

You're right, though note that the existing code still works (not to say this is a good thing). I've seen Substrait happily create multiple extension-functions with the same cast name, because each instance has a different return type that is not expressed in the extension-function. And then Arrow happily infers the correct return-type of a cast occurrence by looking at the output-type of the Substrait expression in which the cast is used.

@jvanstraten
Copy link
Contributor

I've never seen a single line of code in Ibis, or even used it personally, so I didn't follow much of that. But it sounds to me like the Ibis producer is the problem, then.

You're right, though note that the existing code still works (not to say this is a good thing). [...]

You read my mind with that paragraph! That's what the validator is for :) Unfortunately it doesn't validate extensions and function calls yet, because type expressions have been a major headache for me thus far. It generates warnings to say it doesn't check them yet, though.

With respect to the function parameterization vs type parameterization discussion, I have no opinion either way.

@rtpsw
Copy link
Author

rtpsw commented May 12, 2022

That's what the validator is for :)

I wasn't aware of this validator. Side question: could you point to documentation about it?

@jvanstraten
Copy link
Contributor

I think I linked it in a previous comment, but here. It hasn't been merged yet (it's stable, but pending review), so that's why there's no mention on the website yet.

@westonpace
Copy link
Member

FYI, for context, the decision to encode time zones as separate options (and thus use the "parameterized function" approach) has some previous discussion here: #2 (comment)

@rok
Copy link
Contributor

rok commented May 13, 2022

I don't know much about Substrait at this point but I have implemented a bunch of temporal kernels in Arrow so I can speak from that experience.

If timestamps are stored in UTC then local timezone is needed when "viewing", e.g. extracting time components, strftime, days/months/..._between, floor/ceil/round, is_daylightsavings, is_leap_year. The example @icexelloss gives for hour/minute/second extraction in arrow kernel looks like this. UTC timestamp is converted to local epoch time and is rounded to a desired unit. Timezone is needed, but the source does not matter much.

On one hand timestamp type parametrisation in Arrow turns out to be unpractical because timezone needs to be extracted at runtime which is not great for templating kernels. On the other hand type parametrisation saves users some work and reduces error surface because timezone is lugged around with type. I don't feel strongly about either approach, but I really disliked templating complications around parametric type.

I don't see timestamp type and timezone as indivisible as long as timestamp in UTC. A UTC timestamp represents an unambiguous point on time continuum and will always exist. Local timestamps on the other hand would arguably have more use for timezone information as they have near zero meaning without it.

I can sketch out a PR for a subset temporal functions as they are currently in Arrow to help inform this discussion.

@rok
Copy link
Contributor

rok commented Jun 8, 2022

Sorry I didn't get to this sooner. I've opened a PR #222.

@westonpace
Copy link
Member

I believe this was discussed further, and ultimately addressed, in #222. There are some functions which take the time zone as an argument (e.g. extracting a year from a timestamp_tz). These functions accept the time zone as a string today and add that it must be a tzdb time zone string. I think it would be an interesting idea to create a new data type / string-variant purely for time zones :) but I'll leave that effort for future development.

rkondakov pushed a commit to rkondakov/substrait that referenced this issue Nov 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants