-
Notifications
You must be signed in to change notification settings - Fork 165
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Representing a timezone value in Substrait #190
Comments
cc @icexelloss |
I don't think the plan should actually carry a timezone concept. It seems like this is really a function. For, something like:
OR `as_date(value:string, timeszone:enum[GMT,...]) While a particular language may treat timezone as data type, I think we can handle via either an enum or string in Substrait. Are there any cases where that is insufficient? |
For reference to your other questions @cpcloud, the spec doesn't specify the specific physical representation because that is encoding specific. It simply specifies that the value is an instance on the timeline. The protobuf physical representation does call out the representation:
|
I also prefer @jacques-n suggestion. I'm not a big fan of what Arrow does here and it is the only system that does something like this. |
I don't know what the right approach is, but I suspect this approach is not complete. It seems to push the issue to Substrait consumers, which would need to define a mapping for |
IMO the most correct solution would be for a cast from a string to a However, I generally dislike trying to abstract all this stuff into a generic cast operation that no one has yet dared to specify. What if I cast "two days ago" to a Note that there's nothing stating that a consumer is only allowed to understand its own extensions and/or whatever is in the Substrait repo. The same applies for producers. After all, if a producer only understands its own extensions, and a consumer only understands its own extensions, why have extensions at all? I don't know how those function extension standards will proliferate (maybe the set of well-known functions will grow rapidly, maybe someone will step up to make a converter between various extensions, maybe everyone will just have to try to implement everything else for whatever they want to communicate with), but complex stuff beyond the common core has to be pushed down to consumers and producers. If Substrait were to instead try to be a superset of the capabilities of every framework out there it would simply be infeasible for any mere mortal to specify or use. What is and isn't included in that "common core" is then up for discussion, but for this particular case, I would argue that requiring
The spec at least implies that it is supposed to be UTC-based, as it states that the Arrow analog for |
It's a minor point but I think it is important to make. The spec only declares the logical meaning: a point on the timeline. This is intentional as there are many ways one could represent a point on the timeline. The encoding definitions are where the serialization is specified and it is: "Timestamp in units of microseconds since the UNIX epoch".
As @jvanstraten mentioned, extensions are designed for producers and consumers to be able shared even if they aren't specified in core Substrait. I'd also note that this seems like a common enough pattern that we should add some functions in Substrait core to support some of these use cases. For reference, I believe Oracle and Snowflake call this |
It's not really complete. The data author (avoiding the word producer as it is overloaded here) has no way of indicating what timezone a timestamp column relates to. This means the data reader will always know exactly what instant something occurred but wouldn't know if that event occurred in the evening or the morning. However, what we have is sufficient for a large majority of use cases. Most data authors do not record the timezone alongside a timestamp in the actual data files (e.g. parquet files) and forcing users to pick or infer something is misleading. The problem with timezone in the timestamp data typePutting the timezone in the data type (in Arrow) has been very confusing for users. For example, when converting from a timezone to a string this timezone is used. However, when a user is going from a timezone to a string the most common desire is to use the timezone of the data reader and NOT the data author. So using the timezone in the data type is wrong in this case. So in Arrow we have to cast to the timezone of the reader and then convert to string and this is just an awkward flow that is highly unintuitive and error-prone. AlternativeIf we want to support a complete specification of both timestamp and the timezone the timestamp was recorded in then I would argue highly for a data type that included the timezone as part of the value. In other words, each row in the array could have its own timezone. This can be approximated today with a struct column or with two separate columns. |
The original request of adding tz info to timestamp_tz comes from the requirement to express the following query
From a discussion from Arrow mailing list, we come up with following Arrow code:
which basically: From the discussion, it appears to me the Arrow query might not be the correct way of doing it? I am not sure how the original query can be represented in Substrait and Arrow so I am curious to hear what are alternatives. @westonpace thoughts? (The request for adding tz to timestamp_tz is to perform the cast in step (1)) |
We're missing temporal functions in Substrait at the moment (maybe Rok will be able to lend his expertise, I can ask). The Substrait equivalent should be something like:
This would just use basic temporal comparison (e.g. column < literal) which should be faster than extracting the individual pieces. If you still wanted to extract the individual pieces they would be:
Whatever the Substrait plan ends up doing it will be Arrow's responsibility to figure out the equivalent Arrow plan. |
The tradeoff in complexity that supporting this adds is not worth it IMO:
This doesn't seem worth the effort at all to me. |
I am not sure how to use basic temporal comparison to express "Filter rows to 10 AM New York time" without extracting individual pieces. Can you give an example?
This is a one possibility. Although I felt it's slightly harder to use. i.e.
comparing to
(which is similar to the pandas API that does this) |
I'm not recommending we do that work right now. I'm just pointing out that the current data type does not capture all imaginable use cases and that is both ok and a very intentional decision. |
In Arrow, you can use the
|
Ah, rereading your comments I think I misunderstood your question. You are looking for all events that happened earlier than 10AM New York time regardless of the day? In that case temporal extraction is the correct answer. Still, you should be able to do this in Substrait:
|
Oops, my bad. I thought that the UNIX epoch was essentially a Rereading the spec, the supported range for Rereading this thread too now I'm getting really confused about what the actual question was.
A timezone is either an hour/minute delta between -13:00 and +13:00 or a region string that includes DST shenanigans and thus only relates to said hour/minute delta by way of politics and the current date. So, it'd depend on which one you're after. For the former the software dev in me would use an
Okay, so I see a function with two arguments, namely the timestamp to be converted and a format string, unless you'd really want to make the latter implicit and (ab)use the cast operation for it...
Wait, no. This is only true if you're converting string representations of timestamps without timezones to a
This is where I'm really getting lost.
It's an interesting question whether a timestamp + timezone pair would be compliant with Substrait's Can you point out where I'm going wrong in understanding your question? And is this thread starting to get really off-topic or am I just completely lost here? |
@jvanstraten, let's see if I can clarify. I suspect a good part of the confusion is due to my Substrait-producer-to-consumer (Ibis->Substrait->Arrow) point of view, as opposed to a pure Substrait one.
I use the region string interpretation.
Agreed. My starting point for thinking about this was Arrow's
You're right. I wrote conversion to a "timestamp" field when I meant "timestamp + timezone".
I am referring to Substrait protobuf's |
Okay, then part of the confusion here arises from the fact that Arrow's Substrait consumer wrongly treats casts as functions with the name In general, having worked on the initial PR for the Arrow Substrait consumer: it's not only incomplete, there's a lot wrong with it, both due to not wanting to implement all the corner cases right off the bat and due misinterpretations of the Substrait spec. Please don't treat it as if it's set in stone. I only stopped working on it because essentially I got distracted trying to understand Substrait properly; it's now been months and I still don't feel like I've grokked it completely...
So this is about adding a field to the type of I don't personally think we should make this change, especially for something as finicky as timezones. Half the Substrait spec would then have to be about which timezone identifiers are legal and how they should behave. What I do think should be added to Substrait is a way to specify compound extension types, so Arrow and Ibis can properly define an extension for this. Without compound extension types, this would require defining a separate type (or type variation) for all timezones, which would get very annoying very fast. IIRC there were also types like this that have uncountably infinite parameterizations, which absolutely need parameterizable extension types (I don't remember which, but I'm probably thinking of unions). This is a big can of worms that deserves its own issue, though. I propose we try to limit the discussion in this thread to the original question, and open separate issues for the other things that have come up, if needed. |
Good catch. I've created https://issues.apache.org/jira/browse/ARROW-16550
So the original question was to add a timezone to the type parameter. So let's look at a specific example. Problem statement: Given a column of instants (TIMESTAMP_TZ), get the hour of the day in the The "parameterized type" approach is:
The "parameterized function" approach is:
So, in my understanding, this issue boils down to a choice between the above two approaches. I am a fan of "parameterized function" because I think it is more intuitive and, in my experiences with Arrow (which uses the parameterized type approach), it has been really hard for people to understand the purpose of the type parameter. |
I think your understanding is spot on. And it's probably a longer discussion on which one is better and they are not that different. Pandas/Python does the former so most Python users would be more familiar with that. I have my personal preference to be the former but that's because I am more used to Pandas API, and with latter you would need to add "timezone" param to potentially many functions, instead of just having that on the Timestamp type. Also the former is something that Arrow already supports Today and I am bit surprised to hear from Weston that this is not the way to go (not sure if that has consensus from Arrow side already). But I am not going to argue hard for it, both are workable IMO. But if the second is the way to go, I would suggest making this function
|
@jvanstraten, thanks for this explanation. I still have a way to go myself to grok Ibis and Substrait.
My understanding is that Ibis precedes Arrow in this handling of cast, because its
You're right, though note that the existing code still works (not to say this is a good thing). I've seen Substrait happily create multiple extension-functions with the same |
I've never seen a single line of code in Ibis, or even used it personally, so I didn't follow much of that. But it sounds to me like the Ibis producer is the problem, then.
You read my mind with that paragraph! That's what the validator is for :) Unfortunately it doesn't validate extensions and function calls yet, because type expressions have been a major headache for me thus far. It generates warnings to say it doesn't check them yet, though. With respect to the function parameterization vs type parameterization discussion, I have no opinion either way. |
I wasn't aware of this validator. Side question: could you point to documentation about it? |
I think I linked it in a previous comment, but here. It hasn't been merged yet (it's stable, but pending review), so that's why there's no mention on the website yet. |
FYI, for context, the decision to encode time zones as separate options (and thus use the "parameterized function" approach) has some previous discussion here: #2 (comment) |
I don't know much about Substrait at this point but I have implemented a bunch of temporal kernels in Arrow so I can speak from that experience. If timestamps are stored in UTC then local timezone is needed when "viewing", e.g. extracting time components, strftime, days/months/..._between, floor/ceil/round, is_daylightsavings, is_leap_year. The example @icexelloss gives for hour/minute/second extraction in arrow kernel looks like this. UTC timestamp is converted to local epoch time and is rounded to a desired unit. Timezone is needed, but the source does not matter much. On one hand timestamp type parametrisation in Arrow turns out to be unpractical because timezone needs to be extracted at runtime which is not great for templating kernels. On the other hand type parametrisation saves users some work and reduces error surface because timezone is lugged around with type. I don't feel strongly about either approach, but I really disliked templating complications around parametric type. I don't see timestamp type and timezone as indivisible as long as timestamp in UTC. A UTC timestamp represents an unambiguous point on time continuum and will always exist. Local timestamps on the other hand would arguably have more use for timezone information as they have near zero meaning without it. I can sketch out a PR for a subset temporal functions as they are currently in Arrow to help inform this discussion. |
Sorry I didn't get to this sooner. I've opened a PR #222. |
I believe this was discussed further, and ultimately addressed, in #222. There are some functions which take the time zone as an argument (e.g. extracting a year from a timestamp_tz). These functions accept the time zone as a string today and add that it must be a tzdb time zone string. I think it would be an interesting idea to create a new data type / string-variant purely for time zones :) but I'll leave that effort for future development. |
How should a timezone value be represented within a Substrait plan? A use case I have in mind for a timezone value in Substrait is casting a string-typed field, which could have values like
2/2/2022 2:20:22
, to a timestamp-typed field; this conversion requires a timezone value to be meaningful. I'd propose that changing the Substrait spec ofTimezoneTZ
to allow atz
attribute. Currently, the timezone interpretation of aTimezoneTZ
is implicit, and at least in Arrow it defaults to UTC.Note that this is a follow up on the discussion here.
The text was updated successfully, but these errors were encountered: