PARQUET-861: Document INT96 timestamps #49
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -144,6 +144,18 @@ example, there is no requirement that a large number of days should be | |
expressed as a mix of months and days because there is not a constant | ||
conversion from days to months. | ||
|
||
### INT96 timestamps (also called IMPALA_TIMESTAMP) | ||
|
||
_(deprecated)_ Timestamps saved as an `int96` are made up of the nanoseconds in the day | ||
(first 8 bytes) and the Julian day (last 4 bytes). No timezone is attached to this value. | ||
I think we need to specify the signedness: I believe the first 8 bytes are a uint64 and the last 4 bytes are an int32. |
I remember reading somewhere that the signedness is explicitly specified as undefined, because neither the date nor the time component can be so large as to reach the most significant bit, but unfortunately I can't find where I read this. However, this can only be true if we don't allow dates before the epoch of the Julian date, because that would make the date part negative. |
||
To convert the timestamp into nanoseconds since the Unix epoch, 00:00:00.000000 | ||
on 1 January 1970, the following formula can be used: | ||
`(julian_day - 2440588) * (86400 * 1000 * 1000 * 1000) + nanoseconds`. | ||
The formula looks good to me. Technically, Julian days refer to the noon of a day, so everywhere else you'd find "2440587.5" instead. However, Boost uses only the day part of that, so for Impala your formula looks correct. We might consider changing it to seconds since the epoch, since that seems more commonly used: |
||
The magic number `2440588` is the Julian day for 1 January 1970. | ||
|
||
Note that these timestamps are the common usage of the `int96` physical type and are not | ||
marked with a special logical type annotation. | ||
|
||
## Embedded Types | ||
|
||
### JSON | ||
|
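For reviewers who want to sanity-check the conversion formula in the diff, here is a minimal Python sketch of it. The function and constant names are hypothetical and not part of the Parquet format documentation. It assumes the Julian day and the nanoseconds of the day have already been extracted as integers.

```python
# Hypothetical helper names; only the formula itself comes from the diff above.
JULIAN_DAY_OF_UNIX_EPOCH = 2440588          # Julian day for 1 January 1970
NANOS_PER_DAY = 86400 * 1000 * 1000 * 1000  # nanoseconds in one day


def int96_to_unix_nanos(julian_day: int, nanoseconds_of_day: int) -> int:
    """Apply (julian_day - 2440588) * (86400 * 10**9) + nanoseconds."""
    return (julian_day - JULIAN_DAY_OF_UNIX_EPOCH) * NANOS_PER_DAY + nanoseconds_of_day


# Midnight on the epoch day maps to 0 nanoseconds since the Unix epoch.
print(int96_to_unix_nanos(2440588, 0))
```

Python integers do not overflow, so the sketch sidesteps the signedness question raised in the review; a fixed-width implementation would need to decide that explicitly.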
+@zivanfi, who is working on making timestamps more uniform across several query engines.
Hive also writes Parquet files in this format, although Impala seems to have used it first. It would be good if we could include documentation for 12-byte binary timestamps in either #46 or here. Otherwise we'd be deprecating a widely used format with no clear alternative.

Agreed, let's add a paragraph here about where it is used.
The deprecation JIRA should link to the corresponding Impala and Hive JIRAs.
I think we're signing up to read that type forever for compatibility with existing files, while providing a better alternative for writing timestamps.

Just a suggestion from a random passer-by... I think it might be nice to document the byte order of these timestamps as well, as they appear to be little endian whereas other primitive ints in Parquet are big endian. At least that seems to be the issue that led us to this PR in the first place :)
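
To make the byte-order question concrete, here is a rough Python sketch of how a 12-byte value could be split into its two parts. It rests on assumptions that this PR does not settle: little-endian byte order (as observed in the comment above), an unsigned 64-bit nanoseconds field, and a signed 32-bit Julian day (as one reviewer suggested). The function name `decode_int96` is hypothetical.

```python
import struct
from typing import Tuple


def decode_int96(raw: bytes) -> Tuple[int, int]:
    """Split a 12-byte value into (julian_day, nanoseconds_of_day).

    Assumptions (not confirmed by the spec text under review):
    little-endian layout, uint64 nanoseconds first, int32 Julian day last.
    """
    if len(raw) != 12:
        raise ValueError("expected exactly 12 bytes, got %d" % len(raw))
    # "<" selects little-endian with no padding; Q = uint64, i = int32.
    nanoseconds_of_day, julian_day = struct.unpack("<Qi", raw)
    return julian_day, nanoseconds_of_day


# Eight zero bytes (midnight) followed by 2440588 (0x00253D8C) in little-endian order.
print(decode_int96(bytes(8) + bytes([0x8C, 0x3D, 0x25, 0x00])))
```

If the signedness turns out to be defined differently, only the `struct` format string would need to change.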