PARQUET-861: Document INT96 timestamps #49

Closed
wants to merge 2 commits into from
12 changes: 12 additions & 0 deletions LogicalTypes.md
@@ -144,6 +144,18 @@ example, there is no requirement that a large number of days should be
expressed as a mix of months and days because there is not a constant
conversion from days to months.

### INT96 timestamps (also called IMPALA_TIMESTAMP)

_(deprecated)_ Timestamps saved as an `int96` are made up of the nanoseconds in the day
Contributor

+@zivanfi, who is working on making timestamps more uniform across several query engines.

Hive also writes Parquet files in this format, although Impala seems to have used it first. It would be good if we could include documentation for 12-byte binary timestamps in either #46 or here. Otherwise we'd be deprecating a widely used format with no clear alternative.

Member

Agreed, let's add a paragraph here about where it is used.
The deprecation JIRA should link to the corresponding Impala and Hive JIRAs.
I think we're signing up to read that type forever for compatibility with existing files, while providing a better alternative for writing timestamps.

Just a suggestion from a random passerby... I think it might be nice to document the byte order of these timestamps as well, as they appear to be little endian whereas other primitive ints in Parquet are big endian. At least that seems to be the issue that led us to this PR in the first place :)

(first 8 bytes) and the Julian day (last 4 bytes). No timezone is attached to this value.
Contributor

I think we need to specify the signedness: the first 8 bytes are a uint64 and the last 4 bytes are an int32, I believe.

Contributor

I remember reading somewhere that the signedness is explicitly specified as undefined, because neither the date nor the time component can be so large as to reach the most significant bit, but unfortunately I can't find where I read this. However, this can only be true if we don't allow dates before the epoch of Julian date, because that would make the date part negative.

To convert the timestamp into nanoseconds since the Unix epoch, 00:00:00.000000
on 1 January 1970, the following formula can be used:
`(julian_day - 2440588) * (86400 * 1000 * 1000 * 1000) + nanoseconds`.
Contributor

The formula looks good to me. Technically, Julian days refer to the noon of a day, so everywhere else you'd find "2440587.5" instead. However, boost uses only the day part of that, so for Impala your formula looks correct.

We might consider changing it to seconds since the epoch, since that seems more commonly used: (julian_day - 2440588) * 86400 + nanoseconds / (1000 * 1000 * 1000)
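As a rough illustration of the seconds-based variant, the sketch below splits the nanoseconds-in-day value into whole seconds and a sub-second remainder, so no precision is lost to floating-point division. The function name `to_unix_seconds` and the constant names are hypothetical and do not come from the PR.

```python
# Hypothetical names; only the constants' values come from the formula above.
JULIAN_DAY_OF_UNIX_EPOCH = 2440588
SECONDS_PER_DAY = 86400
NANOS_PER_SECOND = 1000 * 1000 * 1000


def to_unix_seconds(julian_day, nanos_of_day):
    """Return (seconds since the Unix epoch, leftover nanoseconds)."""
    # Integer divmod keeps the sub-second part instead of rounding it away.
    whole_seconds, leftover_nanos = divmod(nanos_of_day, NANOS_PER_SECOND)
    days_since_epoch = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    return days_since_epoch * SECONDS_PER_DAY + whole_seconds, leftover_nanos


# One day and 1.5 seconds after the epoch.
print(to_unix_seconds(2440589, 1_500_000_000))  # (86401, 500000000)
```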

The magic number `2440588` is the Julian day for 1 January 1970.
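One way to sanity-check this constant is with Python's standard `datetime` module. The sketch assumes the integer (noon-based) Julian Day Number convention discussed in the review, and the helper name `julian_day_number` is hypothetical.

```python
from datetime import date

# date.toordinal() counts 0001-01-01 (proleptic Gregorian) as day 1.
# That date has Julian Day Number 1721426, hence an offset of 1721425.
ORDINAL_TO_JULIAN_DAY = 1721425


def julian_day_number(d):
    """Integer Julian Day Number for a proleptic Gregorian calendar date."""
    return d.toordinal() + ORDINAL_TO_JULIAN_DAY


print(julian_day_number(date(1970, 1, 1)))  # 2440588
```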

Note that these timestamps are the common usage of the `int96` physical type and are not
marked with a special logical type annotation.
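The layout and formula described in this section can be sketched as a small decoder. This is a minimal illustration, not a reference implementation: it assumes little-endian byte order, an unsigned 64-bit nanoseconds field, and a signed 32-bit Julian day field, as suggested in the review comments rather than stated in the proposed text. The function name `int96_to_unix_nanos` is hypothetical.

```python
import struct

JULIAN_DAY_OF_UNIX_EPOCH = 2440588
NANOS_PER_DAY = 86400 * 1000 * 1000 * 1000


def int96_to_unix_nanos(raw):
    """Convert a 12-byte INT96 timestamp to nanoseconds since the Unix epoch."""
    if len(raw) != 12:
        raise ValueError("an INT96 timestamp is exactly 12 bytes")
    # Assumed layout: "<" little endian, "Q" uint64 nanoseconds in the day
    # (first 8 bytes), "i" int32 Julian day (last 4 bytes).
    nanos_of_day, julian_day = struct.unpack("<Qi", raw)
    return (julian_day - JULIAN_DAY_OF_UNIX_EPOCH) * NANOS_PER_DAY + nanos_of_day


# Midnight on 1 January 1970 decodes to zero.
print(int96_to_unix_nanos(struct.pack("<Qi", 0, 2440588)))  # 0
```

Because no timezone is attached to the value, the result is only meaningful relative to whatever timezone convention the writer used.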

## Embedded Types

### JSON