PARQUET-861: Document INT96 timestamps #49

Closed

Conversation

xhochy
Member

@xhochy xhochy commented Feb 3, 2017

No description provided.

@gszadovszky
Contributor

Thanks for adding this. LGTM.

@zivanfi
Contributor

zivanfi commented Feb 6, 2017

I find this addition valuable, but also a little confusing, at least in its current form. As far as I can tell, the Impala timestamp is not a logical type at all, in the sense that it is not something that can be specified in the Parquet schema - it's just an INT96 value. I think this should be explicitly mentioned, because all other logical types on this page are actual logical types of Parquet itself.

@xhochy
Member Author

xhochy commented Feb 6, 2017

Currently the only known use "in the wild" for INT96 is this sort of timestamp. INT96 isn't mentioned with the physical data types, and I thought putting it into this section would be the most helpful, as this is where all timestamps are mentioned.

I will at least add a small section mentioning that INT96 is not a logical type but a physical one.

@zivanfi
Contributor

zivanfi commented Feb 6, 2017

Thanks, that will clarify it in my opinion.

@zivanfi
Contributor

zivanfi commented Feb 6, 2017

LGTM, thanks!

(first 8 bytes) and the Julian day (last 4 bytes). No timezone is attached to this value.
To convert the timestamp into nanoseconds since the Unix epoch, 00:00:00.000000
on 1 January 1970, the following formula can be used:
`(julian_day - 2440588) * (86400 * 1000 * 1000 * 1000) + nanoseconds`.
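This conversion can be sketched in Python (constant and function names are mine, not from the spec):

```python
JULIAN_DAY_OF_EPOCH = 2440588  # Julian day number of 1970-01-01
NANOS_PER_DAY = 86400 * 1000 * 1000 * 1000

def int96_to_unix_nanos(julian_day: int, nanos_in_day: int) -> int:
    """Apply the documented formula: nanoseconds since the Unix epoch."""
    return (julian_day - JULIAN_DAY_OF_EPOCH) * NANOS_PER_DAY + nanos_in_day
```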

The formula looks good to me. Technically, Julian days refer to the noon of a day, so everywhere else you'd find "2440587.5" instead. However, boost uses only the day part of that, so for Impala your formula looks correct.

We might consider changing it to seconds since the epoch, since that seems more commonly used: `(julian_day - 2440588) * 86400 + nanoseconds / (1000 * 1000 * 1000)`
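The seconds-based variant suggested here would look like this in Python (function name is mine; integer division truncates sub-second precision):

```python
def int96_to_unix_seconds(julian_day: int, nanoseconds: int) -> int:
    """Seconds since the Unix epoch, per the suggested alternative formula."""
    return (julian_day - 2440588) * 86400 + nanoseconds // (1000 * 1000 * 1000)
```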

@@ -144,6 +144,18 @@ example, there is no requirement that a large number of days should be
expressed as a mix of months and days because there is not a constant
conversion from days to months.

### INT96 timestamps (also called IMPALA_TIMESTAMP)

_(deprecated)_ Timestamps saved as an `int96` are made up of the nanoseconds in the day

+@zivanfi, who is working on making timestamps more uniform across several query engines.

Hive also writes Parquet files in this format, although Impala seems to have used it first. It would be good if we could include documentation for 12-byte binary timestamps in either #46 or here. Otherwise we'd be deprecating a widely used format with no clear alternative.


agreed, let's add a paragraph here about where it is used.
The deprecation JIRA should link to corresponding Impala and Hive JIRAs.
I think we're signing up for being able to read that type forever for compatibility with existing files, but provide a better alternative for writing timestamps.


Just a suggestion from a random passer-by... I think it might be nice to document the byte order of these timestamps as well, as they appear to be little-endian whereas other primitive ints in Parquet are big-endian. At least that seems to be the issue that's led us to finding this PR in the first place :)


@julienledem julienledem left a comment


LGTM


### INT96 timestamps (also called IMPALA_TIMESTAMP)

_(deprecated)_ Timestamps saved as an `int96` are made up of the nanoseconds in the day
(first 8 bytes) and the Julian day (last 4 bytes). No timezone is attached to this value.

I think we need to specify the signedness - the first 8 bytes are a uint64 and the last 4 bytes are an int32, I believe.


I remember reading somewhere that the signedness is explicitly specified as undefined, because neither the date nor the time component can be so large as to reach the most significant bit, but unfortunately I can't find where I read this. However, this can only be true if we don't allow dates before the epoch of Julian date, because that would make the date part negative.

@rdblue
Contributor

rdblue commented Feb 14, 2017

What is the purpose of adding this to the Parquet spec? I don't expect anyone to attempt to implement it, considering that Impala has been planning on moving to the int64 timestamp for years. Has that changed?

Adding a type to the spec carries with it some expectation that it will be supported by implementations, and I'd rather focus effort on moving to the date/time types already in the spec instead of legitimizing this.

@xhochy
Member Author

xhochy commented Feb 14, 2017

I've actually implemented the read path for this recently in parquet-cpp, and given that e.g. Spark doesn't read TIMESTAMP_MILLIS but rather throws typeNotImplemented (see https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L161), we might move on and also implement writing INT96 timestamps in parquet-cpp.

@julienledem
Member

@rdblue as I understand Hive is also using int96?
@xhochy: what's the reason for implementing the write path on the cpp side?

I think we should have JIRAs in each project that writes int96s (Impala, Hive) and have them move to TIMESTAMP_MILLIS. We can support reading int96 for backward compatibility and even possibly write them with a flag turned on if people really need this.

Impala/Hive people: I recall the int96 was because of nanosecond timestamps? Do we need a TIMESTAMP_NANOS logical type backed by a fixed_length_bytearray physical type?

@xhochy
Member Author

xhochy commented Feb 15, 2017

I would implement the write path only to support passing timestamps to systems that only support timestamps as INT96, as Spark does.

@lekv
Contributor

lekv commented Feb 15, 2017

I don't know the historical reasons why Impala implements timestamps the way it does, but we use boost internally to process them and from comments in the source it sounds like we went with that and just swapped the members (compared to boost) to save space. I think Hive followed Impala.

Right now we store timestamps in memory in the same format that we write to parquet. IMPALA-4825 will swap the fields to align the storage layout with Kudu. However there are currently no plans to switch Impala's internal representation, so we would need to add conversion code to and from a new Parquet format, which may impact performance.

Should we go ahead and open JIRAs for Impala and Hive to discuss the feasibility of that?

@rdblue
Contributor

rdblue commented Feb 16, 2017

The history of the int96 timestamp is that Impala had a 12-byte internal representation that was written directly to int96 because the field was 12 bytes and not used for anything else -- it didn't require discussion in the community like adding a logical type.

The type has nanosecond precision because boost did; it wasn't based on any use case, so we probably don't need a TIMESTAMP_NANOS type. It is also bad for users that this type persists, because it uses 4 extra bytes per value before encoding and can't take advantage of integer compression like delta encoding.

@xhochy, I think the solution is to add support for int64 timestamp-micros to Spark, not to add more support for this. I'll add support in Spark and can help with the other engines as well.

@julienledem
Member

@rdblue I agree that we want to deprecate int96 but it is useful to have written down somewhere what the existing files with int96 in them actually contain. We should add a comment that min/max stats are always invalid for int96.

@julienledem
Member

@rdblue unfortunately int96 has already leaked into other projects (Drill, Spark, Hive), so read support for this will have to stay.
I fully agree that we should mark it deprecated for write and use the preferred int64 timestamp-micros alternative. However, the existing format should be documented, including its shortcomings (deprecated, no sort order, etc.).
Possibly just add a more strongly worded comment at the beginning of the int96 description?

@rdblue
Contributor

rdblue commented Jun 8, 2017

I'd rather not do anything to encourage further use of this convention, given how many bugs and problems there have been with it. Databases that have implementations won't benefit, and it would only allow more people to implement it. That said, I won't block it if this is something that everyone else thinks is reasonable.

lekv pushed a commit to lekv/parquet-format that referenced this pull request Jul 31, 2017
@shevek

shevek commented Mar 4, 2018

A note to anyone else who has just lost a working day dealing with the edicts of dead popes and eventually discovers this ticket: The data format is LITTLE ENDIAN, at least in the samples which have been brightening my day. You decode it using JodaTime and Guava thusly:

```java
// Requires Guava (Longs, Ints), Joda-Time (DateTimeUtils), and java.util.concurrent.TimeUnit.
long nanos = Longs.fromBytes(data[7], data[6], data[5], data[4],
                             data[3], data[2], data[1], data[0]);
long julianday = Ints.fromBytes(data[11], data[10], data[9], data[8]);
long epochmillis = DateTimeUtils.fromJulianDay(julianday) + TimeUnit.NANOSECONDS.toMillis(nanos);
```

Given that I'm fairly sure all these machines are Intel, this may be a hardware-dependent statement. I wish you luck.
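For readers not on the JVM, the same little-endian decode can be sketched with Python's standard library (the helper name is mine; this assumes the 12-byte layout described in this thread):

```python
import struct

def decode_int96_millis(data: bytes) -> int:
    """Decode 12 INT96 bytes (little-endian) into milliseconds since the Unix epoch."""
    nanos, julian_day = struct.unpack('<qi', data)  # int64 nanos-of-day, int32 Julian day
    return (julian_day - 2440588) * 86_400_000 + nanos // 1_000_000
```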

@zivanfi
Contributor

zivanfi commented Mar 5, 2018

It's even more complicated than that. INT96 was originally not supposed to store timestamps but int96 numbers, hence the name. When storing int96 numbers, the data must be written using little-endian byte order, but the corresponding statistics are handled by Thrift, which uses big-endian byte order. So the values in the data and in the statistics use different byte orders.

The bytes of a timestamp are stored in the reverse order, so the data becomes big-endian and the statistics become little-endian. However, if a read path already handles the endianness of int96 values as if they were numbers, your code will get the bytes as if they were a number. Because the timestamp byte order is the reverse of the number byte order, you will need to handle those bytes as if they were little-endian, since that reverses them back.

Long story short: This is a nightmare.

@wierzba3

wierzba3 commented Feb 12, 2019

> (quoting @zivanfi's byte-order explanation above)

@zivanfi -- Do you know how I can go about writing this INT96 timestamp type to a parquet file? I cannot find any specification that explains exactly how to encode the value.

Here is a SO question I just posted that provides much more detail...
https://stackoverflow.com/questions/54657496/how-to-write-timestamp-logical-type-int96-to-parquet-using-parquetwriter?noredirect=1#comment96106843_54657496

(I am well aware of the negative sentiment towards this encoding type, but unfortunately I must use this encoding because it is required by our system.)
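Not an endorsement of the format, but for anyone in the same situation, here is a minimal encode sketch of the layout described in this thread (little-endian int64 nanoseconds-of-day followed by little-endian int32 Julian day; the helper name is mine):

```python
import struct

JULIAN_DAY_OF_EPOCH = 2440588        # Julian day number of 1970-01-01
NANOS_PER_DAY = 86_400_000_000_000   # 86400 * 10**9

def encode_int96(unix_nanos: int) -> bytes:
    """Pack nanoseconds since the Unix epoch into the 12-byte INT96 layout."""
    julian_day, nanos_in_day = divmod(unix_nanos, NANOS_PER_DAY)
    return struct.pack('<qi', nanos_in_day, julian_day + JULIAN_DAY_OF_EPOCH)
```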

@zivanfi
Contributor

zivanfi commented Feb 13, 2019

@wierzba3 I answered your question on SO.

@rdblue
Contributor

rdblue commented Feb 13, 2019

@zivanfi, @wierzba3, I noted this on the SO issue: the reason I'm reluctant to add this documentation is that people will use it to write INT96 timestamps. Those are broken and should not be used. It should remain difficult to write them, to discourage it.

Parquet has documented and supported timestamps you should use instead.

@shevek

shevek commented Feb 13, 2019

Marking it as deprecated is all very well, but those of us who have to read legacy data written by older systems STILL NEED DOCUMENTATION.

@rdblue
Contributor

rdblue commented Feb 13, 2019

@shevek, the question was about how to write data into this unsupported format, not read it.

@shevek

shevek commented Feb 24, 2019

@rdblue The title of the ticket is "Document INT96 timestamps", and the ticket has been closed without such documentation. In fact, the argument has been that it should not BE documented, and that's totally unhelpful to me. What would happen if I were to raise another ticket titled "Please document int96 timestamps"? Because I still have to read legacy data written in that format.

It seems fairly evident that, the deprecation argument aside, this PR should be merged.

@timarmstrong
Contributor

It's ultimately up to Parquet committers to decide what's best for the project and community.

I think it would be helpful to have this documented, but we can get creative with how we might discourage people from using it. E.g. put the documentation in a separate file BrokenLogicalTypes.md, add more aggressive disclaimers "Writing this timestamp will anger the Parquet gods and guarantee all of your timestamps are off by many hours", "WARNING: This logical type contains chemicals known to the State of California to cause cancer and birth defects or other reproductive harm.", or whatever.

@rdblue
Contributor

rdblue commented Feb 24, 2019

To me, the question is: are we better off as a community with documentation for this, or without it?

I think we are better off without it.

As we can see from James' request, people are interested in building new code to write these values. Given that there are so many problems with them, the intent must be to work with other systems that already use INT96 timestamps but don't work with the supported timestamp types. Expanding the set of systems that use INT96 timestamps -- read or write -- instead of adding support for the recommended types to the systems that currently use INT96 only perpetuates the use of the bad type. That's bad for the community and bad for end users.

According to CDH 6.1 docs, Impala is still broken and cannot use the supported timestamp types:

Parquet consideration: int96 encoded Parquet timestamps are supported in Impala. int64 timestamps will be supported in a future release.

Considering the amount of pain and effort people spend to make INT96 timestamps work, I think time would be better spent contributing a fix there, so everyone can finally move off of INT96 timestamps.

@zivanfi
Contributor

zivanfi commented Feb 25, 2019

@rdblue We have already added INT64 timestamp read support to Impala and write support is under review as well:

But even with these new developments, we still can't write interoperable INT64 timestamps until parquet-mr 1.11.0 gets released, because Java clients can't read back non-UTC-normalized timestamps (nor nanosecond-precision ones) using older parquet-mr versions. We already have changes in review for this functionality as well, though:

@timarmstrong
Contributor

Yeah, Impala has been moving on it - definitely the way it was done initially created a mess - but unfortunately it will take a while to completely move all users off int96 Impala timestamps, given upgrade cycles, policies about backwards-incompatible changes, etc.

HyukjinKwon pushed a commit to apache/spark that referenced this pull request Oct 20, 2020
### What changes were proposed in this pull request?
1. Add the SQL config `spark.sql.legacy.parquet.int96RebaseModeInWrite` to control timestamp rebasing when saving timestamps as INT96. It supports the same set of values as `spark.sql.legacy.parquet.datetimeRebaseModeInWrite`, but the default value is `LEGACY` to preserve backward compatibility with Spark <= 3.0.
2. Write the metadata key `org.apache.spark.int96NoRebase` to parquet files if the files are saved with `spark.sql.legacy.parquet.int96RebaseModeInWrite` set to something other than `LEGACY`.
3. Add the SQL config `spark.sql.legacy.parquet.int96RebaseModeInRead` to control loading INT96 timestamps when parquet metadata doesn't have enough info (the `org.apache.spark.int96NoRebase` tag) about the parquet writer - whether the INT96 was written by a Proleptic Gregorian system or a Julian one.
4. Modified the vectorized and parquet-mr readers to support loading/saving INT96 timestamps with or without rebasing, depending on the SQL config and the metadata tag:
    - **No rebasing** in testing when the SQL config `spark.test.forceNoRebase` is set to `true`
    - **No rebasing** if parquet metadata contains the tag `org.apache.spark.int96NoRebase`. This is the case when parquet files are saved by Spark >= 3.1 with `spark.sql.legacy.parquet.int96RebaseModeInWrite` set to `CORRECTED`, or saved by other systems with the tag `org.apache.spark.int96NoRebase`.
    - **With rebasing** if parquet files were saved by Spark (any version) without the metadata tag `org.apache.spark.int96NoRebase`.
    - Rebasing depends on the SQL config `spark.sql.legacy.parquet.int96RebaseModeInRead` if there are no metadata tags `org.apache.spark.version` and `org.apache.spark.int96NoRebase`.

New SQL configs are added instead of re-using the existing `spark.sql.legacy.parquet.datetimeRebaseModeInWrite` and `spark.sql.legacy.parquet.datetimeRebaseModeInRead`:
- To allow users to have different modes for INT96 and for TIMESTAMP_MICROS (MILLIS). For example, users might want to save INT96 as LEGACY but TIMESTAMP_MICROS as CORRECTED.
- To have different modes for INT96 and DATE in load (or in save).
- To be backward compatible with Spark 2.4. For now, `spark.sql.legacy.parquet.datetimeRebaseModeInWrite/Read` are set to `EXCEPTION` by default.

### Why are the changes needed?
1. The Parquet spec says that INT96 must be stored as Julian days (see apache/parquet-format#49). This doesn't mean that a reader (or a writer) is based on the Julian calendar, so rebasing from Proleptic Gregorian to the Julian calendar may not be needed.
2. Rebasing from/to the Julian calendar can lose information, because dates in one calendar don't exist in the other. For example, 1582-10-04..1582-10-15 exist in the Proleptic Gregorian calendar but not in the hybrid calendar (Julian + Gregorian), and vice versa, the Julian date 1000-02-29 doesn't exist in the Proleptic Gregorian calendar. We should allow users to save timestamps without losing such dates (rebasing shifts such dates to the next valid date).
3. It would also make Spark compatible with other systems such as Impala and newer versions of Hive that write Proleptic Gregorian based INT96 timestamps.

### Does this PR introduce _any_ user-facing change?
It can, when `spark.sql.legacy.parquet.int96RebaseModeInWrite` is set to a non-default value.

### How was this patch tested?
- Added a test to check the metadata key `org.apache.spark.int96NoRebase`
- By `ParquetIOSuite`

Closes #30056 from MaxGekk/parquet-rebase-int96.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>