
Fix hash calculation for Timestamp in HiveBucketing to be Hive Compatible #22980

Merged
merged 1 commit on Jun 14, 2024

Conversation

Collaborator

@kewang1024 kewang1024 commented Jun 11, 2024

Presto Java calculates the hash for the Timestamp type using the formula:
hash(seconds << 30 + milliseconds)
Hive, Spark, and Velox all use:
hash(seconds << 30 + nanoseconds)

To keep behavior consistent with the other engines, fix the hash calculation formula for Timestamp.
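The difference between the two formulas can be sketched as below. Class and method names are illustrative, not the actual Presto or Hive identifiers, and the final 32-bit fold is an assumption modeled on Java's `Long.hashCode`; the point is only that packing milliseconds instead of nanoseconds into the low 30 bits yields different bucket hashes for the same instant.

```java
// Illustrative sketch of the two timestamp hash variants discussed in this PR.
// Not the real Presto/Hive code; names and the 32-bit fold are assumptions.
public class TimestampBucketHash {

    // Hive/Spark/Velox-style: pack epoch seconds and the nanosecond
    // fractional part, then fold the long down to 32 bits.
    static int hiveCompatibleHash(long epochSeconds, int nanos) {
        long bits = (epochSeconds << 30) + nanos;
        return (int) (bits ^ (bits >>> 32)); // same fold as Long.hashCode
    }

    // Legacy Presto Java behavior: milliseconds instead of nanoseconds.
    static int legacyHash(long epochSeconds, int millis) {
        long bits = (epochSeconds << 30) + millis;
        return (int) (bits ^ (bits >>> 32));
    }

    public static void main(String[] args) {
        long seconds = 10L;              // 1970-01-01T00:00:10 UTC
        int millis = 10;                 // 10 ms fractional part
        int nanos = millis * 1_000_000;  // same instant, in nanoseconds

        // The two formulas agree only when the fractional part is zero;
        // with any sub-second component, the packed bits (and hashes) differ.
        System.out.println(hiveCompatibleHash(seconds, nanos));
        System.out.println(legacyHash(seconds, millis));
    }
}
```

This is why rows with non-zero sub-second components land in different buckets under the old formula than Hive would place them, silently breaking bucket pruning and bucketed joins across engines.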

== RELEASE NOTES ==

Hive Connector Changes
* Fix hash calculation for the Timestamp column to be Hive compatible when writing to a table bucketed by Timestamp. :pr:`22890`
* Add configuration property ``hive.legacy-timestamp-bucketing`` and session property ``hive.legacy_timestamp_bucketing`` to use the original hash function for the Timestamp column, which is not Hive compatible. :pr:`22890`

Contributor

@rschlussel rschlussel left a comment

This is a great find! Can you add a release note for this bug fix?

This could also cause problems with incompatibility between old partitions and new partitions using different bucketing algorithms, and possibly wrong results when reading old partitions: filtering out the wrong buckets, joining the wrong buckets, etc. I wonder if we need to add a configuration property for Hive (e.g. called something like legacy-timestamp-bucketing) to help with the migration here.

@@ -109,6 +109,10 @@ public void testHashingCompare()
assertBucketEquals("timestamp", new Timestamp(1000 * LocalDateTime.of(1969, 12, 31, 23, 59, 59, 999_000_000).toEpochSecond(ZoneOffset.UTC)));
assertBucketEquals("timestamp", new Timestamp(1000 * LocalDateTime.of(1950, 11, 19, 12, 34, 56, 789_000_000).toEpochSecond(ZoneOffset.UTC)));
assertBucketEquals("timestamp", new Timestamp(1000 * LocalDateTime.of(2015, 11, 19, 7, 6, 5, 432_000_000).toEpochSecond(ZoneOffset.UTC)));
assertBucketEquals("timestamp", new Timestamp(10 + 1000 * LocalDateTime.of(1970, 1, 1, 0, 0, 0, 0).toEpochSecond(ZoneOffset.UTC)));
Contributor

Can you split this into a separate test (e.g. testTimestampBucketWithNanoseconds)

Collaborator Author

Since all the data types are tested within testHashingCompare, I would keep them in the same place: testing milliseconds isn't really an edge case, the original test simply failed to include those values. I added a comment to make it more straightforward.

Let me know if you feel strongly against it.

Contributor

My motivation is basically that when you have a zillion test cases in a single test, it gets annoying to debug when it fails, and you lose some signal: it fails on case x, so the next case doesn't run; then you fix x and rerun the test only to discover there was also an issue with y. Ideally the original test would be split up too, and since you're adding new tests, it's better at least not to exacerbate the problem.

@kewang1024 kewang1024 changed the title Fix hash calculation for Timestamp in HiveBucketing Fix hash calculation for Timestamp in HiveBucketing to be Hive Compatible Jun 11, 2024
@kewang1024 kewang1024 force-pushed the fix-timestamp branch 2 times, most recently from 1e35a2d to 246f01d Compare June 11, 2024 21:43
sdruzkin
sdruzkin previously approved these changes Jun 11, 2024
Collaborator

@sdruzkin sdruzkin left a comment

ORC stuff LGTM

jaystarshot
jaystarshot previously approved these changes Jun 12, 2024
@kewang1024
Collaborator Author

Other PRs are also failing on the SingleStore tests, so it's unrelated to this change, @rschlussel

Contributor

@ClarenceThreepwood ClarenceThreepwood left a comment

This is a pretty insidious bug. Thanks for fixing!
nit - commit message is too long

Contributor

@rschlussel rschlussel left a comment

Looks good. Thanks for the fix!

rschlussel
rschlussel previously approved these changes Jun 14, 2024
@steveburnett
Contributor

Suggest minor changes to release note entry

== RELEASE NOTES ==

Hive Connector Changes
* Fix hash calculation for the Timestamp column to be Hive compatible when writing to a table bucketed by Timestamp. :pr:`22890`
* Add configuration property ``hive.legacy-timestamp-bucketing`` and session property ``hive.legacy_timestamp_bucketing`` to use the original hash function for the Timestamp column, which is not Hive compatible. :pr:`22890`

@kewang1024 kewang1024 merged commit fd8d572 into prestodb:master Jun 14, 2024
55 of 56 checks passed
@tdcmeehan tdcmeehan mentioned this pull request Aug 23, 2024
6 participants