Fix hash calculation for Timestamp in HiveBucketing to be Hive Compatible #22980
Conversation
This is a great find! Can you add a release note for this bug fix?
This could also cause incompatibility problems between old partitions and new partitions bucketed with different algorithms, and possibly wrong results when reading old partitions: filtering out the wrong buckets, joining the wrong buckets, etc. I wonder if we need to add a Hive configuration property (e.g. something like legacy-timestamp-bucketing) to help with the migration here.
@@ -109,6 +109,10 @@ public void testHashingCompare()
assertBucketEquals("timestamp", new Timestamp(1000 * LocalDateTime.of(1969, 12, 31, 23, 59, 59, 999_000_000).toEpochSecond(ZoneOffset.UTC)));
assertBucketEquals("timestamp", new Timestamp(1000 * LocalDateTime.of(1950, 11, 19, 12, 34, 56, 789_000_000).toEpochSecond(ZoneOffset.UTC)));
assertBucketEquals("timestamp", new Timestamp(1000 * LocalDateTime.of(2015, 11, 19, 7, 6, 5, 432_000_000).toEpochSecond(ZoneOffset.UTC)));
assertBucketEquals("timestamp", new Timestamp(10 + 1000 * LocalDateTime.of(1970, 1, 1, 0, 0, 0, 0).toEpochSecond(ZoneOffset.UTC)));
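For context, the added case constructs a timestamp 10 ms after the Unix epoch. `java.sql.Timestamp` keeps the sub-second part of the constructor's millisecond argument in its nanoseconds field, which is exactly the component the two hash formulas treat differently. A minimal sketch (class name is illustrative, not from the PR):

```java
import java.sql.Timestamp;
import java.time.LocalDateTime;
import java.time.ZoneOffset;

public class TimestampNanosDemo {
    public static void main(String[] args) {
        long epochSeconds = LocalDateTime.of(1970, 1, 1, 0, 0, 0, 0)
                .toEpochSecond(ZoneOffset.UTC); // 0

        // Same expression as the new test case: epoch + 10 ms
        Timestamp ts = new Timestamp(10 + 1000 * epochSeconds);

        System.out.println(ts.getTime());  // 10 (total milliseconds)
        System.out.println(ts.getNanos()); // 10000000 (the 10 ms, as nanoseconds)
    }
}
```

So a test built only from whole seconds (as the original cases were) never exercises the nanoseconds field, which is why the bug went unnoticed.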
Can you split this into a separate test (e.g. testTimestampBucketWithNanoseconds)?
Since all the data types are tested within testHashingCompare, I would keep them in the same place. Testing milliseconds doesn't sound like an edge case; the original test simply failed to include those values. I added a comment to make it more straightforward.
Let me know if you feel strongly against it.
My motivation is basically that when you have a zillion test cases in a single test, it gets annoying to debug when it fails, and you lose some signal: it fails on case x, so the next case doesn't run; then you fix x, run the test again, and discover there was also an issue with y. I think ideally the original test would be split up too, and since you're adding new tests, it's better at least not to exacerbate the problem.
ORC stuff LGTM
Other PRs are also failing on singlestore tests, so it's unrelated to this change, @rschlussel
This is a pretty insidious bug. Thanks for fixing!
nit - commit message is too long
looks good. thanks for the fix!
Suggest minor changes to release note entry
Presto Java calculates the hash for the Timestamp type using the formula:
hash(seconds << 30 + milliseconds)
Hive:
hash(seconds << 30 + nanoseconds)
Spark:
hash(seconds << 30 + nanoseconds)
Velox:
hash(seconds << 30 + nanoseconds)
To keep behavior consistent with the other engines, this change fixes the hash calculation formula for Timestamp to use nanoseconds.
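A rough sketch of the divergence described above. This is not copied from the Presto or Hive sources; the class, the method names, and the final xor-fold mixing step are assumptions used only to show how the two formulas can place the same instant in different buckets:

```java
public class TimestampBucketHashSketch {
    // Old Presto Java behavior (per this PR): fold in the millisecond remainder.
    static int hashWithMillis(long epochSeconds, int millis) {
        long v = (epochSeconds << 30) + millis;
        return (int) (v ^ (v >>> 32)); // illustrative long-to-int fold
    }

    // Hive/Spark/Velox-compatible behavior: fold in nanoseconds instead.
    static int hashWithNanos(long epochSeconds, int nanos) {
        long v = (epochSeconds << 30) + nanos;
        return (int) (v ^ (v >>> 32)); // illustrative long-to-int fold
    }

    public static void main(String[] args) {
        // Epoch + 10 ms: the same instant hashes differently under the two
        // formulas, so the row can land in a bucket Hive does not expect.
        System.out.println(hashWithMillis(0L, 10));         // 10 ms as millis
        System.out.println(hashWithNanos(0L, 10_000_000));  // 10 ms as nanos
    }
}
```

Any difference in the hash feeds directly into `hash % bucketCount`, which is why mixed old/new partitions (as noted earlier in the review) can produce wrong results for bucket pruning and bucketed joins.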