fix: Compute murmur3 hash with dictionary input correctly #433
Conversation
hash_array_boolean!(BooleanArray, col, i32, hashes_buffer);
I am wondering why this is a macro. It looks like this is the only use case?
Two reasons:
- it could be reused for xxhash64 as well, which I am currently working on in feat: Add xxhash64 function support #424
- mainly a style consideration: to stay consistent with the other types in this function, which are all handled via macros.
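The reuse argument can be made concrete by parameterizing the macro over the hash function itself. A hypothetical sketch below shows the idea; the macro name, the two stand-in hash functions, and their bodies are all illustrative, not the actual Comet code (which hardcodes spark_compatible_murmur3_hash in hash_array_boolean!):

```rust
// Hypothetical sketch: a macro parameterized by the hash function, so one
// driver body could serve both murmur3 and a future xxhash64 path.
// The hash functions here are toy stand-ins, NOT Spark-compatible hashes.
macro_rules! hash_array_with {
    ($hash_fn:path, $values:expr, $hashes:expr) => {
        for (i, hash) in $hashes.iter_mut().enumerate() {
            *hash = $hash_fn(&$values[i].to_le_bytes(), *hash);
        }
    };
}

// Stand-in hash function #1 (illustrative only).
fn fake_murmur3(bytes: &[u8], seed: u32) -> u32 {
    bytes
        .iter()
        .fold(seed, |h, b| h.wrapping_mul(31).wrapping_add(*b as u32))
}

// Stand-in hash function #2 with the same signature (illustrative only).
fn fake_xxhash64(bytes: &[u8], seed: u32) -> u32 {
    bytes.iter().fold(seed, |h, b| h.rotate_left(5) ^ (*b as u32))
}

fn main() {
    let values: Vec<i32> = vec![1, 2, 3];

    // Same driver macro, two different hash functions.
    let mut hashes_a = vec![42u32; values.len()];
    hash_array_with!(fake_murmur3, values, hashes_a);

    let mut hashes_b = vec![42u32; values.len()];
    hash_array_with!(fake_xxhash64, values, hashes_b);

    // Different hash functions produce different hashes for the same input.
    assert_ne!(hashes_a, hashes_b);
}
```

This keeps the per-type iteration and null handling in one place while letting each hash function (murmur3 now, xxhash64 later) plug in via a single path argument.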
spark/src/main/scala/org/apache/comet/expressions/CometCast.scala
macro_rules! hash_array_boolean {
    ($array_type: ident, $column: ident, $hash_input_type: ident, $hashes: ident) => {
        let array = $column.as_any().downcast_ref::<$array_type>().unwrap();
        if array.null_count() == 0 {
            for (i, hash) in $hashes.iter_mut().enumerate() {
                *hash = spark_compatible_murmur3_hash(
                    $hash_input_type::from(array.value(i)).to_le_bytes(),
                    *hash,
                );
            }
        } else {
            for (i, hash) in $hashes.iter_mut().enumerate() {
                if !array.is_null(i) {
                    *hash = spark_compatible_murmur3_hash(
                        $hash_input_type::from(array.value(i)).to_le_bytes(),
                        *hash,
                    );
                }
            }
        }
    };
}
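The bug this PR fixes is that, for dictionary-encoded input, the hash must be computed over the resolved values, not over the dictionary keys. A minimal stdlib-only sketch of that idea follows; the real code uses arrow-rs DictionaryArray and Comet's spark_compatible_murmur3_hash, so the struct and helper names here are hypothetical:

```rust
// Sketch of Spark-compatible Murmur3 (x86_32) over one 4-byte little-endian
// int: single mixed block followed by the standard finalizer.
fn murmur3_hash_i32(value: i32, seed: u32) -> u32 {
    let mut k = (value as u32).wrapping_mul(0xcc9e2d51);
    k = k.rotate_left(15).wrapping_mul(0x1b873593);
    let mut h = seed ^ k;
    h = h.rotate_left(13).wrapping_mul(5).wrapping_add(0xe6546b64);
    h ^= 4; // total input length in bytes
    h ^= h >> 16;
    h = h.wrapping_mul(0x85ebca6b);
    h ^= h >> 13;
    h = h.wrapping_mul(0xc2b2ae35);
    h ^ (h >> 16)
}

// A dictionary-encoded column modeled as keys into a list of unique values
// (stand-in for arrow-rs DictionaryArray).
struct DictColumn {
    keys: Vec<usize>,
    values: Vec<i32>,
}

// Correct handling: resolve each key to its dictionary value, then hash the
// value. Hashing the raw key would give wrong (encoding-dependent) results.
fn hash_dict(col: &DictColumn, seed: u32) -> Vec<u32> {
    col.keys
        .iter()
        .map(|&k| murmur3_hash_i32(col.values[k], seed))
        .collect()
}

fn main() {
    let dict = DictColumn {
        keys: vec![0, 1, 0, 2],
        values: vec![10, 20, 30],
    };
    // The dictionary column must hash identically to its flattened form.
    let flat: Vec<i32> = dict.keys.iter().map(|&k| dict.values[k]).collect();
    let expected: Vec<u32> = flat.iter().map(|&v| murmur3_hash_i32(v, 42)).collect();
    assert_eq!(hash_dict(&dict, 42), expected);
}
```

The invariant being tested is exactly the one the randomized Spark-vs-Comet test below checks end to end: dictionary encoding is a storage detail and must not change the hash.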
Is this pulled out as a macro because you will use a different hash function than spark_compatible_murmur3_hash later?
Yeah. It could be reused to support the xxhash64 function.
Force-pushed from cadb5be to a417a20
test("hash functions with random input") {
  val dataGen = DataGenerator.DEFAULT
  // sufficient number of rows to create dictionary encoded ArrowArray.
  val randomNumRows = 1000
A note here:
I'm not 100% sure how we can trigger a dictionary array on the native side from Spark.
When the row count is small, such as 100 or 200, no dictionary array shows up on the native side, even though the Parquet file should be written with all columns dictionary encoded.
I tweaked it a bit and settled on 1000, which does trigger a dictionary-encoded ArrowArray on the Rust side.
Potentially we can add repeated values to force dictionary encoding, e.g. randomly generate 100 rows and repeat them 10 times to make 1000 rows.
E.g. randomly generate 100 rows and repeat 10 times to make 1000 rows
So dictionary encoding is only triggered with enough repetition?
Yes. makeParquetFileAllTypes or some of the existing dictionary-related tests may be helpful.
The Parquet file writer will automatically generate a dictionary if the cardinality is low (i.e., there is a small number of unique values).
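The cardinality point can be illustrated with a toy dictionary builder: repeating 100 generated rows 10 times yields 1000 rows but only 100 dictionary entries. This is a hypothetical stdlib-only sketch of the idea, not the Parquet writer's actual encoding heuristics:

```rust
use std::collections::HashMap;

// Toy dictionary encoder: maps each row to an index into a list of unique
// values, returning (keys, dictionary). Illustrative only.
fn dict_encode(rows: &[i32]) -> (Vec<usize>, Vec<i32>) {
    let mut index: HashMap<i32, usize> = HashMap::new();
    let mut values: Vec<i32> = Vec::new();
    let keys: Vec<usize> = rows
        .iter()
        .map(|&v| {
            *index.entry(v).or_insert_with(|| {
                values.push(v);
                values.len() - 1
            })
        })
        .collect();
    (keys, values)
}

fn main() {
    // 100 distinct base rows repeated 10 times -> 1000 rows total.
    let base: Vec<i32> = (0..100).map(|i| i * 7919).collect();
    let rows: Vec<i32> = base.iter().cycle().take(1000).cloned().collect();

    let (keys, values) = dict_encode(&rows);
    assert_eq!(keys.len(), 1000);
    // Low cardinality keeps the dictionary small: 100 entries for 1000 rows,
    // which is what makes dictionary encoding attractive to the writer.
    assert_eq!(values.len(), 100);
}
```

With enough repetition the dictionary stays far smaller than the row count, which is the condition under which the writer keeps columns dictionary encoded and the native side can receive a dictionary-encoded ArrowArray.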
|insert into $table values
|('Spark SQL ', 10, 1.2), (NULL, NULL, NULL), ('', 0, 0.0), ('苹果手机', NULL, 3.999999)
|, ('Spark SQL ', 10, 1.2), (NULL, NULL, NULL), ('', 0, 0.0), ('苹果手机', NULL, 3.999999)
|""".stripMargin)
Did you insert extra space characters?
Oops, this was by accident. Let me try again and revert it.
Did another check: the current version now has 4-space indentation, which should be correct.
I think it was wrong in a previous commit and can be fixed in this PR.
spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala
@viirya @kazuyukitanimura @sunchao PTAL when you have time.
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Gently ping @viirya @sunchao and @andygrove
LGTM
LGTM. Thank you @advancedxy
* fix: Handle compute murmur3 hash with dictionary input correctly
* add unit tests
* spotless apply
* apply scala fix
* address comment
* another style issue
* Update spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala

Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit 93af704)
Which issue does this PR close?
Closes #427
Rationale for this change
Bug fix: while working on #424, we found a bug in spark_hash — it doesn't handle dictionary arrays correctly.
This PR fixes that first.
What changes are included in this PR?
This PR currently depends on #426 and will be rebased once that's merged.
How are these changes tested?
Updated the test with randomized input.