
feat: Add xxhash64 function support #424

Merged
merged 4 commits into apache:main on Jun 3, 2024

Conversation

advancedxy
Contributor

Which issue does this PR close?

Part of #205
Closes #344

Rationale for this change

More function coverage

What changes are included in this PR?

  1. include twox-hash as a dep in rust
  2. implement xxhash64 related method in rust side
  3. glue code to bridge the jvm and rust
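For reference, the XXH64 algorithm that twox-hash implements can be sketched in a few dozen lines. The sketch below is illustrative only, not the PR's code: it handles inputs shorter than 32 bytes and omits the striped accumulator path used for longer inputs. Note that Spark's xxhash64 uses a seed of 42 by default and hashes values in their internal representation, which is why bit-exact compatibility testing matters.

```rust
use std::convert::TryInto;

// XXH64 prime constants from the xxHash specification.
const P1: u64 = 0x9E3779B185EBCA87;
const P2: u64 = 0xC2B2AE3D27D4EB4F;
const P3: u64 = 0x165667B19E3779F9;
const P4: u64 = 0x85EBCA77C2B2AE63;
const P5: u64 = 0x27D4EB2F165667C5;

/// Minimal XXH64 sketch for inputs shorter than 32 bytes; the striped
/// accumulator path for longer inputs is omitted. The PR itself delegates
/// to the twox-hash crate rather than hand-rolling this.
fn xxhash64_short(data: &[u8], seed: u64) -> u64 {
    assert!(data.len() < 32, "long-input path not implemented in this sketch");
    let mut h = seed.wrapping_add(P5).wrapping_add(data.len() as u64);
    // Consume 8-byte lanes.
    let mut chunks = data.chunks_exact(8);
    for c in chunks.by_ref() {
        let k = u64::from_le_bytes(c.try_into().unwrap())
            .wrapping_mul(P2)
            .rotate_left(31)
            .wrapping_mul(P1);
        h ^= k;
        h = h.rotate_left(27).wrapping_mul(P1).wrapping_add(P4);
    }
    // Consume one 4-byte lane if present.
    let mut rest = chunks.remainder();
    if rest.len() >= 4 {
        let k = u32::from_le_bytes(rest[..4].try_into().unwrap()) as u64;
        h ^= k.wrapping_mul(P1);
        h = h.rotate_left(23).wrapping_mul(P2).wrapping_add(P3);
        rest = &rest[4..];
    }
    // Trailing bytes.
    for &b in rest {
        h ^= (b as u64).wrapping_mul(P5);
        h = h.rotate_left(11).wrapping_mul(P1);
    }
    // Final avalanche.
    h ^= h >> 33;
    h = h.wrapping_mul(P2);
    h ^= h >> 29;
    h = h.wrapping_mul(P3);
    h ^= h >> 32;
    h
}

fn main() {
    // Reference vector: XXH64 of the empty input with seed 0.
    assert_eq!(xxhash64_short(b"", 0), 0xEF46DB3751D8E999);
    // Different seeds must produce different hashes for the same input.
    assert_ne!(xxhash64_short(b"abc", 0), xxhash64_short(b"abc", 42));
    println!("ok");
}
```

In practice the crate does the hashing; the Rust side's job is mapping each Arrow column's values to the byte representation Spark hashes, so the two engines agree bit-for-bit.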

How are these changes tested?

Newly added tests.

@andygrove
Member

Thanks @advancedxy. I plan on reviewing this PR today.

Could you also update docs/source/user-guide/expressions.md to add xxhash64 as a supported expression?

@advancedxy
Contributor Author

Could you also update docs/source/user-guide/expressions.md to add xxhash64 as a supported expression?

Of course. I will update that in a follow-up commit, along with other things such as the review comments and the inspection file spark/inspections/CometTPCDSQueriesList-results.txt.

@andygrove
Member

I'd like to see the tests use some randomly generated inputs.

As a quick hack, I added the following test to CometCastSuite and it shows some differences in results between Spark and Comet.

  test("xxhash64") {
    val input = generateStrings(timestampPattern, 8).toDF("a")
    withTempPath { dir =>
      val data = roundtripParquet(input, dir).coalesce(1)
      data.createOrReplaceTempView("t")
      val df = spark.sql(s"select a, xxhash64(a) from t order by a")
      checkSparkAnswerAndOperator(df)
    }
  }

Some differences:

!== Correct Answer - 1000 ==           == Spark Answer - 1000 ==
 struct<a:string,xxhash64(a):bigint>   struct<a:string,xxhash64(a):bigint>
![,-7444071767201028348]               [,-1205034819632174695]
![	 23,-1992628079781282865]           [	 23,4312780814362028915]
![	31T3,5857608402363468958]           [	31T3,6089516869931970265]

We could extract the generate* methods from CometCastSuite into a separate DataGenerator class that other test suites can leverage.
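A standalone sketch of the idea behind such a data generator: sample characters from a pattern to build strings of bounded length. The names, the tiny hand-rolled LCG, and the sampling behavior below are assumptions for illustration, not the suite's actual code.

```rust
// A small deterministic LCG stands in for a real RNG so the sketch has
// no external dependencies. Constants are from Knuth's MMIX generator.
struct Lcg(u64);

impl Lcg {
    fn next(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0
    }
}

/// Generate `n` strings of length up to `max_len`, drawing characters
/// from `pattern` (illustrative analogue of CometCastSuite's generateStrings).
fn generate_strings(pattern: &str, max_len: usize, n: usize, seed: u64) -> Vec<String> {
    let chars: Vec<char> = pattern.chars().collect();
    let mut rng = Lcg(seed);
    (0..n)
        .map(|_| {
            let len = (rng.next() as usize) % (max_len + 1);
            (0..len)
                .map(|_| chars[(rng.next() as usize) % chars.len()])
                .collect()
        })
        .collect()
}

fn main() {
    // Timestamp-like pattern: digits plus separators, as in the test above.
    let strings = generate_strings("0123456789T:-. \t", 8, 5, 42);
    assert_eq!(strings.len(), 5);
    assert!(strings.iter().all(|s| s.chars().count() <= 8));
    println!("{:?}", strings);
}
```

Seeding the generator makes failures reproducible, which is why a shared DataGenerator class is more useful than ad-hoc random data in each suite.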

@andygrove
Member

Our hash implementation is also not compatible with Spark. I will file an issue for that.

@advancedxy
Contributor Author

I'd like to see the tests use some randomly generated inputs. […] We could extract the generate* methods from CometCastSuite into a separate DataGenerator class that other test suites can leverage.

Good catch, and a good way to make sure the impl is correct. Let me check why the test is failing first.

@advancedxy
Contributor Author

Let me check why the test is failing first.

Found the issue. The create_hashes_dictionary doesn't handle the input hashes correctly; it affects both murmur3hash and this new xxhash64 method.

Let me try to fix that first.
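For context on this class of bug, here is a conceptual sketch, not Comet's actual create_hashes_dictionary code. A dictionary-encoded column stores a small array of distinct values plus per-row keys, and when hash columns are chained, each row carries its own seed (the hash produced by the previous column), so the dictionary values cannot simply be hashed once with a single seed.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-in hash function; the PR uses Spark-compatible xxhash64 instead.
fn hash_one(v: &str, seed: u64) -> u64 {
    let mut h = DefaultHasher::new();
    seed.hash(&mut h);
    v.hash(&mut h);
    h.finish()
}

/// Hash a dictionary-encoded column: `values` holds the distinct strings,
/// `keys[i]` says which value row i holds, and `seeds[i]` is row i's
/// incoming hash from the previous column.
fn hash_dictionary(values: &[&str], keys: &[usize], seeds: &[u64]) -> Vec<u64> {
    // The seed must be applied per row, via the key lookup; precomputing
    // one hash per dictionary entry would silently ignore per-row seeds.
    keys.iter()
        .zip(seeds)
        .map(|(&k, &seed)| hash_one(values[k], seed))
        .collect()
}

fn main() {
    let values = ["a", "b"];
    let keys = [0, 1, 0, 1];
    let seeds = [42, 42, 7, 7];
    let hashes = hash_dictionary(&values, &keys, &seeds);
    // Same value + same seed gives the same hash...
    assert_eq!(hashes[0], hash_one("a", 42));
    // ...but the same value with a different per-row seed must differ.
    assert_ne!(hashes[0], hashes[2]);
    println!("{:?}", hashes);
}
```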

@andygrove
Member

See #426 for proposed DataGenerator class

@andygrove
Member

Our hash implementation is also not compatible with Spark. I will file an issue for that.

I filed #427

@advancedxy
Contributor Author

Our hash implementation is also not compatible with Spark. I will file an issue for that.

I filed #427

Thanks for filing this. I think it's the same issue for both murmur3 hash and xxhash64. I will submit a pr to fix that first.

@advancedxy
Contributor Author

advancedxy commented May 14, 2024

Found the issue. The create_hashes_dictionary doesn't handle the input hashes correctly; it affects both murmur3hash and this new xxhash64 method.

Let me try to fix that first.

I have submitted the fix in this PR and am waiting for CI to pass. I will create a separate PR with the murmur3 hash fix, which depends on your #426, tomorrow morning (Beijing time).

@advancedxy
Contributor Author

@andygrove @viirya I have created #433 and marked this as a draft. We should merge that first and then come back to this PR. PTAL when you have time.

@advancedxy advancedxy marked this pull request as ready for review May 28, 2024 01:14
@advancedxy
Contributor Author

@andygrove @viirya @parthchandra and @sunchao would you mind taking a look at this? I think it's ready for review.

Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Comment on lines +685 to +702
let num_rows = args[0..args.len() - 1]
.iter()
.find_map(|arg| match arg {
ColumnarValue::Array(array) => Some(array.len()),
ColumnarValue::Scalar(_) => None,
})
.unwrap_or(1);
let mut hashes: Vec<u64> = vec![0_u64; num_rows];
hashes.fill(*seed as u64);
let arrays = args[0..args.len() - 1]
.iter()
.map(|arg| match arg {
ColumnarValue::Array(array) => array.clone(),
ColumnarValue::Scalar(scalar) => {
scalar.clone().to_array_of_size(num_rows).unwrap()
}
})
.collect::<Vec<ArrayRef>>();
Contributor

nit: I feel this can be simplified a little bit

let arrays = args[0..args.len() - 1]
   ...;
let mut hashes: Vec<u64> = vec![0_u64; arrays.len()];
hashes.fill(*seed as u64);

Contributor Author

hmm. I think we have to compute num_rows first?
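The distinction matters because args can mix arrays and scalars: num_rows is the length of any array argument (falling back to 1 when every argument is a scalar), while arrays.len() is merely the number of columns. A standalone sketch with a simplified stand-in for DataFusion's ColumnarValue, not the PR's exact types:

```rust
// Simplified stand-in for DataFusion's ColumnarValue enum.
enum ColumnarValue {
    Array(Vec<i64>),
    Scalar(i64),
}

/// Row count of a mixed array/scalar argument list: the length of the
/// first array found, or 1 if all arguments are scalars.
fn num_rows(args: &[ColumnarValue]) -> usize {
    args.iter()
        .find_map(|arg| match arg {
            ColumnarValue::Array(a) => Some(a.len()),
            ColumnarValue::Scalar(_) => None,
        })
        .unwrap_or(1)
}

fn main() {
    let args = vec![
        ColumnarValue::Scalar(7),
        ColumnarValue::Array(vec![1, 2, 3, 4]),
    ];
    // Two columns but four rows: sizing the hashes buffer by the number of
    // arrays instead of the row count would allocate the wrong length.
    assert_eq!(num_rows(&args), 4);
    assert_eq!(num_rows(&[ColumnarValue::Scalar(1)]), 1);
    println!("ok");
}
```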

Comment on lines +294 to +296
DataType::Boolean => {
hash_array_boolean!(BooleanArray, col, i32, $hashes_buffer, $hash_method);
}
Contributor

nit: I wonder if we can make BooleanArray and i32 macro arguments, so that we can reduce this large match...

Contributor Author

hmm, let me give it a try. I will report back if it's too hard to do.

Contributor Author

If I understand your proposal correctly, do you mean something like:

    match col.data_type() {
        DataType::Int8 | DataType::Int16 | DataType::Int32 | DataType::Int64
        | DataType::UInt8 | DataType::UInt16 | DataType::UInt32 | DataType::UInt64 => {
            hash_array_primitive!(
                get_array_type_of!(col.data_type()),
                col,
                get_input_native_type_of!(col.data_type()),
                $hashes_buffer,
                $hash_method
            );
        }
        ....
    }

?

I tried to implement that, but couldn't find a way to do it. col.data_type() is a runtime value, and I don't think we can infer it at compile time.
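To illustrate the constraint being discussed: the data type is only known at runtime, so each match arm must spell out the concrete array and native types at compile time, though a macro can still factor out the repeated body by taking those types as arguments. A self-contained sketch with simplified stand-ins for Arrow's typed arrays, not Comet's real hash_array_primitive! macro:

```rust
// Simplified stand-ins for Arrow's DataType and typed arrays.
enum DataType { Int32, Int64 }

struct TypedArray<T> { values: Vec<T> }

trait AnyArray {
    fn data_type(&self) -> DataType;
    fn as_any(&self) -> &dyn std::any::Any;
}

impl AnyArray for TypedArray<i32> {
    fn data_type(&self) -> DataType { DataType::Int32 }
    fn as_any(&self) -> &dyn std::any::Any { self }
}
impl AnyArray for TypedArray<i64> {
    fn data_type(&self) -> DataType { DataType::Int64 }
    fn as_any(&self) -> &dyn std::any::Any { self }
}

// The macro takes the concrete array and native types as arguments, so the
// body is written once; the runtime match still has to pick the arm.
macro_rules! hash_primitive {
    ($col:expr, $array_ty:ty, $native_ty:ty, $hashes:expr) => {{
        let arr = $col.as_any().downcast_ref::<$array_ty>().unwrap();
        for (i, v) in arr.values.iter().enumerate() {
            // Stand-in "hash": widen to i64 and mix into the seed slot.
            $hashes[i] = $hashes[i].wrapping_add((*v as $native_ty as i64) as u64);
        }
    }};
}

fn hash_column(col: &dyn AnyArray, hashes: &mut [u64]) {
    match col.data_type() {
        DataType::Int32 => hash_primitive!(col, TypedArray<i32>, i32, hashes),
        DataType::Int64 => hash_primitive!(col, TypedArray<i64>, i64, hashes),
    }
}

fn main() {
    let col = TypedArray::<i32> { values: vec![1, 2, 3] };
    let mut hashes = vec![42u64; 3];
    hash_column(&col, &mut hashes);
    assert_eq!(hashes, vec![43, 44, 45]);
    println!("ok");
}
```

This is why the large match cannot be collapsed entirely: the downcast target in each arm must be a compile-time type, even when a macro hides the repetition.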

@advancedxy
Contributor Author

Gently ping @andygrove @viirya, do you have any more comments?

Member

@andygrove andygrove left a comment

This looks great to me. Thank you @advancedxy

@andygrove andygrove merged commit c79bd5c into apache:main Jun 3, 2024
43 checks passed
@advancedxy
Contributor Author

advancedxy commented Jun 4, 2024

Thanks all for reviewing, @andygrove @viirya @kazuyukitanimura @parthchandra

himadripal pushed a commit to himadripal/datafusion-comet that referenced this pull request Sep 7, 2024
* feat: Add xxhash64 function support

* Update related docs

* Update core/src/execution/datafusion/spark_hash.rs

Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>

* Update QueriesList results

---------

Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Parth Chandra <parthc@apple.com>