[SPARK-33524][SQL][TESTS] Change `InMemoryTable` not to use Tuple.hashCode for `BucketTransform` #30477

dongjoon-hyun · 2020-11-24T00:07:54Z

What changes were proposed in this pull request?

This PR aims to change InMemoryTable not to use Tuple.hashCode for BucketTransform.

Why are the changes needed?

SPARK-32168 made InMemoryTable handle BucketTransform by hashing a Tuple, whose hashCode depends on the Scala version.

https://github.com/apache/spark/blob/master/sql/catalyst/src/test/scala/org/apache/spark/sql/connector/InMemoryTable.scala#L159

Scala 2.12.10

$ bin/scala
Welcome to Scala 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_272).
Type in expressions for evaluation. Or try :help.

scala> (1, 1).hashCode
res0: Int = -2074071657

Scala 2.13.3

Welcome to Scala 2.13.3 (OpenJDK 64-Bit Server VM, Java 1.8.0_272).
Type in expressions for evaluation. Or try :help.

scala> (1, 1).hashCode
val res0: Int = -1669302457

Does this PR introduce any user-facing change?

Yes. This is a correctness issue.

How was this patch tested?

Passed the unit tests with both Scala 2.12 and 2.13.

rdblue

Looks good to me. Thanks for tracking this down, @dongjoon-hyun!

dongjoon-hyun · 2020-11-24T00:31:29Z

Thank you, @rdblue !

srowen · 2020-11-24T01:32:29Z

sql/catalyst/src/test/scala/org/apache/spark/sql/connector/InMemoryTable.scala

-        (extractor(ref.fieldNames, schema, row).hashCode() & Integer.MAX_VALUE) % numBuckets
+        val (value, dataType) = extractor(ref.fieldNames, schema, row)
+        val valueHashCode = if (value == null) 0 else value.hashCode
+        ((valueHashCode + dataType.hashCode()) & Integer.MAX_VALUE) % numBuckets


Seems fine. One common hashCode pattern is a + 31 * b, in the JVM source, FWIW.

dongjoon-hyun

Got it, @srowen . I'll update like that.
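
For readers following the review thread, the `a + 31 * b` suggestion can be sketched roughly as below. This is a minimal, self-contained illustration and not necessarily the code that was merged: `BucketSketch`, `bucketOf`, and the `typeName` string (an assumed stand-in for Spark's `dataType`) are hypothetical names introduced only for this example.

```scala
object BucketSketch {
  // Hypothetical helper (not part of Spark): hashes the value and the type
  // separately, combines them as a + 31 * b, clears the sign bit, and maps
  // the result into the range [0, numBuckets).
  def bucketOf(value: Any, typeName: String, numBuckets: Int): Int = {
    require(numBuckets > 0, "numBuckets must be positive")
    // A null value contributes 0, mirroring the null check in the diff above.
    val valueHashCode = if (value == null) 0 else value.hashCode
    // typeName is an assumed stand-in for dataType; String.hashCode is
    // specified by the JDK, so it does not vary with the Scala version.
    ((valueHashCode + 31 * typeName.hashCode) & Integer.MAX_VALUE) % numBuckets
  }
}
```

Because neither operand is a Scala tuple, the result should not depend on how a given Scala version implements `Tuple.hashCode`. Masking with `Integer.MAX_VALUE` keeps the combined hash non-negative before the modulo, so the bucket index always falls in `[0, numBuckets)`.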

SparkQA · 2020-11-24T03:33:27Z

Test build #131576 has finished for PR 30477 at commit c5a0b06.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2020-11-24T03:35:08Z

The last change affects only DataSourceV2SQLSuite. I manually verified it.

$ build/sbt "sql/testOnly *.DataSourceV2SQLSuite"
...
[info] - SPARK-31255: Project a metadata column (90 milliseconds)
[info] - SPARK-31255: Projects data column when metadata column has the same name (78 milliseconds)
[info] - SPARK-31255: * expansion does not include metadata columns (67 milliseconds)
[info] - SPARK-33505: insert into partitioned table (57 milliseconds)
19:31:17.051 WARN org.apache.spark.sql.connector.DataSourceV2SQLSuite:

===== POSSIBLE THREAD LEAK IN SUITE o.a.s.sql.connector.DataSourceV2SQLSuite, thread names: rpc-boss-3-1, shuffle-boss-6-1 =====
[info] ScalaTest
[info] Run completed in 26 seconds, 773 milliseconds.
[info] Total number of tests run: 223
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 223, failed 0, canceled 0, ignored 2, pending 0
[info] All tests passed.
[info] Passed: Total 223, Failed 0, Errors 0, Passed 223, Ignored 2
[success] Total time: 226 s (03:46), completed Nov 23, 2020 7:31:17 PM

dongjoon-hyun · 2020-11-24T03:35:37Z

I'll merge this. Thank you, @rdblue , @HyukjinKwon , @srowen .

…hCode for `BucketTransform`

This PR aims to change `InMemoryTable` not to use `Tuple.hashCode` for `BucketTransform`.

SPARK-32168 made `InMemoryTable` handle `BucketTransform` by hashing a `Tuple`, whose `hashCode` depends on the Scala version.
- https://github.com/apache/spark/blob/master/sql/catalyst/src/test/scala/org/apache/spark/sql/connector/InMemoryTable.scala#L159

**Scala 2.12.10**
```scala
$ bin/scala
Welcome to Scala 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_272).
Type in expressions for evaluation. Or try :help.

scala> (1, 1).hashCode
res0: Int = -2074071657
```

**Scala 2.13.3**
```scala
Welcome to Scala 2.13.3 (OpenJDK 64-Bit Server VM, Java 1.8.0_272).
Type in expressions for evaluation. Or try :help.

scala> (1, 1).hashCode
val res0: Int = -1669302457
```

Yes. This is a correctness issue.

Pass the UT with both Scala 2.12/2.13.

Closes #30477 from dongjoon-hyun/SPARK-33524.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 8380e00)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

github-actions bot added the SQL label Nov 24, 2020

[SPARK-33524][SQL]

c5a0b06

dongjoon-hyun changed the title ~~[SPARK-33524][SQL]~~ [SPARK-33524][SQL] Change BucketTransform not to use Tuple.hashCode. Nov 24, 2020

dongjoon-hyun mentioned this pull request Nov 24, 2020

[SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2 #28027

Closed

dongjoon-hyun requested review from rdblue and srowen November 24, 2020 00:19

rdblue approved these changes Nov 24, 2020

dongjoon-hyun changed the title ~~[SPARK-33524][SQL] Change BucketTransform not to use Tuple.hashCode.~~ [SPARK-33524][SQL][TESTS] Change BucketTransform not to use Tuple.hashCode. Nov 24, 2020

dongjoon-hyun changed the title ~~[SPARK-33524][SQL][TESTS] Change BucketTransform not to use Tuple.hashCode.~~ [SPARK-33524][SQL][TESTS] Change InMemoryTable not to use Tuple.hashCode for BucketTransform Nov 24, 2020

HyukjinKwon approved these changes Nov 24, 2020

srowen reviewed Nov 24, 2020

Address comments

74a50f5

dongjoon-hyun closed this in 8380e00 Nov 24, 2020

dongjoon-hyun deleted the SPARK-33524 branch November 24, 2020 03:37
