Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-33524][SQL][TESTS] Change InMemoryTable not to use Tuple.hashCode for BucketTransform #30477

Closed
wants to merge 2 commits into from

Conversation

dongjoon-hyun
Copy link
Member

@dongjoon-hyun dongjoon-hyun commented Nov 24, 2020

What changes were proposed in this pull request?

This PR aims to change InMemoryTable not to use Tuple.hashCode for BucketTransform.

Why are the changes needed?

SPARK-32168 made InMemoryTable to handle BucketTransform as a hash of Tuple which is dependents on Scala versions.

Scala 2.12.10

$ bin/scala
Welcome to Scala 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_272).
Type in expressions for evaluation. Or try :help.

scala> (1, 1).hashCode
res0: Int = -2074071657

Scala 2.13.3

Welcome to Scala 2.13.3 (OpenJDK 64-Bit Server VM, Java 1.8.0_272).
Type in expressions for evaluation. Or try :help.

scala> (1, 1).hashCode
val res0: Int = -1669302457

Does this PR introduce any user-facing change?

Yes. This is a correctness issue.

How was this patch tested?

Pass the UT with both Scala 2.12/2.13.

@github-actions github-actions bot added the SQL label Nov 24, 2020
@dongjoon-hyun dongjoon-hyun changed the title [SPARK-33524][SQL] [SPARK-33524][SQL] Change BucketTransform not to use Tuple.hashCode. Nov 24, 2020
Copy link
Contributor

@rdblue rdblue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Thanks for tracking this down, @dongjoon-hyun!

@dongjoon-hyun
Copy link
Member Author

Thank you, @rdblue !

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-33524][SQL] Change BucketTransform not to use Tuple.hashCode. [SPARK-33524][SQL][TESTS] Change BucketTransform not to use Tuple.hashCode. Nov 24, 2020
@dongjoon-hyun dongjoon-hyun changed the title [SPARK-33524][SQL][TESTS] Change BucketTransform not to use Tuple.hashCode. [SPARK-33524][SQL][TESTS] Change InMemoryTable not to use Tuple.hashCode for BucketTransform Nov 24, 2020
(extractor(ref.fieldNames, schema, row).hashCode() & Integer.MAX_VALUE) % numBuckets
val (value, dataType) = extractor(ref.fieldNames, schema, row)
val valueHashCode = if (value == null) 0 else value.hashCode
((valueHashCode + dataType.hashCode()) & Integer.MAX_VALUE) % numBuckets
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems fine. One common hashCode pattern is a + 31 * b, in the JVM source, FWIW.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, @srowen . I'll update like that.

@SparkQA
Copy link

SparkQA commented Nov 24, 2020

Test build #131576 has finished for PR 30477 at commit c5a0b06.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

The last change affects only DataSourceV2SQLSuite. I manually verified it.

$ build/sbt "sql/testOnly *.DataSourceV2SQLSuite"
...
[info] - SPARK-31255: Project a metadata column (90 milliseconds)
[info] - SPARK-31255: Projects data column when metadata column has the same name (78 milliseconds)
[info] - SPARK-31255: * expansion does not include metadata columns (67 milliseconds)
[info] - SPARK-33505: insert into partitioned table (57 milliseconds)
19:31:17.051 WARN org.apache.spark.sql.connector.DataSourceV2SQLSuite:

===== POSSIBLE THREAD LEAK IN SUITE o.a.s.sql.connector.DataSourceV2SQLSuite, thread names: rpc-boss-3-1, shuffle-boss-6-1 =====
[info] ScalaTest
[info] Run completed in 26 seconds, 773 milliseconds.
[info] Total number of tests run: 223
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 223, failed 0, canceled 0, ignored 2, pending 0
[info] All tests passed.
[info] Passed: Total 223, Failed 0, Errors 0, Passed 223, Ignored 2
[success] Total time: 226 s (03:46), completed Nov 23, 2020 7:31:17 PM

@dongjoon-hyun
Copy link
Member Author

I'll merge this. Thank you, @rdblue , @HyukjinKwon , @srowen .

dongjoon-hyun added a commit that referenced this pull request Nov 24, 2020
…hCode for `BucketTransform`

This PR aims to change `InMemoryTable` not to use `Tuple.hashCode` for `BucketTransform`.

SPARK-32168 made `InMemoryTable` to handle `BucketTransform` as a hash of `Tuple` which is dependents on Scala versions.
- https://github.com/apache/spark/blob/master/sql/catalyst/src/test/scala/org/apache/spark/sql/connector/InMemoryTable.scala#L159

**Scala 2.12.10**
```scala
$ bin/scala
Welcome to Scala 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_272).
Type in expressions for evaluation. Or try :help.

scala> (1, 1).hashCode
res0: Int = -2074071657
```

**Scala 2.13.3**
```scala
Welcome to Scala 2.13.3 (OpenJDK 64-Bit Server VM, Java 1.8.0_272).
Type in expressions for evaluation. Or try :help.

scala> (1, 1).hashCode
val res0: Int = -1669302457
```

Yes. This is a correctness issue.

Pass the UT with both Scala 2.12/2.13.

Closes #30477 from dongjoon-hyun/SPARK-33524.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 8380e00)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@dongjoon-hyun dongjoon-hyun deleted the SPARK-33524 branch November 24, 2020 03:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants