
[SPARK-32110][SQL] normalize special floating numbers in HyperLogLog++ #30673

Closed
wants to merge 1 commit into master from cloud-fan:bug

Conversation

cloud-fan (Contributor)

What changes were proposed in this pull request?

Currently, Spark treats 0.0 and -0.0 as semantically equal, but it still retains the difference between them so that users can see -0.0 when displaying the data set.

The comparison expressions in Spark handle these special floating-point numbers and implement the correct semantics. However, Spark doesn't always use comparison expressions to compare values, so we need to normalize the special floating-point numbers before comparing them in these places:

  1. GROUP BY
  2. join keys
  3. window partition keys

This PR fixes one more place that compares values without going through comparison expressions: HyperLogLog++.
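As an illustration of the user-facing effect (a minimal sketch, not taken from the PR; the session setup and column name are chosen for the example), an approximate distinct count over 0.0 and -0.0 should report a single value:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.approx_count_distinct

// Minimal local session just for this illustration.
val spark = SparkSession.builder().master("local[*]").appName("hll-negative-zero").getOrCreate()
import spark.implicits._

// 0.0 and -0.0 are semantically equal in Spark, so the approximate distinct
// count should be 1. Before this fix, HyperLogLog++ hashed the two different
// binary representations separately and could report 2.
val df = Seq(0.0d, -0.0d).toDF("v")
df.select(approx_count_distinct($"v")).show()
```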

Why are the changes needed?

Fix the query result of HyperLogLog++ on special floating-point values.

Does this PR introduce any user-facing change?

Yes, the result of HyperLogLog++ becomes correct now.

How was this patch tested?

A new test case, plus a few more test cases that already pass before this PR, added to improve test coverage.
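For reference, a plausible shape of the new HyperLogLogPlusPlusSuite test (a sketch only; the createEstimator helper name and its signature are assumptions, and the real test body may differ):

```scala
test("SPARK-32110: add 0.0 and -0.0") {
  // createEstimator is assumed to build the aggregate, its input row and its
  // aggregation buffer, as in the suite's other tests (hypothetical helper).
  val (hll, input, buffer) = createEstimator(0.05, DoubleType)

  input.setDouble(0, 0.0d)
  hll.update(buffer, input)
  input.setDouble(0, -0.0d)
  hll.update(buffer, input)

  // 0.0 and -0.0 are one value semantically, so the estimate must be 1.
  evaluateEstimate(hll, buffer, 1)
}
```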

github-actions bot added the SQL label Dec 8, 2020
@@ -554,4 +554,94 @@ class PredicateSuite extends SparkFunSuite with ExpressionEvalHelper {
checkEvaluation(GreaterThan(Literal(Float.NaN), Literal(Float.NaN)), false)
checkEvaluation(GreaterThan(Literal(0.0F), Literal(-0.0F)), false)
}

test("SPARK-32110: compare special double/float values in array") {
cloud-fan (Contributor, Author):

The new tests here already pass before this PR; I'm adding them to prove that nested 0.0/-0.0 is handled correctly. CodegenContext.genComp is very conservative and only takes the shortcut when both sides are in unsafe format and are binary-equal. 0.0 and -0.0 are not binary-equal, so the comparison falls back to the element-by-element comparison.
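To make the shortcut concrete, here is a rough sketch of the idea (illustrative only; this is neither the generated code nor the actual genComp implementation):

```scala
// Take the fast path only when the binary representations match exactly;
// otherwise fall back to an element-by-element semantic comparison,
// under which 0.0 == -0.0 holds.
def semanticallyEqual(a: Array[Double], b: Array[Double]): Boolean = {
  if (a.length != b.length) return false

  val binaryEqual = a.indices.forall { i =>
    java.lang.Double.doubleToRawLongBits(a(i)) == java.lang.Double.doubleToRawLongBits(b(i))
  }

  if (binaryEqual) true                       // binary-equal arrays are trivially equal
  else a.indices.forall(i => a(i) == b(i))    // fallback: per-element comparison
}

// Array(0.0) and Array(-0.0) are not binary-equal, so the fallback path runs
// and reports them equal.
assert(semanticallyEqual(Array(0.0), Array(-0.0)))
```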

evaluateEstimate(hll, buffer, 1);
}

test("SPARK-32110: add NaN") {
cloud-fan (Contributor, Author):

This test passes before this PR, as our hash implementation returns the same hash code for all NaN values. I'm adding it just to make the test coverage complete.
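A small sketch of why this holds (illustrative only; the canonicalization shown is the general idea, not Spark's hash code itself):

```scala
// Two different NaN bit patterns.
val canonicalNaN = java.lang.Float.intBitsToFloat(0x7fc00000)   // the canonical quiet NaN
val payloadNaN   = java.lang.Float.intBitsToFloat(0x7fc00001)   // a NaN carrying a payload

// Their raw bits differ, so hashing the raw bits would scatter NaN values...
assert(java.lang.Float.floatToRawIntBits(canonicalNaN) !=
       java.lang.Float.floatToRawIntBits(payloadNaN))

// ...but replacing every NaN with the canonical Float.NaN before hashing
// maps them all to the same bits, hence the same hash code.
def canonicalBits(f: Float): Int =
  java.lang.Float.floatToRawIntBits(if (f.isNaN) Float.NaN else f)

assert(canonicalBits(canonicalNaN) == canonicalBits(payloadNaN))
```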

@cloud-fan (Contributor, Author):

cc @maropu @viirya @dongjoon-hyun

@SparkQA commented Dec 8, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37034/

@@ -143,6 +143,28 @@ object NormalizeFloatingNumbers extends Rule[LogicalPlan] {

case _ => throw new IllegalStateException(s"fail to normalize $expr")
}

val FLOAT_NORMALIZER: Any => Any = (input: Any) => {
A reviewer (Member):

Not sure, can this just be a def?

cloud-fan (Contributor, Author):

This is stateless and being a val is more efficient.


val FLOAT_NORMALIZER: Any => Any = (input: Any) => {
val f = input.asInstanceOf[Float]
if (f.isNaN) {
A reviewer (Member):

I think this check isn't necessary, as NaN won't equal -0.0f, so it will be returned on line 154 anyway. Or am I missing that there are different NaNs and this is normalizing them too?

cloud-fan (Contributor, Author):

This is copied from the existing code. NaN is not a single bit pattern, and we need to make sure all NaN values end up with the same binary representation in Spark's unsafe row format.
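For context, a sketch of what the complete lambda presumably looks like (the hunk above is truncated after the isNaN check; the exact body in NormalizeFloatingNumbers may differ):

```scala
val FLOAT_NORMALIZER: Any => Any = (input: Any) => {
  val f = input.asInstanceOf[Float]
  if (f.isNaN) {
    Float.NaN      // collapse every NaN bit pattern to the canonical NaN
  } else if (f == -0.0f) {
    0.0f           // fold -0.0 into 0.0 so both hash and compare identically
  } else {
    f
  }
}
```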

@SparkQA commented Dec 8, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37034/

@SparkQA commented Dec 8, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37035/

@SparkQA commented Dec 8, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37035/

@SparkQA commented Dec 8, 2020

Test build #132434 has finished for PR 30673 at commit 291e652.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya (Member) left a comment:

lgtm

@viirya (Member) commented Dec 8, 2020

The test failure seems to be due to flaky tests.

@viirya (Member) commented Dec 8, 2020

retest this please

@dongjoon-hyun (Member) left a comment:

+1, LGTM. Thank you for adding extensive test coverage.
I also checked that test("SPARK-32110: add 0.0 and -0.0") verifies this patch.
Merged to master/3.1/3.0.

The CI failures are known and irrelevant to this one.

dongjoon-hyun pushed a commit that referenced this pull request Dec 8, 2020
Closes #30673 from cloud-fan/bug.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 6fd2345)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
dongjoon-hyun pushed a commit that referenced this pull request Dec 8, 2020
Closes #30673 from cloud-fan/bug.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 6fd2345)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@dongjoon-hyun (Member):

Thank you, @cloud-fan, @srowen, @viirya.

@SparkQA commented Dec 8, 2020

Test build #132447 has finished for PR 30673 at commit 291e652.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu (Member) commented Dec 9, 2020

late lgtm.
