[SPARK-32110][SQL] normalize special floating numbers in HyperLogLog++ #30673
Conversation
@@ -554,4 +554,94 @@ class PredicateSuite extends SparkFunSuite with ExpressionEvalHelper {
    checkEvaluation(GreaterThan(Literal(Float.NaN), Literal(Float.NaN)), false)
    checkEvaluation(GreaterThan(Literal(0.0F), Literal(-0.0F)), false)
  }

  test("SPARK-32110: compare special double/float values in array") {
The new tests here pass before this PR. I'm adding them to prove that nested 0.0/-0.0 is fine. CodegenContext.genComp is very conservative and only takes the shortcut when both sides are in the unsafe format and are equal to each other in binary. 0.0 and -0.0 are not equal in binary, so it falls back to the element-by-element comparison.
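For reference, a standalone Scala snippet (illustration only, not Spark code) showing why the binary shortcut cannot kick in for 0.0/-0.0: the two values compare equal as doubles, but their bit patterns differ.

    object ZeroBitsDemo {
      def main(args: Array[String]): Unit = {
        val posBits = java.lang.Double.doubleToRawLongBits(0.0)   // 0x0000000000000000
        val negBits = java.lang.Double.doubleToRawLongBits(-0.0)  // 0x8000000000000000
        println(0.0 == -0.0)        // true: semantically equal
        println(posBits == negBits) // false: different binary representations
      }
    }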
    evaluateEstimate(hll, buffer, 1)
  }

  test("SPARK-32110: add NaN") {
This test passes before this PR, as our hash implementation returns the same hash code for all NaN values. I'm adding it just to make the test coverage complete.
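As a standalone illustration of why every NaN ends up hashing the same, assuming the hash is computed over the canonical bits from java.lang.Float.floatToIntBits (whether Spark's hash takes exactly this path is an assumption here): floatToIntBits collapses every NaN payload to the single canonical pattern 0x7fc00000.

    object NaNHashDemo {
      def main(args: Array[String]): Unit = {
        val nan1 = java.lang.Float.intBitsToFloat(0x7fc00000) // canonical quiet NaN
        val nan2 = java.lang.Float.intBitsToFloat(0x7fc00001) // NaN with a different payload
        println(java.lang.Float.floatToIntBits(nan1).toHexString) // 7fc00000
        println(java.lang.Float.floatToIntBits(nan2).toHexString) // 7fc00000, same canonical bits
      }
    }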
Kubernetes integration test starting
@@ -143,6 +143,28 @@ object NormalizeFloatingNumbers extends Rule[LogicalPlan] {

    case _ => throw new IllegalStateException(s"fail to normalize $expr")
  }

  val FLOAT_NORMALIZER: Any => Any = (input: Any) => {
Not sure, can this just be a def?
This is stateless and being a val is more efficient.
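A minimal, self-contained sketch of the val-vs-def trade-off, with illustrative names rather than the Spark source: a def re-evaluates its right-hand side on every access, while a val evaluates it exactly once at initialization.

    object ValVsDefDemo {
      // def: the right-hand side is re-evaluated on every access.
      def normalizerDef: Any => Any = { println("building def normalizer"); (x: Any) => x }

      // val: the right-hand side is evaluated exactly once, when the object initializes.
      val normalizerVal: Any => Any = { println("building val normalizer"); (x: Any) => x }

      def main(args: Array[String]): Unit = {
        normalizerDef(1); normalizerDef(2) // "building def normalizer" is printed twice
        normalizerVal(1); normalizerVal(2) // "building val normalizer" was printed once, at init
      }
    }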
  val FLOAT_NORMALIZER: Any => Any = (input: Any) => {
    val f = input.asInstanceOf[Float]
    if (f.isNaN) {
I think this check isn't necessary, as NaN won't equal -0.0f, so it will be returned on line 154 anyway. Or am I missing that there are different NaNs and this is normalizing them too?
This is copied from the existing code. NaN is not a single value, and we need to make sure all NaN values have the same binary representation in Spark's unsafe row format.
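A standalone sketch of that point (illustration only; normalize below is a hypothetical helper, not the Spark implementation): distinct NaN payloads have different raw bits, so a byte-for-byte comparison of a binary row format would treat them as different values unless they are first collapsed to one canonical NaN.

    object NaNNormalizeDemo {
      // Collapse any NaN payload to the canonical Float.NaN (sketch of the idea only).
      def normalize(f: Float): Float = if (f.isNaN) Float.NaN else f

      def main(args: Array[String]): Unit = {
        val a = java.lang.Float.intBitsToFloat(0x7fc00000) // canonical quiet NaN
        val b = java.lang.Float.intBitsToFloat(0x7fc00001) // NaN with a different payload
        // Raw bits differ (on typical JVMs the payload survives the round trip),
        // so a byte-wise comparison sees two different values ...
        println(java.lang.Float.floatToRawIntBits(a) == java.lang.Float.floatToRawIntBits(b)) // false
        // ... but after normalization both map to the same bit pattern.
        println(java.lang.Float.floatToRawIntBits(normalize(a)) ==
                java.lang.Float.floatToRawIntBits(normalize(b)))                              // true
      }
    }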
Kubernetes integration test status success
Kubernetes integration test starting
Kubernetes integration test status success
Test build #132434 has finished for PR 30673 at commit
lgtm
The test failure seems to be due to flaky tests.
retest this please
+1, LGTM. Thank you for adding extensive test coverage.
I also checked that test("SPARK-32110: add 0.0 and -0.0") verifies this patch.
Merged to master/3.1/3.0.
The CI failures are known issues and unrelated to this PR.
Closes #30673 from cloud-fan/bug.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 6fd2345)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
Thank you, @cloud-fan, @srowen, @viirya.
Test build #132447 has finished for PR 30673 at commit
late lgtm.
What changes were proposed in this pull request?
Currently, Spark treats 0.0 and -0.0 semantically equal, while it still retains the difference between them so that users can see -0.0 when displaying the data set.
The comparison expressions in Spark take care of the special floating numbers and implement the correct semantics. However, Spark doesn't always use these comparison expressions to compare values, and we need to normalize the special floating numbers before comparing them in these places:
1. GROUP BY
2. join keys
3. window partition keys
This PR fixes one more place that compares values without using comparison expressions: HyperLogLog++.
Why are the changes needed?
Fix an incorrect query result.
Does this PR introduce any user-facing change?
Yes, the result of HyperLogLog++ becomes correct now.
How was this patch tested?
A new test case, plus a few more test cases that already pass before this PR, to improve test coverage.
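For readers following along, here is a minimal standalone sketch of the idea behind the fix, using hypothetical names (normalizeDouble, distinctByBits) rather than the actual Spark HyperLogLog++ code: normalize -0.0 and all NaN payloads to canonical values before they reach a hash- or binary-based distinct count, so that semantically equal floats are counted once.

    object NormalizeBeforeSketch {
      // Sketch of the normalization: one canonical NaN, fold -0.0 into 0.0.
      def normalizeDouble(d: Double): Double =
        if (d.isNaN) Double.NaN
        else if (d == -0.0d) 0.0d
        else d

      def main(args: Array[String]): Unit = {
        val values = Seq(0.0d, -0.0d, Double.NaN,
          java.lang.Double.longBitsToDouble(0x7ff8000000000001L)) // NaN with a different payload

        // Stand-in for the sketch: count distinct *binary* representations,
        // the way a hash-based distinct-count effectively does.
        def distinctByBits(xs: Seq[Double]): Int =
          xs.map(d => java.lang.Double.doubleToRawLongBits(d)).distinct.size

        println(distinctByBits(values))                      // typically 4 without normalization
        println(distinctByBits(values.map(normalizeDouble))) // 2 after normalization: 0.0 and NaN
      }
    }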