Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-32136][SQL] NormalizeFloatingNumbers should work on null struct #28962

Closed
wants to merge 2 commits into from

Conversation

viirya
Copy link
Member

@viirya viirya commented Jul 1, 2020

What changes were proposed in this pull request?

This patch fixes wrong groupBy result if the grouping key is a null-value struct.

Why are the changes needed?

NormalizeFloatingNumbers reconstructs a struct if input expression is StructType. If the input struct is null, it will reconstruct a struct with null-value fields, instead of null.

Does this PR introduce any user-facing change?

Yes, fixing incorrect groupBy result.

How was this patch tested?

Unit test.

@viirya
Copy link
Member Author

viirya commented Jul 1, 2020

cc @cloud-fan @HyukjinKwon

@dongjoon-hyun
Copy link
Member

Thank you so much, @viirya .

Copy link
Contributor

@cloud-fan cloud-fan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch and thanks for the quick fix!

@@ -123,7 +123,8 @@ object NormalizeFloatingNumbers extends Rule[LogicalPlan] {
val fields = expr.dataType.asInstanceOf[StructType].fields.indices.map { i =>
normalize(GetStructField(expr, i))
}
CreateStruct(fields)
val struct = CreateStruct(fields)
If(IsNull(expr), Literal(null, struct.dataType), struct)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm just wondering if we need a new Literal here. Maybe, can we simple just put expr?

- val struct = CreateStruct(fields)
- If(IsNull(expr), Literal(null, struct.dataType), struct)
+ If(IsNull(expr), expr, CreateStruct(fields))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The AS-IS also looks good to me, too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

literal is better here to simplify the expression tree, and probably has better perf.

@SparkQA
Copy link

SparkQA commented Jul 1, 2020

Test build #124746 has finished for PR 28962 at commit 41a318e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

retest this please

@cloud-fan
Copy link
Contributor

The test failure is legitimate, can you take a look? Seems we need to skip the NormalizeFloatingNumbers rule if the input expression contains KnownFloatingPointNormalized

@SparkQA
Copy link

SparkQA commented Jul 1, 2020

Test build #124799 has finished for PR 28962 at commit d0586c9.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Jul 1, 2020

retest this please

@@ -123,7 +123,8 @@ object NormalizeFloatingNumbers extends Rule[LogicalPlan] {
val fields = expr.dataType.asInstanceOf[StructType].fields.indices.map { i =>
normalize(GetStructField(expr, i))
}
CreateStruct(fields)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know why this works before?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CreateStruct re-creates the struct when NormalizeFloatingNumbers is run. The original expr is not used. But now we wrap expr in IsNull.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, got it!

This reminds me that we can optimize the code a bit: Instead of only matching CreateNamedStruct, we can have a whitelist to apply the normalization to the children. For example, If(..., CreateNamedStruct(...)), we should apply normalization to the inner CreateNamedStruct.

We can do this later.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For If(..., CreateNamedStruct(...)), it is struct type so already covered by the StructType case. So the normalization is already applied to the inner CreateNamedStruct.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea I know, but I mean we can do better. We can directly apply the normalization to the children in If(..., CreateNamedStruct(children)), instead of treating the outer If as a black box and apply normalization to it.

@SparkQA
Copy link

SparkQA commented Jul 1, 2020

Test build #124771 has finished for PR 28962 at commit 41a318e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member

It seems that we hit another relevant failure.

[info] - NaN and -0.0 in join keys *** FAILED *** (1 second, 385 milliseconds)
[info]   org.apache.spark.sql.catalyst.errors.package$TreeNodeException:
Once strategy's idempotence is broken for batch NormalizeFloatingNumbers

@viirya
Copy link
Member Author

viirya commented Jul 1, 2020

No, it was triggered before the last commit.

@dongjoon-hyun
Copy link
Member

Oh, my bad. Got it~

@SparkQA
Copy link

SparkQA commented Jul 1, 2020

Test build #124803 has finished for PR 28962 at commit d0586c9.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member

PySpark failure looks irrelevant.

FAIL: test_train_prediction (pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests)
Test that error on test data improves as model is trained.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 466, in test_train_prediction
    eventually(condition, timeout=180.0)
  File "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/testing/utils.py", line 81, in eventually
    lastValue = condition()
  File "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 461, in condition
    self.assertGreater(errors[1] - errors[-1], 2)
AssertionError: 1.672640157855923 not greater than 2

@viirya
Copy link
Member Author

viirya commented Jul 1, 2020

retest this please...

@SparkQA
Copy link

SparkQA commented Jul 2, 2020

Test build #124829 has finished for PR 28962 at commit d0586c9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

Merged to master and branch-3.0.

HyukjinKwon pushed a commit that referenced this pull request Jul 2, 2020
### What changes were proposed in this pull request?

This patch fixes wrong groupBy result if the grouping key is a null-value struct.

### Why are the changes needed?

`NormalizeFloatingNumbers` reconstructs a struct if input expression is StructType. If the input struct is null, it will reconstruct a struct with null-value fields, instead of null.

### Does this PR introduce _any_ user-facing change?

Yes, fixing incorrect groupBy result.

### How was this patch tested?

Unit test.

Closes #28962 from viirya/SPARK-32136.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit 3f7780d)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
@viirya
Copy link
Member Author

viirya commented Jul 2, 2020

Thanks all!

dongjoon-hyun pushed a commit that referenced this pull request Jul 12, 2020
…seWhen/Coalesce child expressions

### What changes were proposed in this pull request?

This patch proposes to let `NormalizeFloatingNumbers` rule directly normalizes on certain children expressions. It could simplify expression tree.

### Why are the changes needed?

Currently NormalizeFloatingNumbers rule treats some expressions as black box but we can optimize it a bit by normalizing directly the inner children expressions.

Also see #28962 (comment).

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests.

Closes #29061 from viirya/SPARK-32258.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@viirya viirya deleted the SPARK-32136 branch December 27, 2023 18:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants