
[SPARK-26233][SQL] CheckOverflow when encoding a decimal value #23210

Closed

Conversation

@mgaido91 (Contributor) commented Dec 3, 2018

What changes were proposed in this pull request?

When we encode a Decimal coming from an external source, we don't run CheckOverflow on it. That check is useful not only to enforce that the value fits in the specified range: it also rescales the underlying data to the declared precision/scale. Since our code generation assumes that a decimal has exactly the same precision and scale as its data type, failing to enforce this can lead to corrupted output/results when subsequent transformations are applied.
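
To make the failure mode concrete, here is a minimal sketch with assumed values (not taken from the PR or the JIRA, so the exact reproduction may differ):

```scala
// Sketch only: illustrates the shape of the bug, not an exact reproduction.
import org.apache.spark.sql.SparkSession

object Spark26233Sketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("SPARK-26233 sketch")
      .getOrCreate()
    import spark.implicits._

    // The encoder maps scala.BigDecimal to the default DecimalType(38, 18),
    // but this literal's runtime scale is 21. Without CheckOverflow in the
    // serializer, the underlying Decimal keeps the mismatched scale.
    val ds = Seq(BigDecimal("0.123456789012345678901")).toDS()

    // The generated code for this arithmetic assumes the declared scale of
    // 18; operating on a decimal whose underlying scale differs is what can
    // corrupt results before this fix.
    ds.select(($"value" + BigDecimal("0.1")).as("sum")).show(false)

    spark.stop()
  }
}
```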

How was this patch tested?

added UT

@SparkQA commented Dec 3, 2018

Test build #99621 has finished for PR 23210 at commit 91d3e1b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -1647,6 +1647,15 @@ class DatasetSuite extends QueryTest with SharedSQLContext {
checkDataset(ds, data: _*)
checkAnswer(ds.select("x"), Seq(Row(1), Row(2)))
}

test("SPARK-26233: serializer should enforce decimal precision and scale") {
@dongjoon-hyun (Member) commented on the diff:

Can we have a test case in RowEncoderSuite, too?

@mgaido91 (Contributor, Author) replied:

Well, everything is possible, but it is not easy, actually. The issue here happens in the generated code, not when we retrieve the output; so if we just encode and decode, everything is fine. The problem arises if there is any transformation in the generated code in between, because there the underlying decimal is used directly, on the assumption that it has the same precision and scale as the data type, which without the current change is not always true. I tried checking the precision and scale of the serialized object, but that is not really feasible, since they are converted back when the value is read (please see UnsafeRow)... So I'd avoid this, actually.
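
For illustration, a sketch of why a round-trip assertion passes regardless, using the 2.x-era encoder API (toRow/fromRow) and assumed values:

```scala
// Sketch, assumed values: a plain encode/decode round-trip cannot observe
// the mismatched scale, because the value is normalized on the write/read
// path (see UnsafeRowWriter.write and UnsafeRow.getDecimal).
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.{DecimalType, StructField, StructType}

val schema = StructType(Seq(StructField("d", DecimalType(38, 18))))
val encoder = RowEncoder(schema).resolveAndBind()

// Runtime scale is 1, declared scale is 18.
val internalRow = encoder.toRow(Row(BigDecimal("123.4").bigDecimal))
val roundTripped = encoder.fromRow(internalRow)
// roundTripped.getDecimal(0) already carries the declared scale here, so
// an assertion on it passes with or without the fix; only an intermediate
// transformation in generated code exposes the problem.
```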

@dongjoon-hyun (Member) commented:

LGTM except one minor comment about a new test case.

Ping, @gatorsmile and @cloud-fan . Could you review this PR? This is to fix a correctness issue.

@dongjoon-hyun (Member) left a review:

+1, LGTM.
It seems there are no other comments; I'll merge this to master.

@dongjoon-hyun (Member) commented Dec 4, 2018

Thank you, @mgaido91. This needs to land on branch-2.4/branch-2.3/branch-2.2, but it fails on branch-2.4 due to conflicts in the test case file. Could you make separate backport PRs for each branch?

@asfgit closed this in 556d83e on Dec 4, 2018
@cloud-fan (Contributor) commented:

a late LGTM

mgaido91 added a commit to mgaido91/spark that referenced this pull request Dec 5, 2018
When we encode a Decimal coming from an external source, we don't run CheckOverflow on it. That check is useful not only to enforce that the value fits in the specified range: it also rescales the underlying data to the declared precision/scale. Since our code generation assumes that a decimal has exactly the same precision and scale as its data type, failing to enforce this can lead to corrupted output/results when subsequent transformations are applied.

added UT

Closes apache#23210 from mgaido91/SPARK-26233.

Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
mgaido91 added a commit to mgaido91/spark that referenced this pull request Dec 5, 2018
mgaido91 added a commit to mgaido91/spark that referenced this pull request Dec 5, 2018
@mgaido91 (Contributor, Author) commented Dec 5, 2018

Thanks @cloud-fan @dongjoon-hyun, I created the PRs for the backports.

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019