Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-24586][SQL] Upcast should not allow casting from string to other types #21586

Closed
wants to merge 8 commits into from

Conversation

cloud-fan
Copy link
Contributor

@cloud-fan cloud-fan commented Jun 18, 2018

What changes were proposed in this pull request?

When turning a Dataset to another Dataset, Spark will up cast the fields in the original Dataset to the type of corresponding fields in the target DataSet.

However, the current upcast behavior is a little weird, we don't allow up casting from string to numeric, but allow non-numeric types as the target, like boolean, date, etc.

As a result, Seq("str").toDS.as[Int] fails, but Seq("str").toDS.as[Boolean] works and throw NPE during execution.

The motivation of the up cast is to prevent things like runtime NPE, it's more reasonable to make up cast stricter.

This PR does 2 things:

  1. rename Cast.canSafeCast to Cast.canUpcast, and support complex typres
  2. remove Cast.mayTruncate and replace it with !Cast.canUpcast

Note that, the up cast change also affects persistent view resolution. But since we don't support changing column types of an existing table, there is no behavior change here.

How was this patch tested?

new tests

@cloud-fan
Copy link
Contributor Author

cc @marmbrus @viirya @gatorsmile

@SparkQA
Copy link

SparkQA commented Jun 19, 2018

Test build #92052 has finished for PR 21586 at commit c89d12e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member

maropu commented Jun 19, 2018

retest this please

@@ -2299,7 +2299,7 @@ class Analyzer(
case u @ UpCast(child, _, _) if !child.resolved => u

case UpCast(child, dataType, walkedTypePath)
if Cast.mayTruncate(child.dataType, dataType) =>
if !Cast.canSafeCast(child.dataType, dataType) =>
fail(child, dataType, walkedTypePath)
Copy link
Member

@viirya viirya Jun 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the error message of thrown AnalysisException valid under this change? Seems it is not just because truncating?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

val fromPrecedence = TypeCoercion.numericPrecedence.indexOf(from)
val toPrecedence = TypeCoercion.numericPrecedence.indexOf(to)
toPrecedence > 0 && fromPrecedence > toPrecedence
fromPrecedence >= 0 && fromPrecedence < toPrecedence
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

super nit: fromPrecedence != -1 && fromPrecedence < toPrecedence is more easy-to-understand?

case (TimestampType, DateType) => true
case (StringType, to: NumericType) => true
def canSafeCast(from: DataType, to: DataType): Boolean = (from, to) match {
case _ if from == to => true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems nullable of StructField can affect the casting decision?

Copy link
Contributor Author

@cloud-fan cloud-fan Apr 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added support for complex types

@SparkQA
Copy link

SparkQA commented Jun 19, 2018

Test build #92059 has finished for PR 21586 at commit c89d12e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Jul 5, 2018
…is wrapped in Cast

## What changes were proposed in this pull request?

As mentioned in #21586 , `Cast.mayTruncate` is not 100% safe, string to boolean is allowed. Since changing `Cast.mayTruncate` also changes the behavior of Dataset, here I propose to add a new `Cast.canSafeCast` for partition pruning.

## How was this patch tested?

new test cases

Author: Wenchen Fan <wenchen@databricks.com>

Closes #21712 from cloud-fan/safeCast.
@SparkQA
Copy link

SparkQA commented Aug 3, 2018

Test build #94103 has finished for PR 21586 at commit c89d12e.

  • This patch fails due to an unknown error code, -9.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 11, 2019

Test build #104512 has finished for PR 21586 at commit 2a64d59.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 11, 2019

Test build #104510 has finished for PR 21586 at commit cde33ca.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 11, 2019

Test build #104518 has finished for PR 21586 at commit 93f43ad.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 15, 2019

Test build #104589 has finished for PR 21586 at commit 13a7846.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented May 15, 2019

Test build #105401 has finished for PR 21586 at commit 7c6d9a8.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 15, 2019

Test build #105398 has finished for PR 21586 at commit 13a7846.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 15, 2019

Test build #105400 has finished for PR 21586 at commit a199906.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 15, 2019

Test build #105408 has finished for PR 21586 at commit cf8d05f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

*/
def mayTruncate(from: DataType, to: DataType): Boolean = (from, to) match {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this will be ugly, but I still prefer to adding a conf for falling back to the original behavior.

@SparkQA
Copy link

SparkQA commented May 17, 2019

Test build #105490 has finished for PR 21586 at commit 14e3de3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 17, 2019

Test build #105486 has finished for PR 21586 at commit e558ad4.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented May 20, 2019

Test build #105551 has finished for PR 21586 at commit a638a53.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor Author

retest this please

Copy link
Member

@gatorsmile gatorsmile left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@SparkQA
Copy link

SparkQA commented May 21, 2019

Test build #105598 has finished for PR 21586 at commit a638a53.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented May 21, 2019

Test build #105618 has finished for PR 21586 at commit a638a53.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor Author

thanks, merging to master!

@cloud-fan cloud-fan closed this in 03c9e8a May 22, 2019
@dongjoon-hyun
Copy link
Member

cc @dbtsai

@@ -138,6 +138,8 @@ license: |
need to specify a value with units like "30s" now, to avoid being interpreted as milliseconds; otherwise,
the extremely short interval that results will likely cause applications to fail.

- When turning a Dataset to another Dataset, Spark will up cast the fields in the original Dataset to the type of corresponding fields in the target DataSet. In version 2.4 and earlier, this up cast is not very strict, e.g. `Seq("str").toDS.as[Int]` fails, but `Seq("str").toDS.as[Boolean]` works and throw NPE during execution. In Spark 3.0, the up cast is stricter and turning String into something else is not allowed, i.e. `Seq("str").toDS.as[Boolean]` will fail during analysis.
Copy link
Contributor

@manuzhang manuzhang Feb 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan shall we mention setting spark.sql.legacy.looseUpcast=true to preserve old behavior?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah we should

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are you interested in creating a PR (with a new JIRA) to fix the doc? 2.4 is EOL but I think it's good to fix the docs anyway.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. I will create one.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#35519 has been created.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants