[SPARK-24586][SQL] Upcast should not allow casting from string to other types #21586
Conversation
Test build #92052 has finished for PR 21586 at commit
retest this please
```diff
@@ -2299,7 +2299,7 @@ class Analyzer(
 case u @ UpCast(child, _, _) if !child.resolved => u

 case UpCast(child, dataType, walkedTypePath)
-  if Cast.mayTruncate(child.dataType, dataType) =>
+  if !Cast.canSafeCast(child.dataType, dataType) =>
   fail(child, dataType, walkedTypePath)
```
Is the error message of the thrown AnalysisException still valid under this change? It seems the failure is no longer just about truncation.
updated
```diff
 val fromPrecedence = TypeCoercion.numericPrecedence.indexOf(from)
 val toPrecedence = TypeCoercion.numericPrecedence.indexOf(to)
-toPrecedence > 0 && fromPrecedence > toPrecedence
+fromPrecedence >= 0 && fromPrecedence < toPrecedence
```
super nit: `fromPrecedence != -1 && fromPrecedence < toPrecedence` is easier to understand?
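The precedence condition being discussed can be illustrated with a minimal, self-contained sketch (plain Scala with type names as strings; this is a stand-in, not Spark's actual `TypeCoercion.numericPrecedence` or `DataType` classes). A cast is a safe widening only when the source type appears in the precedence list at a strictly lower index than the target; the `from == to` case is handled separately in `canSafeCast`.

```scala
object PrecedenceSketch {
  // Stand-in for TypeCoercion.numericPrecedence, ordered narrowest to widest.
  val numericPrecedence: Seq[String] =
    Seq("byte", "short", "int", "long", "float", "double")

  // The new condition from the diff: the source must be a known numeric
  // type (indexOf >= 0) that is strictly narrower than the target.
  def legalNumericPrecedence(from: String, to: String): Boolean = {
    val fromPrecedence = numericPrecedence.indexOf(from)
    val toPrecedence = numericPrecedence.indexOf(to)
    fromPrecedence >= 0 && fromPrecedence < toPrecedence
  }
}
```

Note that `fromPrecedence >= 0` doubles as the "is this a numeric type at all" check: a non-numeric source like a string gets index -1 and is rejected, which is exactly what the `!= -1` spelling in the nit above makes explicit.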
```scala
case (TimestampType, DateType) => true
case (StringType, to: NumericType) => true

def canSafeCast(from: DataType, to: DataType): Boolean = (from, to) match {
  case _ if from == to => true
```
Seems `nullable` of `StructField` can affect the casting decision?
added support for complex types
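How an up-cast check can recurse into struct fields while honoring the nullability concern raised above can be sketched in plain, self-contained Scala (invented `DType`/`Field` types, not Spark's actual classes): casting a nullable source field to a non-nullable target field is unsafe, while the reverse direction is fine.

```scala
sealed trait DType
case object IntT extends DType
case object LongT extends DType
case object StringT extends DType
case class Field(name: String, dataType: DType, nullable: Boolean)
case class StructT(fields: Seq[Field]) extends DType

object StructUpCastSketch {
  def canSafeCast(from: DType, to: DType): Boolean = (from, to) match {
    case _ if from == to => true
    case (IntT, LongT)   => true // stand-in for the numeric widening rule
    case (StructT(fs), StructT(ts)) if fs.length == ts.length =>
      fs.zip(ts).forall { case (f, t) =>
        // reject nullable -> non-nullable, then recurse on the field types
        (t.nullable || !f.nullable) && canSafeCast(f.dataType, t.dataType)
      }
    case _ => false
  }
}
```

The key line is `(t.nullable || !f.nullable)`: a target field must be at least as permissive about nulls as the source field, otherwise a null that the source type allows could surface where the target promises it cannot appear.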
Test build #92059 has finished for PR 21586 at commit
…is wrapped in Cast

## What changes were proposed in this pull request?

As mentioned in #21586, `Cast.mayTruncate` is not 100% safe: string to boolean is allowed. Since changing `Cast.mayTruncate` also changes the behavior of Dataset, here I propose to add a new `Cast.canSafeCast` for partition pruning.

## How was this patch tested?

new test cases

Author: Wenchen Fan <wenchen@databricks.com>

Closes #21712 from cloud-fan/safeCast.
Test build #94103 has finished for PR 21586 at commit
Force-pushed from cde33ca to 2a64d59
Test build #104512 has finished for PR 21586 at commit
Test build #104510 has finished for PR 21586 at commit
Test build #104518 has finished for PR 21586 at commit
Test build #104589 has finished for PR 21586 at commit
retest this please
Test build #105401 has finished for PR 21586 at commit
Test build #105398 has finished for PR 21586 at commit
Test build #105400 has finished for PR 21586 at commit
Test build #105408 has finished for PR 21586 at commit
```scala
 */
def mayTruncate(from: DataType, to: DataType): Boolean = (from, to) match {
```
I know this will be ugly, but I still prefer adding a conf for falling back to the original behavior.
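The conf-based fallback suggested here can be sketched in self-contained Scala (all names invented; this is not Spark's actual SQLConf plumbing). With the legacy flag on, the analyzer would fall back to the old, looser truncation check; otherwise it uses the new, stricter up-cast check.

```scala
case class ConfSketch(looseUpcast: Boolean = false)

object FallbackSketch {
  // Stand-in for the old check: it only flagged string -> numeric as
  // truncation, so string -> boolean slipped through.
  def mayTruncate(from: String, to: String): Boolean =
    from == "string" &&
      Set("byte", "short", "int", "long", "float", "double").contains(to)

  // Stand-in for the new check: identity plus a small widening table;
  // string -> anything-else is simply never listed, so it is rejected.
  private val widening = Map("int" -> Set("long", "double"), "long" -> Set("double"))
  def canUpCast(from: String, to: String): Boolean =
    from == to || widening.getOrElse(from, Set.empty[String]).contains(to)

  // The conf decides which check governs the up-cast decision.
  def allowed(from: String, to: String, conf: ConfSketch): Boolean =
    if (conf.looseUpcast) !mayTruncate(from, to) else canUpCast(from, to)
}
```

The sketch makes the behavior gap concrete: under the loose flag, string to boolean is allowed (it is not a numeric truncation), while the strict default rejects it.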
Test build #105490 has finished for PR 21586 at commit
Test build #105486 has finished for PR 21586 at commit
Test build #105551 has finished for PR 21586 at commit
retest this please
LGTM
Test build #105598 has finished for PR 21586 at commit
retest this please
Test build #105618 has finished for PR 21586 at commit
thanks, merging to master!
cc @dbtsai
```diff
@@ -138,6 +138,8 @@ license: |
   need to specify a value with units like "30s" now, to avoid being interpreted as milliseconds; otherwise,
   the extremely short interval that results will likely cause applications to fail.

+  - When turning a Dataset into another Dataset, Spark will up cast the fields in the original Dataset to the types of the corresponding fields in the target Dataset. In version 2.4 and earlier, this up cast is not very strict, e.g. `Seq("str").toDS.as[Int]` fails, but `Seq("str").toDS.as[Boolean]` works and throws an NPE during execution. In Spark 3.0, the up cast is stricter and turning a String into something else is not allowed, i.e. `Seq("str").toDS.as[Boolean]` will fail during analysis.
```
@cloud-fan shall we mention setting `spark.sql.legacy.looseUpcast=true` to preserve the old behavior?
Yeah we should
are you interested in creating a PR (with a new JIRA) to fix the doc? 2.4 is EOL but I think it's good to fix the docs anyway.
Sure. I will create one.
#35519 has been created.
What changes were proposed in this pull request?

When turning a Dataset into another Dataset, Spark will up cast the fields in the original Dataset to the types of the corresponding fields in the target Dataset.

However, the current upcast behavior is a little weird: we don't allow up casting from string to numeric, but we do allow non-numeric types as the target, like boolean, date, etc.

As a result, `Seq("str").toDS.as[Int]` fails, but `Seq("str").toDS.as[Boolean]` works and throws an NPE during execution. Since the motivation of the up cast is to prevent things like runtime NPEs, it's more reasonable to make up cast stricter.

This PR does 2 things:
- Rename `Cast.canSafeCast` to `Cast.canUpcast`, and support complex types
- Remove `Cast.mayTruncate` and replace it with `!Cast.canUpcast`

Note that the up cast change also affects persistent view resolution. But since we don't support changing column types of an existing table, there is no behavior change here.

How was this patch tested?

new tests