[SPARK-47569][SQL] Disallow comparing variant. #45726

Status: Closed · wants to merge 5 commits
6 changes: 6 additions & 0 deletions common/utils/src/main/resources/error/error-classes.json
@@ -1390,6 +1390,12 @@
    ],
    "sqlState" : "42805"
  },
  "GROUP_EXPRESSION_TYPE_IS_NOT_ORDERABLE" : {
    "message" : [
      "The expression <sqlExpr> cannot be used as a grouping expression because its data type <dataType> is not an orderable data type."
    ],
    "sqlState" : "42822"
  },
  "HLL_INVALID_INPUT_SKETCH_BUFFER" : {
    "message" : [
      "Invalid call to <function>; only valid HLL sketch buffers are supported as inputs (such as those produced by the `hll_sketch_agg` function)."
6 changes: 6 additions & 0 deletions docs/sql-error-conditions.md
@@ -846,6 +846,12 @@
GROUP BY `<index>` refers to an expression `<aggExpr>` that contains an aggregate function.

GROUP BY position `<index>` is not in select list (valid range is [1, `<size>`]).

### GROUP_EXPRESSION_TYPE_IS_NOT_ORDERABLE

[SQLSTATE: 42822](sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation)

The expression `<sqlExpr>` cannot be used as a grouping expression because its data type `<dataType>` is not an orderable data type.

### HLL_INVALID_INPUT_SKETCH_BUFFER

[SQLSTATE: 22546](sql-error-conditions-sqlstates.html#class-22-data-exception)
@@ -16,14 +16,15 @@
*/
package org.apache.spark.sql.catalyst.expressions

-import org.apache.spark.sql.types.{ArrayType, AtomicType, DataType, NullType, StructType, UserDefinedType}
+import org.apache.spark.sql.types.{ArrayType, AtomicType, DataType, NullType, StructType, UserDefinedType, VariantType}

object OrderUtils {
  /**
   * Returns true iff the data type can be ordered (i.e. can be sorted).
   */
  def isOrderable(dataType: DataType): Boolean = dataType match {
    case NullType => true
    case VariantType => false
    case dt: AtomicType => true
    case struct: StructType => struct.fields.forall(f => isOrderable(f.dataType))
    case array: ArrayType => isOrderable(array.elementType)
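The recursion above short-circuits as soon as it sees `VariantType`, so a variant nested anywhere in a struct or array makes the whole type unorderable. A standalone sketch of the same rule, using a toy type ADT (illustrative names only, not Spark's actual `org.apache.spark.sql.types` classes):

```scala
// Toy model of the orderability rule: a variant anywhere in the type tree
// makes the whole type unorderable. (Illustrative ADT, not Spark's classes.)
sealed trait ToyType
case object ToyInt extends ToyType
case object ToyVariant extends ToyType
case class ToyStruct(fields: Seq[ToyType]) extends ToyType
case class ToyArray(element: ToyType) extends ToyType

def toyIsOrderable(dt: ToyType): Boolean = dt match {
  case ToyVariant    => false                     // the new rule: variant is never orderable
  case ToyInt        => true                      // atomic types are orderable
  case ToyStruct(fs) => fs.forall(toyIsOrderable) // orderable iff every field is
  case ToyArray(e)   => toyIsOrderable(e)         // orderable iff the element type is
}

// A variant nested inside a struct or array poisons the whole type:
assert(!toyIsOrderable(ToyStruct(Seq(ToyInt, ToyArray(ToyVariant)))))
assert(toyIsOrderable(ToyArray(ToyStruct(Seq(ToyInt)))))
```

The real `isOrderable` additionally accepts `NullType`, recurses into user-defined types, and falls back to `false` for anything else (such as maps); only the shape of the recursion is shown here.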
@@ -28,7 +28,7 @@
import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
import org.apache.spark.sql.catalyst.plans.logical.Aggregate
import org.apache.spark.sql.catalyst.util.{ArrayBasedMapData, CharVarcharUtils}
import org.apache.spark.sql.errors.{QueryCompilationErrors, QueryErrorsBase, QueryExecutionErrors}
-import org.apache.spark.sql.types.{DataType, MapType, StringType, StructType}
+import org.apache.spark.sql.types.{DataType, MapType, StringType, StructType, VariantType}
import org.apache.spark.unsafe.types.UTF8String

object ExprUtils extends QueryErrorsBase {
@@ -193,6 +193,15 @@
        messageParameters = Map("sqlExpr" -> expr.sql))
    }

    // Check if the data type of expr is orderable.
    if (expr.dataType.existsRecursively(_.isInstanceOf[VariantType])) {
      expr.failAnalysis(
        errorClass = "GROUP_EXPRESSION_TYPE_IS_NOT_ORDERABLE",
        messageParameters = Map(
          "sqlExpr" -> toSQLExpr(expr),
          "dataType" -> toSQLType(expr.dataType)))
    }

    if (!expr.deterministic) {
      // This is just a sanity check, our analysis rule PullOutNondeterministic should
      // already pull out those nondeterministic expressions and evaluate them in
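`existsRecursively` walks the entire type tree, so the guard rejects a grouping expression even when the variant is buried deep inside a struct or array. A toy sketch of that traversal (illustrative names only, not Spark's `DataType` API):

```scala
// Toy model of an existsRecursively-style check: does any type anywhere in
// the tree satisfy the predicate? (Illustrative ADT, not Spark's DataType.)
sealed trait ToyType {
  def existsRecursively(p: ToyType => Boolean): Boolean = p(this) || (this match {
    case ToyStruct(fs) => fs.exists(_.existsRecursively(p))
    case ToyArray(e)   => e.existsRecursively(p)
    case _             => false // leaf types: only the predicate on `this` matters
  })
}
case object ToyInt extends ToyType
case object ToyVariant extends ToyType
case class ToyStruct(fields: Seq[ToyType]) extends ToyType
case class ToyArray(element: ToyType) extends ToyType

// The grouping check fires when a variant appears anywhere in the type:
val nested: ToyType = ToyStruct(Seq(ToyInt, ToyArray(ToyVariant)))
assert(nested.existsRecursively(_ == ToyVariant))
assert(!ToyInt.existsRecursively(_ == ToyVariant))
```

This is why the guard above uses `existsRecursively(_.isInstanceOf[VariantType])` rather than checking only the top-level type of the grouping expression.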
29 changes: 29 additions & 0 deletions sql/core/src/test/scala/org/apache/spark/sql/VariantSuite.scala
@@ -251,4 +251,33 @@
      }
    }
  }

  test("group/order/join variant are disabled") {
**Contributor** commented:
Thanks for the tests!

What about sort by? or window partition by, or window order by? Some examples:

```sql
select parse_json('') sort by 1

with t as (select 1 as a, parse_json('') as v)
select rank() over (partition by v order by a)

with t as (select 1 as a, parse_json('') as v)
select rank() over (partition by a order by v)
```

**@chenhao-db** (Contributor, Author) replied on Mar 26, 2024:
This is a very good point. Actually, the second statement (partition by v) will not fail as expected, and it is really an error in the type check for Window. Both partitionSpec and orderSpec must be orderable, but the current type check only includes orderSpec. If the partitionSpec contains a map type, the query will fail later in an optimizer rule with INTERNAL_ERROR. I will create another PR to fix the type check for Window.

**@chenhao-db** (Contributor, Author):
If we can merge #45730 before this PR, we can add the second statement to the tests.

    var ex = intercept[AnalysisException] {
      spark.sql("select parse_json('') group by 1")
    }
    assert(ex.getErrorClass == "GROUP_EXPRESSION_TYPE_IS_NOT_ORDERABLE")

    ex = intercept[AnalysisException] {
      spark.sql("select parse_json('') order by 1")
    }
    assert(ex.getErrorClass == "DATATYPE_MISMATCH.INVALID_ORDERING_TYPE")

    ex = intercept[AnalysisException] {
      spark.sql("select parse_json('') sort by 1")
    }
    assert(ex.getErrorClass == "DATATYPE_MISMATCH.INVALID_ORDERING_TYPE")

    ex = intercept[AnalysisException] {
      spark.sql("with t as (select 1 as a, parse_json('') as v) " +
        "select rank() over (partition by a order by v) from t")
    }
    assert(ex.getErrorClass == "DATATYPE_MISMATCH.INVALID_ORDERING_TYPE")

    ex = intercept[AnalysisException] {
      spark.sql("with t as (select parse_json('') as v) " +
        "select t1.v from t as t1 join t as t2 on t1.v = t2.v")
    }
    assert(ex.getErrorClass == "DATATYPE_MISMATCH.INVALID_ORDERING_TYPE")
  }
}