[SPARK-49261][SQL] Don't replace literals in aggregate expressions with group-by expressions #47876
Conversation
Note this does result in a slight (but AFAIK harmless) oddity in the plan when the grouping expressions are included in the select statement, e.g.:
The aggregate operator will look like this:
In the list of grouping expressions,
withTempView("v1") {
  data.createOrReplaceTempView("v1")
  val df =
    sql("""select
nit. Could you capitalize SQL syntax, @bersprockets?
+1, LGTM.
cc @cloud-fan, @yaooqinn, @itholic (as a release manager of Apache Spark 3.5.3)
By the way, this is not a recent regression. I can reproduce it on 3.2.2. It's possible it's always been this way.
Merged to master/3.5/3.4.
late LGTM
late LGTM
What changes were proposed in this pull request?
Before this PR, `RewriteDistinctAggregates` could potentially replace literals in the aggregate expressions with output attributes from the `Expand` operator. This can occur when a group-by expression is a literal that happens by chance to match a literal used in an aggregate expression. E.g.:

```
create or replace temp view v1(a, b, c) as
values
  (1, 1.001d, 2),
  (2, 3.001d, 4),
  (2, 3.001, 4);

cache table v1;

select
  round(sum(b), 6) as sum1,
  count(distinct a) as count1,
  count(distinct c) as count2
from (
  select 6 as gb, * from v1
)
group by a, gb;
```

In the optimized plan, you can see that the literal 6 in the `round` function invocation has been patched with an output attribute (6#163) from the `Expand` operator:

```
== Optimized Logical Plan ==
'Aggregate [a#123, 6#163], [round(first(sum(__auto_generated_subquery_name.b)#167, true) FILTER (WHERE (gid#162 = 0)), 6#163) AS sum1#114, count(__auto_generated_subquery_name.a#164) FILTER (WHERE (gid#162 = 1)) AS count1#115L, count(__auto_generated_subquery_name.c#165) FILTER (WHERE (gid#162 = 2)) AS count2#116L]
+- Aggregate [a#123, 6#163, __auto_generated_subquery_name.a#164, __auto_generated_subquery_name.c#165, gid#162], [a#123, 6#163, __auto_generated_subquery_name.a#164, __auto_generated_subquery_name.c#165, gid#162, sum(__auto_generated_subquery_name.b#166) AS sum(__auto_generated_subquery_name.b)#167]
   +- Expand [[a#123, 6, null, null, 0, b#124], [a#123, 6, a#123, null, 1, null], [a#123, 6, null, c#125, 2, null]], [a#123, 6#163, __auto_generated_subquery_name.a#164, __auto_generated_subquery_name.c#165, gid#162, __auto_generated_subquery_name.b#166]
      +- InMemoryRelation [a#123, b#124, c#125], StorageLevel(disk, memory, deserialized, 1 replicas)
            +- LocalTableScan [a#6, b#7, c#8]
```

This is because the literal 6 was used in the group-by expressions (referred to as `gb` in the query, and renamed 6#163 in the `Expand` operator's output attributes).

After this PR, foldable expressions in the aggregate expressions are kept as-is.
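To make the intent concrete, here is a minimal sketch of the "keep foldable expressions as-is" idea, not the actual code change in `RewriteDistinctAggregates`; the name `patchAggregateExpression` and the shape of `groupByMap` are assumptions for illustration only:

```scala
import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression}

// Illustrative sketch. groupByMap is assumed to map each (canonicalized)
// group-by expression to the Expand output attribute that stands in for it.
// The !e.foldable guard is the key point: a constant such as the literal 6 used
// as round()'s scale argument is never replaced by an attribute like 6#163.
def patchAggregateExpression(
    expr: Expression,
    groupByMap: Map[Expression, Attribute]): Expression = {
  expr.transformDown {
    case e if !e.foldable && groupByMap.contains(e.canonicalized) =>
      groupByMap(e.canonicalized)
  }
}
```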
Why are the changes needed?
Some expressions require a foldable argument. In the above example, the `round` function requires a foldable expression as the scale argument. Because the scale argument is patched with an attribute, `RoundBase#checkInputDataTypes` returns an error, which leaves the `Aggregate` operator unresolved:

```
[INTERNAL_ERROR] Invalid call to dataType on unresolved object SQLSTATE: XX000
org.apache.spark.sql.catalyst.analysis.UnresolvedException: [INTERNAL_ERROR] Invalid call to dataType on unresolved object SQLSTATE: XX000
  at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:255)
  at org.apache.spark.sql.catalyst.types.DataTypeUtils$.$anonfun$fromAttributes$1(DataTypeUtils.scala:241)
  at scala.collection.immutable.List.map(List.scala:247)
  at scala.collection.immutable.List.map(List.scala:79)
  at org.apache.spark.sql.catalyst.types.DataTypeUtils$.fromAttributes(DataTypeUtils.scala:241)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.schema$lzycompute(QueryPlan.scala:428)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.schema(QueryPlan.scala:428)
  at org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:474)
  ...
```
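For reference, "foldable" is Catalyst's notion of an expression that can be evaluated to a constant at analysis time. A small illustration (assuming a Spark shell or test with spark-catalyst on the classpath):

```scala
import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Literal}
import org.apache.spark.sql.types.IntegerType

// A literal is foldable, so it is a valid scale argument for round()...
Literal(6).foldable                                // true
// ...but an attribute (like 6#163 produced by Expand) is not, which is why
// RoundBase#checkInputDataTypes fails once the literal has been patched.
AttributeReference("gb", IntegerType)().foldable   // false
```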
Does this PR introduce any user-facing change?
No.
How was this patch tested?
New tests.
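As a rough idea of what such a regression test might look like (a sketch only, assuming a `QueryTest`-style suite where `sql` and `withTempView` are available; the test name and assertion are not the actual ones added by this PR):

```scala
test("SPARK-49261: do not replace literals in aggregate expressions with group-by attributes") {
  withTempView("v1") {
    sql("""CREATE OR REPLACE TEMP VIEW v1(a, b, c) AS
          |VALUES (1, 1.001d, 2), (2, 3.001d, 4), (2, 3.001, 4)""".stripMargin)
    val df = sql(
      """SELECT round(sum(b), 6) AS sum1,
        |  count(DISTINCT a) AS count1,
        |  count(DISTINCT c) AS count2
        |FROM (SELECT 6 AS gb, * FROM v1)
        |GROUP BY a, gb""".stripMargin)
    // Before the fix, collecting this query failed with
    // [INTERNAL_ERROR] Invalid call to dataType on unresolved object.
    assert(df.collect().length == 2)  // one row per (a, gb) group
  }
}
```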
Was this patch authored or co-authored using generative AI tooling?
No.