[SPARK-47430][SQL] Support GROUP BY for MapType #45549

stevomitric · 2024-03-17T10:09:03Z

What changes were proposed in this pull request?

Changes proposed in this PR include:

- Relaxed the checks that prevented aggregating on map types
- Added a new analyzer rule that uses the `MapSort` expression proposed in #45639
- Created codegen that compares two sorted maps
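To make the last item concrete, here is a minimal Java sketch of ordering two maps whose entries are already sorted by key. It is not the code this PR generates: the class name `SortedMapCompare`, the `int` keys, and the non-null `String` values are assumptions chosen for brevity. Only the parallel key/value arrays and the `minLength` computation mirror what appears in the reviewed diff.

```java
/** Illustrative only: orders two maps whose entries are already sorted by key. */
final class SortedMapCompare {
    private SortedMapCompare() {}

    // Each map is modeled as parallel key/value arrays, similar in spirit to the
    // keyArray()/valueArray() pairs in the generated code. The element types
    // (int keys, non-null String values) are assumptions of this sketch.
    static int compare(int[] keysA, String[] valuesA, int[] keysB, String[] valuesB) {
        int lengthA = keysA.length;
        int lengthB = keysB.length;
        int minLength = (lengthA > lengthB) ? lengthB : lengthA;
        for (int i = 0; i < minLength; i++) {
            // Compare the i-th keys first, then the i-th values.
            int keyCmp = Integer.compare(keysA[i], keysB[i]);
            if (keyCmp != 0) {
                return keyCmp;
            }
            int valueCmp = valuesA[i].compareTo(valuesB[i]);
            if (valueCmp != 0) {
                return valueCmp;
            }
        }
        // All shared positions are equal, so the shorter map sorts first.
        return Integer.compare(lengthA, lengthB);
    }

    public static void main(String[] args) {
        int[] keys = {1, 2};
        String[] values = {"a", "b"};
        // Identical sorted entries compare as equal.
        System.out.println(compare(keys, values, keys, values)); // 0
    }
}
```

Because both inputs are sorted by key beforehand, two maps with the same entries in different insertion order end up with identical arrays and compare as equal, which is the property grouping needs.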

Why are the changes needed?

To add the ability to GROUP BY map-typed columns.

Does this PR introduce any user-facing change?

Yes. Queries can now use `GROUP BY` on `MapType` columns.

How was this patch tested?

With new unit tests.

Was this patch authored or co-authored using generative AI tooling?

No

cloud-fan · 2024-03-25T15:41:31Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala

+       |  ArrayData $keyArrayA = a.keyArray();
+       |  ArrayData $valueArrayA = a.valueArray();
+       |  ArrayData $keyArrayB = b.keyArray();
+       |  ArrayData $valueArrayB = b.valueArray();


Do the above four variables need to use `freshName`? They are just local variables in this method.

cloud-fan · 2024-03-25T15:41:43Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala

+       |  ArrayData $valueArrayA = a.valueArray();
+       |  ArrayData $keyArrayB = b.keyArray();
+       |  ArrayData $valueArrayB = b.valueArray();
+       |  int $minLength = (lengthA > lengthB) ? lengthB : lengthA;


cloud-fan · 2024-03-25T16:07:21Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

@@ -244,7 +244,9 @@ abstract class Optimizer(catalogManager: CatalogManager)
      RemoveRedundantAliases,
      RemoveNoopOperators) :+
    // This batch must be executed after the `RewriteSubquery` batch, which creates joins.
-    Batch("NormalizeFloatingNumbers", Once, NormalizeFloatingNumbers) :+
+    Batch("NormalizeFloatingNumbers", Once,
+      InsertMapSortInGroupingExpressions,


Can we create a new batch for this rule?

cloud-fan · 2024-03-25T16:09:03Z

sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala

@@ -2155,8 +2155,8 @@ class DataFrameAggregateSuite extends QueryTest
    )
  }

-  test("SPARK-46536 Support GROUP BY CalendarIntervalType") {
-    val numRows = 50
+  private def assertAggregateOnDataframe(dfSeq: Seq[DataFrame],


I'd rather test one DataFrame at a time and have the caller invoke `assertAggregateOnDataframe` multiple times.

cloud-fan · 2024-03-26T23:45:31Z

There are still test failures.

stevomitric · 2024-03-27T08:27:36Z

> there are still test failures

Build should be fixed now.

cloud-fan · 2024-03-27T10:54:31Z

thanks, merging to master!

### What changes were proposed in this pull request?

Added normalization of map keys when they are put in `ArrayBasedMapBuilder`.

### Why are the changes needed?

As map keys need to be unique, we need to add normalization on floating point numbers and prevent the following case when building a map: `Map(0.0, -0.0)`. This further unblocks GROUP BY statement for Map Types as per [this discussion](#45549 (comment)).

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

New UTs in `ArrayBasedMapBuilderSuite`

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #45721 from stevomitric/stevomitric/fix-map-dup.

Authored-by: Stevo Mitric <stevo.mitric@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
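The `Map(0.0, -0.0)` case in the key-normalization message above arises because `0.0` and `-0.0` are numerically equal but have different bit patterns, so a bitwise key check treats them as two distinct keys. The Java sketch below illustrates the idea. It is not Spark's implementation: `MapKeyNormalization`, `normalize`, and `sameBits` are hypothetical names, and the NaN handling is an added assumption that the message does not mention.

```java
/** Illustrative only: canonicalizes floating-point keys before uniqueness checks. */
final class MapKeyNormalization {
    private MapKeyNormalization() {}

    // Collapses -0.0 into 0.0. Mapping every NaN bit pattern to the canonical
    // NaN is an assumption of this sketch, not something the source states.
    static double normalize(double key) {
        if (key == 0.0d) {
            return 0.0d; // the comparison is true for both 0.0 and -0.0
        }
        if (Double.isNaN(key)) {
            return Double.NaN;
        }
        return key;
    }

    // Bitwise equality, which is what a binary key comparison effectively sees.
    static boolean sameBits(double a, double b) {
        return Double.doubleToRawLongBits(a) == Double.doubleToRawLongBits(b);
    }

    public static void main(String[] args) {
        System.out.println(sameBits(0.0d, -0.0d));                       // false
        System.out.println(sameBits(normalize(0.0d), normalize(-0.0d))); // true
    }
}
```

Normalizing keys as they enter the builder means duplicate detection sees one canonical value per key, so a later sort-and-compare step does not have to special-case signed zeros.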

### What changes were proposed in this pull request?

Changes proposed in this PR include:

- Relaxed checks that prevent aggregating of map types
- Added new analyzer rule that uses `MapSort` expression proposed in [this PR](apache#45639)
- Created codegen that compares two sorted maps

### Why are the changes needed?

Adding new functionality to GROUP BY map types

### Does this PR introduce _any_ user-facing change?

Yes, ability to use `GROUP BY MapType`

### How was this patch tested?

With new UTs

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#45549 from stevomitric/stevomitric/map-group-by.

Lead-authored-by: Stevo Mitric <stevo.mitric@databricks.com>
Co-authored-by: Stefan Kandic <stefan.kandic@databricks.com>
Co-authored-by: Stevo Mitric <stevomitric2000@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

stefankandic and others added 15 commits February 29, 2024 09:48

initial working version

a081649

add golden files

1441549

add map sort to other languages

1be06e3

fix typoes

249e903

fix scalastyle issue

aaae883

add proto golden files

acaf95e

fix python function call

5619fdb

fix ci errors

7754c14

fix ci checks

f0ebf5d

Optimized map-sort by switching to array sorting

1f78167

Potential tests fix

a5eb480

Potential tests fix 2

9497f99

Allowed group by expression with Maps

5e38220

replaced map data type with arrays in test

03a752d

Added codegen for map ordering

b80afed

github-actions bot added SQL DOCS PYTHON R CONNECT labels Mar 17, 2024

stevomitric and others added 10 commits March 17, 2024 12:25

Removed TODOs and changed parmIndex to ordinal

5e7a033

Shortened map sort function and added more docs

ab70f1e

updated map_sort test suite

e79d65c

Added map normalization and import cleanup

28d6f70

Update sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunction…

a435355

…sSuite.scala Co-authored-by: Maxim Gekk <max.gekk@gmail.com>

Update sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunction…

c9901d0

…sSuite.scala Co-authored-by: Maxim Gekk <max.gekk@gmail.com>

docs fix

da6a710

Updated codegen and removed once test-case

81008c2

Update python/pyspark/sql/functions/builtin.py

86b29c5

Co-authored-by: Ruifeng Zheng <ruifengz@foxmail.com>

Updated 'select.show' to give more info in map_sort desc

c08ab6c

stevomitric and others added 4 commits March 25, 2024 13:06

Update sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expr…

04d68cc

…essions/codegen/CodeGenerator.scala Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>

added scala stripMargin identation control

ebb3325

Replaced comparison in array with genCompElementsAt

185f7f1

Refactor optimizer rule for InsertMapSortInGroupingExpressions

3137f6a

stevomitric requested a review from cloud-fan March 25, 2024 12:39

cloud-fan reviewed Mar 25, 2024

cloud-fan reviewed Mar 25, 2024

stevomitric mentioned this pull request Mar 26, 2024

[SPARK-47563][SQL] Add map normalization on creation #45721

Closed

stevomitric and others added 3 commits March 26, 2024 11:22

Update sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expr…

14fdcd2

…essions/codegen/CodeGenerator.scala Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>

Refactored code-gen and separated optimizer rule in separate batch

7fe7b7e

refactored tests

8076045

stevomitric requested a review from cloud-fan March 26, 2024 10:38

cloud-fan approved these changes Mar 26, 2024

Regenerated sql-error-conditions.md

3c573e0

github-actions bot added the DOCS label Mar 26, 2024

Removed a test that checks for Map as an invalid grouping type

c6050c0

Fixed map-group-by test

3eac76c

stefankandic approved these changes Mar 27, 2024

cloud-fan closed this in d57164a Mar 27, 2024

chenhao-db mentioned this pull request Mar 29, 2024

[SPARK-47572][SQL] Enforce Window partitionSpec is orderable. #45730

Closed
