Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-47007][SQL] Add the MapSort expression #45639

Closed
wants to merge 28 commits into from

Conversation

stevomitric
Copy link
Contributor

@stevomitric stevomitric commented Mar 21, 2024

What changes were proposed in this pull request?

Added the new MapSort expression in CollationOperations alongside new UTs.

Why are the changes needed?

In order to add the ability to do GROUP BY on map types we first have to be able to sort the maps by their key

Does this PR introduce any user-facing change?

No

How was this patch tested?

New tests in CollectionExpressionSuite

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Mar 21, 2024
* ascendingOrder - A boolean value describing the order in which the map will be sorted.
This can be either be ascending (true) or descending (false).
""",
examples = """
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we probably don't need this part

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed ExpressionDescription.

| Object $o1 = (($simpleEntryType) $o1entry).getKey();
| Object $o2 = (($simpleEntryType) $o2entry).getKey();
| $comp;
| return $order ? $c : -$c;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry for just seeing this, but maybe we should do here the same thing that ArraySort does which is put the ordering into a variable outside of compare and just multiply it with the result?

this way we avoid branching in every comparison

Copy link
Contributor

@stefankandic stefankandic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM if tests pass!

@@ -888,6 +888,158 @@ case class MapFromEntries(child: Expression)
copy(child = newChild)
}

case class MapSort(base: Expression, ascendingOrder: Expression)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
case class MapSort(base: Expression, ascendingOrder: Expression)
case class MapSort(base: Expression, ascendingOrder: Boolean)

Copy link
Contributor Author

@stevomitric stevomitric Mar 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't the BinaryExpression require two expressions here? Do we demote this to UnaryExpression?

EDIT: Expression for ascendingOrder in array sorting has been set as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's is the internal use-cases for the expression? Do we need this parameter at all?

Seems like you are going to pass true as ascendingOrder always at
https://github.com/apache/spark/pull/45549/files#diff-11264d807efa58054cca2d220aae8fba644ee0f0f2a4722c46d52828394846efR2488

   case a @ Aggregate(groupingExpr, x, b) =>
      val newGrouping = groupingExpr.map { expr =>
        (expr, expr.dataType) match {
          case (_: MapSort, _) => expr
          case (_, _: MapType) =>
            MapSort(expr, Literal.TrueLiteral)
          case _ => expr

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the point of internal use, we don't need it. Refactored expression as UnaryExpression and removed ordering altogether.

|""".stripMargin
}

override def prettyName: String = "map_sort"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this since the expression hasn't been bound to the function name.

@@ -888,6 +888,158 @@ case class MapFromEntries(child: Expression)
copy(child = newChild)
}

case class MapSort(base: Expression, ascendingOrder: Expression)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's is the internal use-cases for the expression? Do we need this parameter at all?

Seems like you are going to pass true as ascendingOrder always at
https://github.com/apache/spark/pull/45549/files#diff-11264d807efa58054cca2d220aae8fba644ee0f0f2a4722c46d52828394846efR2488

   case a @ Aggregate(groupingExpr, x, b) =>
      val newGrouping = groupingExpr.map { expr =>
        (expr, expr.dataType) match {
          case (_: MapSort, _) => expr
          case (_, _: MapType) =>
            MapSort(expr, Literal.TrueLiteral)
          case _ => expr

Comment on lines 906 to 907
case Literal(_: Boolean, BooleanType) =>
TypeCheckResult.TypeCheckSuccess
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have to be so strict on this argument? For example, could you imagine a case where you want to select the sort order based on the values in another column or the result of an expression? Is this needlessly restrictive?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, could you imagine a case where you want to select the sort order based on the values in another column or the result of an expression?

I can imagine the case but so far we are going to use the expression internally for one case only. Support of ascendingOrder = false or even an arbitrary boolean expression just overcomplicates the code.

@MaxGekk MaxGekk changed the title [SPARK-47007] Added MapSort expression [SPARK-47007][SQL] Add the MapSort expression Mar 22, 2024
@MaxGekk
Copy link
Member

MaxGekk commented Mar 22, 2024

+1, LGTM. Merging to master.
Thank you, @stevomitric and @stefankandic @cloud-fan @markj-db for review.

@MaxGekk MaxGekk closed this in c94090e Mar 22, 2024
cloud-fan pushed a commit that referenced this pull request Mar 27, 2024
### What changes were proposed in this pull request?
Changes proposed in this PR include:
- Relaxed checks that prevent aggregating of map types
- Added new analyzer rule that uses `MapSort` expression proposed in [this PR](#45639)
- Created codegen that compares two sorted maps

### Why are the changes needed?
Adding new functionality to GROUP BY map types

### Does this PR introduce _any_ user-facing change?
Yes, ability to use `GROUP BY MapType`

### How was this patch tested?
With new UTs

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #45549 from stevomitric/stevomitric/map-group-by.

Lead-authored-by: Stevo Mitric <stevo.mitric@databricks.com>
Co-authored-by: Stefan Kandic <stefan.kandic@databricks.com>
Co-authored-by: Stevo Mitric <stevomitric2000@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
sweisdb pushed a commit to sweisdb/spark that referenced this pull request Apr 1, 2024
### What changes were proposed in this pull request?
Added the new `MapSort` expression in `CollationOperations` alongside new UTs.

### Why are the changes needed?
In order to add the ability to do GROUP BY on map types we first have to be able to sort the maps by their key

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
New tests in `CollectionExpressionSuite`

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#45639 from stevomitric/stevomitric/map-expr.

Lead-authored-by: Stevo Mitric <stevo.mitric@databricks.com>
Co-authored-by: Stefan Kandic <stefan.kandic@databricks.com>
Co-authored-by: Stevo Mitric <stevomitric2000@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
sweisdb pushed a commit to sweisdb/spark that referenced this pull request Apr 1, 2024
### What changes were proposed in this pull request?
Changes proposed in this PR include:
- Relaxed checks that prevent aggregating of map types
- Added new analyzer rule that uses `MapSort` expression proposed in [this PR](apache#45639)
- Created codegen that compares two sorted maps

### Why are the changes needed?
Adding new functionality to GROUP BY map types

### Does this PR introduce _any_ user-facing change?
Yes, ability to use `GROUP BY MapType`

### How was this patch tested?
With new UTs

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#45549 from stevomitric/stevomitric/map-group-by.

Lead-authored-by: Stevo Mitric <stevo.mitric@databricks.com>
Co-authored-by: Stefan Kandic <stefan.kandic@databricks.com>
Co-authored-by: Stevo Mitric <stevomitric2000@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants