Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-14830][SQL] Add RemoveRepetitionFromGroupExpressions optimizer. #12590

Closed
wants to merge 3 commits into from
Closed

[SPARK-14830][SQL] Add RemoveRepetitionFromGroupExpressions optimizer. #12590

wants to merge 3 commits into from

Conversation

dongjoon-hyun
Copy link
Member

@dongjoon-hyun dongjoon-hyun commented Apr 22, 2016

What changes were proposed in this pull request?

This PR aims to optimize GroupExpressions by removing repeating expressions. RemoveRepetitionFromGroupExpressions is added.

Before

scala> sql("select a+1 from values 1,2 T(a) group by a+1, 1+a, A+1, 1+A").explain()
== Physical Plan ==
WholeStageCodegen
:  +- TungstenAggregate(key=[(a#0 + 1)#6,(1 + a#0)#7,(A#0 + 1)#8,(1 + A#0)#9], functions=[], output=[(a + 1)#5])
:     +- INPUT
+- Exchange hashpartitioning((a#0 + 1)#6, (1 + a#0)#7, (A#0 + 1)#8, (1 + A#0)#9, 200), None
   +- WholeStageCodegen
      :  +- TungstenAggregate(key=[(a#0 + 1) AS (a#0 + 1)#6,(1 + a#0) AS (1 + a#0)#7,(A#0 + 1) AS (A#0 + 1)#8,(1 + A#0) AS (1 + A#0)#9], functions=[], output=[(a#0 + 1)#6,(1 + a#0)#7,(A#0 + 1)#8,(1 + A#0)#9])
      :     +- INPUT
      +- LocalTableScan [a#0], [[1],[2]]

After

scala> sql("select a+1 from values 1,2 T(a) group by a+1, 1+a, A+1, 1+A").explain()
== Physical Plan ==
WholeStageCodegen
:  +- TungstenAggregate(key=[(a#0 + 1)#6], functions=[], output=[(a + 1)#5])
:     +- INPUT
+- Exchange hashpartitioning((a#0 + 1)#6, 200), None
   +- WholeStageCodegen
      :  +- TungstenAggregate(key=[(a#0 + 1) AS (a#0 + 1)#6], functions=[], output=[(a#0 + 1)#6])
      :     +- INPUT
      +- LocalTableScan [a#0], [[1],[2]]

How was this patch tested?

Pass the Jenkins tests (with a new testcase)

@SparkQA
Copy link

SparkQA commented Apr 22, 2016

Test build #56613 has finished for PR 12590 at commit 75b8f73.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

Rebased.

@SparkQA
Copy link

SparkQA commented Apr 23, 2016

Test build #56772 has finished for PR 12590 at commit bda4ae6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

Hi, @rxin .
Could you review this PR about RemoveRepetitionFromGroupExpressions?

@dongjoon-hyun
Copy link
Member Author

Rebased.

@SparkQA
Copy link

SparkQA commented Apr 25, 2016

Test build #56891 has finished for PR 12590 at commit 78a42c5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

Hi, @rxin .
Could you review this PR please?

@SparkQA
Copy link

SparkQA commented Apr 26, 2016

Test build #56970 has finished for PR 12590 at commit 01ce1a4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

Hi, @marmbrus .
Could you review this PR when you have some time?

@SparkQA
Copy link

SparkQA commented Apr 28, 2016

Test build #57274 has finished for PR 12590 at commit e3bdc16.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 30, 2016

Test build #57411 has finished for PR 12590 at commit d16ebc2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

object RemoveRepetitionFromGroupExpressions extends Rule[LogicalPlan] {
def apply(plan: LogicalPlan): LogicalPlan = plan transform {
case a @ Aggregate(grouping, _, _) =>
val newGrouping = grouping.distinct
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going to miss cases like GROUP BY A, a. I think you want to use an ExpressionSet instead of normal .distinct

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, thank you for review. @marmbrus .
I'll fix that.

@SparkQA
Copy link

SparkQA commented May 2, 2016

Test build #57540 has finished for PR 12590 at commit 2198f0f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

@marmbrus . Now, it's ready for review again.
This PR becomes to do much better than what I expected.
Thank you so much.

@marmbrus
Copy link
Contributor

marmbrus commented May 2, 2016

Thanks, merging to master and 2.0

asfgit pushed a commit that referenced this pull request May 2, 2016
## What changes were proposed in this pull request?

This PR aims to optimize GroupExpressions by removing repeating expressions. `RemoveRepetitionFromGroupExpressions` is added.

**Before**
```scala
scala> sql("select a+1 from values 1,2 T(a) group by a+1, 1+a, A+1, 1+A").explain()
== Physical Plan ==
WholeStageCodegen
:  +- TungstenAggregate(key=[(a#0 + 1)#6,(1 + a#0)#7,(A#0 + 1)#8,(1 + A#0)#9], functions=[], output=[(a + 1)#5])
:     +- INPUT
+- Exchange hashpartitioning((a#0 + 1)#6, (1 + a#0)#7, (A#0 + 1)#8, (1 + A#0)#9, 200), None
   +- WholeStageCodegen
      :  +- TungstenAggregate(key=[(a#0 + 1) AS (a#0 + 1)#6,(1 + a#0) AS (1 + a#0)#7,(A#0 + 1) AS (A#0 + 1)#8,(1 + A#0) AS (1 + A#0)#9], functions=[], output=[(a#0 + 1)#6,(1 + a#0)#7,(A#0 + 1)#8,(1 + A#0)#9])
      :     +- INPUT
      +- LocalTableScan [a#0], [[1],[2]]
```

**After**
```scala
scala> sql("select a+1 from values 1,2 T(a) group by a+1, 1+a, A+1, 1+A").explain()
== Physical Plan ==
WholeStageCodegen
:  +- TungstenAggregate(key=[(a#0 + 1)#6], functions=[], output=[(a + 1)#5])
:     +- INPUT
+- Exchange hashpartitioning((a#0 + 1)#6, 200), None
   +- WholeStageCodegen
      :  +- TungstenAggregate(key=[(a#0 + 1) AS (a#0 + 1)#6], functions=[], output=[(a#0 + 1)#6])
      :     +- INPUT
      +- LocalTableScan [a#0], [[1],[2]]
```

## How was this patch tested?

Pass the Jenkins tests (with a new testcase)

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12590 from dongjoon-hyun/SPARK-14830.

(cherry picked from commit 6e63201)
Signed-off-by: Michael Armbrust <michael@databricks.com>
@asfgit asfgit closed this in 6e63201 May 2, 2016
@dongjoon-hyun
Copy link
Member Author

Thank you, @marmbrus !

@dongjoon-hyun dongjoon-hyun deleted the SPARK-14830 branch May 12, 2016 01:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants