[SPARK-2213] [SQL] sort merge join for spark sql #5208

adrian-wang · 2015-03-26T10:06:36Z

Thanks for the initial work from @Ishiihara in #3173

This PR introduces sort merge join as a new join method. It first ensures that rows with the same key value land in the same partition and that the Rows inside each partition are sorted by key. We can then walk down both sides together and find matching rows by merging. Unlike hash join, this does not require storing a hash table for one whole side, so it uses less memory. This PR would also benefit from #3438, which makes the sorting phase much more efficient.
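To illustrate the merge step described above, here is a minimal standalone sketch in plain Scala. It is not the `SortMergeJoin` operator from this patch: the object name `SortMergeJoinSketch`, the method signature, and the in-memory `IndexedSeq` inputs are all assumptions made for illustration, and both inputs are assumed to be already sorted by key, as they would be within one partition.

```scala
// Hypothetical, self-contained sketch of an inner equi-join by merging.
// Not the operator added in this PR; names and types are illustrative only.
object SortMergeJoinSketch {

  /** Inner equi-join of two inputs that are ALREADY sorted by key. */
  def sortMergeJoin[K, A, B](
      left: IndexedSeq[(K, A)],
      right: IndexedSeq[(K, B)])(implicit ord: Ordering[K]): Vector[(K, A, B)] = {
    val out = Vector.newBuilder[(K, A, B)]
    var i = 0 // cursor into the left side
    var j = 0 // cursor into the right side
    while (i < left.length && j < right.length) {
      val cmp = ord.compare(left(i)._1, right(j)._1)
      if (cmp < 0) {
        i += 1 // left key is smaller: it has no match, advance left
      } else if (cmp > 0) {
        j += 1 // right key is smaller: it has no match, advance right
      } else {
        val key = left(i)._1
        // Find the end of the run of right rows that share this key.
        var jEnd = j
        while (jEnd < right.length && ord.equiv(right(jEnd)._1, key)) jEnd += 1
        // Pair every left row of this key with every right row in the run.
        while (i < left.length && ord.equiv(left(i)._1, key)) {
          var k = j
          while (k < jEnd) {
            out += ((key, left(i)._2, right(k)._2))
            k += 1
          }
          i += 1
        }
        j = jEnd // the whole run on the right is consumed
      }
    }
    out.result()
  }
}
```

With `left = Vector((1, "a"), (2, "b"), (2, "c"), (4, "d"))` and `right = Vector((2, "x"), (2, "y"), (3, "z"), (4, "w"))`, the sketch yields the four combinations for key 2 followed by the single match for key 4. Only the run of rows for the current key has to be revisited, which is the reason no hash table over one whole side is needed.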

We introduced a new configuration, "spark.sql.planner.sortMergeJoin", to switch between this join (true) and ShuffledHashJoin (false). We probably want its default value to be false at first.

SparkQA · 2015-03-26T10:08:13Z

Test build #29224 has started for PR 5208 at commit b87df90.

This patch merges cleanly.

SparkQA · 2015-03-26T10:28:59Z

Test build #29224 has finished for PR 5208 at commit b87df90.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class ClusteredOrderedDistribution(clustering: Seq[Expression])
- case class HashSortedPartitioning(expressions: Seq[Expression], numPartitions: Int)
- case class SortMergeJoin(

AmplabJenkins · 2015-03-26T10:29:02Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29224/
Test FAILed.

SparkQA · 2015-03-30T04:18:16Z

Test build #29382 has started for PR 5208 at commit cb1e18d.

SparkQA · 2015-03-30T04:19:41Z

Test build #29382 has finished for PR 5208 at commit cb1e18d.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class ClusteredOrderedDistribution(clustering: Seq[Expression])
- case class HashSortedPartitioning(expressions: Seq[Expression], numPartitions: Int)
- case class SortMergeJoin(
This patch does not change any dependencies.

AmplabJenkins · 2015-03-30T04:19:42Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29382/
Test FAILed.

SparkQA · 2015-03-30T04:28:19Z

Test build #29383 has started for PR 5208 at commit 6df9f01.

SparkQA · 2015-03-30T05:14:20Z

Test build #29383 has finished for PR 5208 at commit 6df9f01.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class ClusteredOrderedDistribution(clustering: Seq[Expression])
- case class HashSortedPartitioning(expressions: Seq[Expression], numPartitions: Int)
- case class SortMergeJoin(
This patch does not change any dependencies.

AmplabJenkins · 2015-03-30T05:14:24Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29383/
Test FAILed.

SparkQA · 2015-04-01T07:38:22Z

Test build #29530 has started for PR 5208 at commit d7bfe07.

SparkQA · 2015-04-01T07:39:19Z

Test build #29530 has finished for PR 5208 at commit d7bfe07.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class ClusteredOrderedDistribution(clustering: Seq[Expression])
- case class HashSortedPartitioning(expressions: Seq[Expression], numPartitions: Int)
- case class SortMergeJoin(
This patch does not change any dependencies.

AmplabJenkins · 2015-04-01T07:39:20Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29530/
Test FAILed.

SparkQA · 2015-04-01T07:53:18Z

Test build #29532 has started for PR 5208 at commit c34c96e.

SparkQA · 2015-04-01T08:09:19Z

Test build #29532 has finished for PR 5208 at commit c34c96e.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class ClusteredOrderedDistribution(clustering: Seq[Expression])
- case class HashSortedPartitioning(expressions: Seq[Expression], numPartitions: Int)
- case class SortMergeJoin(
This patch does not change any dependencies.

AmplabJenkins · 2015-04-01T08:09:22Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29532/
Test FAILed.

SparkQA · 2015-04-01T08:38:23Z

Test build #29533 has started for PR 5208 at commit f5f81db.

SparkQA · 2015-04-01T08:54:10Z

Test build #29533 has finished for PR 5208 at commit f5f81db.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class ClusteredOrderedDistribution(clustering: Seq[Expression])
- case class HashSortedPartitioning(expressions: Seq[Expression], numPartitions: Int)
- case class SortMergeJoin(
This patch does not change any dependencies.

AmplabJenkins · 2015-04-01T08:54:13Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29533/
Test FAILed.

adrian-wang · 2015-04-01T09:05:05Z

I am not getting this error locally... what's wrong?

adrian-wang · 2015-04-01T09:25:34Z

This exception only occurs on the current master. I didn't see it locally because I was working from a March 26 master. It could be a bug that was introduced during this period.

cc @chenghao-intel

chenghao-intel · 2015-04-01T15:47:22Z

From the log, it seems the output fields of the PhysicalRDD changed order. Can you rebase against the latest code and try again locally?

== Physical Plan ==
Project [b#2957,a#2959]
 SortMergeJoin [a#2956], [b#2960], Inner
  Exchange (HashSortedPartitioning [a#2956], 200)
   PhysicalRDD [b#2957,a#2956], MapPartitionsRDD[1584] at map at FilteredScanSuite.scala:85
  Exchange (HashSortedPartitioning [b#2960], 200)
   PhysicalRDD [a#2959,b#2960], MapPartitionsRDD[1587] at map at FilteredScanSuite.scala:85

adrian-wang · 2015-04-01T15:50:37Z

Yes, after rebasing I can see this exception.

SparkQA · 2015-04-02T03:18:34Z

Test build #29584 has started for PR 5208 at commit 7a869c5.

SparkQA · 2015-04-02T04:43:40Z

Test build #29584 has finished for PR 5208 at commit 7a869c5.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class ClusteredOrderedDistribution(clustering: Seq[Expression])
- case class HashSortedPartitioning(expressions: Seq[Expression], numPartitions: Int)
- case class SortMergeJoin(
This patch does not change any dependencies.

AmplabJenkins · 2015-04-02T04:43:43Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29584/
Test PASSed.

adrian-wang · 2015-04-02T04:57:58Z

cc @marmbrus @liancheng @yhuai @chenghao-intel

chenghao-intel · 2015-04-02T06:22:25Z

sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala

+   * By default it will choose sort merge join.
+   */
+  private[spark] def autoSortMergeJoin: Boolean =
+    getConf(AUTO_SORTMERGEJOIN, true.toString).toBoolean


Let's make it false by default; SMJ should be an experimental feature.

OK, just use true for Jenkins testing.

yhuai · 2015-04-15T05:52:06Z

sql/core/src/main/scala/org/apache/spark/sql/execution/Exchange.scala


-      if (meetsRequirements && compatible) {
+          val withSort = if (needSort) {
+            Sort(rowOrdering, global = false, withShuffle)


As we do in SparkStrategies, use execution.ExternalSort when sqlContext.conf.externalSortEnabled is true.

SparkQA · 2015-04-15T05:53:34Z

Test build #30309 has started for PR 5208 at commit f515cd2.

yhuai · 2015-04-15T05:53:42Z

sql/core/src/main/scala/org/apache/spark/sql/execution/Exchange.scala

+          case (UnspecifiedDistribution, Seq(), child) =>
+            child
+          case (UnspecifiedDistribution, rowOrdering, child) =>
+            Sort(rowOrdering, global = false, child)


Use execution.ExternalSort when sqlContext.conf.externalSortEnabled is true.

SparkQA · 2015-04-15T06:12:49Z

Test build #30315 has started for PR 5208 at commit f91a2ae.

SparkQA · 2015-04-15T06:16:57Z

Test build #30309 has finished for PR 5208 at commit f515cd2.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Exchange(
- case class SortMergeJoin(
This patch adds the following new dependencies:
- snappy-java-1.1.1.7.jar
This patch removes the following dependencies:
- snappy-java-1.1.1.6.jar

AmplabJenkins · 2015-04-15T06:16:58Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30309/
Test FAILed.

adrian-wang · 2015-04-15T06:22:37Z

sql/core/src/main/scala/org/apache/spark/sql/execution/Exchange.scala

@@ -87,7 +126,12 @@ case class Exchange(newPartitioning: Partitioning, child: SparkPlan) extends Una
        implicit val ordering = new RowOrdering(sortingExpressions, child.output)


maybe this line is redundant?

oh, I see... For RangePartitioner..

SparkQA · 2015-04-15T06:36:54Z

Test build #30315 has finished for PR 5208 at commit f91a2ae.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Exchange(
- case class SortMergeJoin(
This patch adds the following new dependencies:
- commons-math3-3.4.1.jar
- snappy-java-1.1.1.7.jar
This patch removes the following dependencies:
- commons-math3-3.1.1.jar
- snappy-java-1.1.1.6.jar

AmplabJenkins · 2015-04-15T06:36:56Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30315/
Test FAILed.

SparkQA · 2015-04-15T06:37:37Z

Test build #30319 has started for PR 5208 at commit 5049d88.

SparkQA · 2015-04-15T06:39:08Z

Test build #30319 has finished for PR 5208 at commit 5049d88.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Exchange(
- case class SortMergeJoin(
This patch does not change any dependencies.

AmplabJenkins · 2015-04-15T06:39:09Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30319/
Test FAILed.

SparkQA · 2015-04-15T06:42:38Z

Test build #30321 has started for PR 5208 at commit 2493b9f.

SparkQA · 2015-04-15T06:57:11Z

Test build #30304 has finished for PR 5208 at commit ec8061b.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Exchange(
- case class SortMergeJoin(
This patch does not change any dependencies.

AmplabJenkins · 2015-04-15T06:57:15Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30304/
Test PASSed.

SparkQA · 2015-04-15T08:23:15Z

Test build #30321 has finished for PR 5208 at commit 2493b9f.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Exchange(
- case class SortMergeJoin(
This patch does not change any dependencies.

AmplabJenkins · 2015-04-15T08:23:19Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30321/
Test PASSed.

marmbrus · 2015-04-15T21:07:05Z

I manually fixed the conflicts while merging to master. Thanks! I'm excited to test out the performance of this new feature :)

adrian-wang · 2015-04-16T01:35:02Z

Thanks!

justmytwospence · 2015-06-15T23:19:08Z

Is this feature limited to equi-joins?

adrian-wang · 2015-06-16T02:12:21Z

@justmytwospence yes.

adrian-wang force-pushed the smj branch 2 times, most recently from 9220280 to cb1e18d Compare March 30, 2015 04:15

adrian-wang force-pushed the smj branch from f5f81db to 7a869c5 Compare April 2, 2015 03:16

chenghao-intel reviewed Apr 2, 2015

yhuai reviewed Apr 15, 2015

yin's comment: use external sort if option is enabled, add comments

f91a2ae

adrian-wang reviewed Apr 15, 2015

propagate rowOrdering for RangePartitioning

5049d88

fix style

2493b9f

asfgit closed this in 585638e Apr 15, 2015

adrian-wang mentioned this pull request Apr 27, 2015

[SPARK-7165] [SQL] use sort merge join for outer join #5717

Closed

JoshRosen mentioned this pull request Aug 3, 2015

[SPARK-9729] [SPARK-9363] [SQL] Use sort merge join for left and right outer join #7904

Closed
