[SPARK-9729] [SPARK-9363] [SQL] Use sort merge join for left and right outer join #7904

JoshRosen · 2015-08-03T21:11:20Z

This patch adds a new SortMergeOuterJoin operator that performs left and right outer joins using sort merge join. It also refactors SortMergeJoin in order to improve performance and code clarity.

Along the way, I also performed a couple of pieces of minor cleanup and optimization:

- Rename the HashJoin physical planner rule to EquiJoinSelection, since it's also used for non-hash joins.
- Rewrite the comment at the top of HashJoin to better explain the precedence for choosing join operators.
- Update JoinSuite to use SqlTestUtils.withConf for changing SQLConf settings.

This patch incorporates several ideas from @adrian-wang's patch, #5717.

Closes #5717.

JoshRosen · 2015-08-04T04:28:19Z

Current plan is to have separate iterators for left, right and full outer join, with some possible code-reuse / sharing of the iterators defined in the HashOuterJoin trait (I'll move them elsewhere). The key idea here is that once you've constructed the buffer for half of the left outer join then it doesn't really matter whether that buffer came from a hash map or was built up by scanning over the other sorted input. This should substantially reduce code complexity and will make it easier to spot the functionality which is only used for full outer join.
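
To make the key idea concrete, here is a minimal sketch using plain Scala collections rather than Spark's InternalRow or the actual iterators in the HashOuterJoin trait. The names `leftOuterRows`, `fromHashMap` and `fromSortedRun` are hypothetical and exist only for illustration. The per-row output logic takes the streamed left row plus a buffer of candidate right rows, and it does not care which of the two sources produced that buffer.

```scala
object LeftOuterSketch {
  type Row = Seq[Any]

  /** Emits the output rows for one streamed (left) row, given the right rows that share its key.
    * If no buffered row passes the extra condition, emits the left row padded with nulls. */
  def leftOuterRows(
      left: Row,
      buffered: Seq[Row],
      rightWidth: Int,
      condition: (Row, Row) => Boolean = (_, _) => true): Seq[Row] = {
    val matches = buffered.filter(r => condition(left, r)).map(r => left ++ r)
    if (matches.nonEmpty) matches else Seq(left ++ Seq.fill(rightWidth)(null))
  }

  // Buffer source 1: look the key up in a hash map built from the right side.
  def fromHashMap(table: Map[Any, Seq[Row]], key: Any): Seq[Row] =
    table.getOrElse(key, Seq.empty)

  // Buffer source 2: take the run of equal keys from a right side sorted by key (column 0).
  // A real sort merge join would advance a cursor using the key ordering instead of rescanning.
  def fromSortedRun(sortedRight: Seq[Row], key: Any): Seq[Row] =
    sortedRight.dropWhile(_.head != key).takeWhile(_.head == key)
}
```

Under these assumptions, only full outer join would need extra state (tracking which buffered rows were never matched), which is why it is easier to isolate.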

I'll work on implementing this design tomorrow.

JoshRosen · 2015-08-04T22:54:03Z

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeOuterJoin.scala

+    leftResults.zipPartitions(rightResults) { (leftIter, rightIter) =>
+      joinType match {
+        case LeftOuter =>
+          // TODO(josh): for SMJ we would buffer keys here:


For now, this class just copies ShuffledHashJoin; I'm going to edit it now to take advantage of the fact that the inputs are sorted.

JoshRosen · 2015-08-04T23:01:41Z

I think that this could also benefit from randomized agreement tests, using SparkPlanTest. I'll look into adding a new OuterJoinSuite to do this (this could also be done as a followup during the QA period).

JoshRosen · 2015-08-04T23:28:02Z

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoin.scala

@@ -61,6 +61,7 @@ case class SortMergeJoin(
    keys.map(SortOrder(_, Ascending))

  protected override def doExecute(): RDD[InternalRow] = {
+    // TODO(josh): why is this copying necessary?
    val leftResults = left.execute().map(_.copy())


I noticed that SortMergeJoin has this defensive copying on both inputs. I think that this is overly conservative: we should only need to copy UnsafeRows that might be buffered, and we should be able to perform that copying at the last possible moment, when inserting the rows into the buffers. This means that the stream side of a left or right outer join should not need to be copied.
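
As a rough illustration of why the copy only matters for buffered rows, here is a self-contained sketch. `MutableRow` is a hypothetical stand-in for a row object that an upstream iterator reuses for every record; it is not Spark's UnsafeRow, and the helper names are made up for this example.

```scala
import scala.collection.mutable.ArrayBuffer

object CopyOnBufferSketch {
  /** Stand-in for a mutable row that an upstream iterator reuses for every record. */
  final class MutableRow(var value: Int) {
    def copy(): MutableRow = new MutableRow(value)
  }

  /** Iterator that overwrites and returns the same row object on each call to next(). */
  def reusingIterator(values: Seq[Int]): Iterator[MutableRow] = {
    val shared = new MutableRow(0)
    values.iterator.map { v => shared.value = v; shared }
  }

  /** Buffers rows, copying at insertion time only if `copyOnInsert` is set. */
  def buffer(rows: Iterator[MutableRow], copyOnInsert: Boolean): List[Int] = {
    val buf = ArrayBuffer.empty[MutableRow]
    rows.foreach(r => buf += (if (copyOnInsert) r.copy() else r))
    buf.map(_.value).toList
  }

  /** Streaming consumption reads each row before advancing, so no copy is needed. */
  def stream(rows: Iterator[MutableRow]): List[Int] = rows.map(_.value).toList
}
```

Buffering without a copy leaves every buffered entry pointing at the last row seen, while the streamed side reads each row before the iterator advances and so stays correct without copying.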

JoshRosen · 2015-08-05T09:12:19Z

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/OuterJoin.scala

  }

  protected[this] def isUnsafeMode: Boolean = {
+    // TODO(josh): there is an existing bug here: this should also check whether unsafe mode
+    // is enabled. also, the default for self.codegenEnabled looks inconsistent to me.
    (self.codegenEnabled && joinType != FullOuter


Pretty sure there's a bug here (see above comment): if unsafe is disabled then we should never generate unsafe projections.

This will be addressed by @davies' patch to consolidate the Unsafe and Codegen configurations.


SparkQA · 2015-08-09T02:51:45Z

Test build #40252 has finished for PR 7904 at commit f701652.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

JoshRosen · 2015-08-09T05:29:41Z

Alright, looks like this is passing tests even with the defensive copying disabled, so I'm going to go ahead and clean up this patch to get it into a merge-ready state.

SparkQA · 2015-08-09T09:41:17Z

Test build #40266 has finished for PR 7904 at commit f83b412.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

JoshRosen · 2015-08-10T20:33:24Z

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeOuterJoin.scala

+
+  override def outputOrdering: Seq[SortOrder] = joinType match {
+    // For left and right outer joins, the output is ordered by the streamed input's join keys.
+    case LeftOuter => requiredOrders(leftKeys)


Quick question about this, actually: if the join keys contain nulls then a left or right outer join may output rows with null join keys. Does this have any impact on the outputOrdering (e.g. is it safe to say that it's still ordered by the left keys if those columns are nullable in the output)? Presumably this is safe, since those nulls were also ordered in the input, but I just want to confirm. @yhuai?

Yeah, we can say output rows are ordered by the left keys (as long as our sort operator groups them together). Rows with null right keys will not be grouped.
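
A small sketch of the ordering argument, under simplifying assumptions: keys are `Option[Int]` with `None` modelling a SQL NULL, both inputs are already sorted with `None` first, and `join` is a hypothetical helper rather than the operator in this patch. Because the left side is streamed in order and each left row produces its output rows contiguously, the output keeps the left key order even when null keys produce unmatched rows.

```scala
object SortMergeLeftOuterSketch {
  type Key = Option[Int] // None models a SQL NULL join key

  /** Left outer join of two inputs that are each sorted by key (None first).
    * NULL keys never match, so left rows with a None key get a None right side. */
  def join(
      left: Seq[(Key, String)],
      right: Seq[(Key, String)]): Seq[(Key, String, Option[String])] = {
    val ord = Ordering[Key]
    val out = Seq.newBuilder[(Key, String, Option[String])]
    var r = 0
    for ((k, lv) <- left) {
      // Advance the right side past keys smaller than the current left key.
      while (r < right.length && ord.lt(right(r)._1, k)) r += 1
      // Collect the run of right rows with an equal, non-null key.
      val matches =
        if (k.isEmpty) Seq.empty
        else right.drop(r).takeWhile(_._1 == k)
      if (matches.isEmpty) out += ((k, lv, None))
      else matches.foreach { case (_, rv) => out += ((k, lv, Some(rv))) }
    }
    out.result()
  }
}
```

The right cursor is not moved past a matched run, so consecutive left rows with the same key see the same matches.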

JoshRosen · 2015-08-10T22:05:36Z

I've pulled #7985 into this patch in order to improve the unit test coverage of our physical join operators, including this new SMJ outer join.

yhuai · 2015-08-10T23:27:36Z

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeOuterJoin.scala

+
+  private def advanceRightUntilBoundConditionSatisfied(): Boolean = {
+    var foundMatch: Boolean = false
+    if (!foundMatch && rightIdx < smjScanner.getBufferedMatches.length) {


Looks like we need to use a while loop here?
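
To spell out the concern, here is a minimal sketch of the scan, written against a plain `IndexedSeq` rather than the scanner's buffered matches. `advanceUntilSatisfied` and its signature are hypothetical; only the `if` versus `while` distinction is taken from the diff above.

```scala
object BoundConditionScanSketch {
  /** Returns the index of the first buffered row at or after `start` that satisfies
    * `boundCondition`, or -1 if none does. */
  def advanceUntilSatisfied[A](buffered: IndexedSeq[A], start: Int)(
      boundCondition: A => Boolean): Int = {
    var idx = start
    // A plain `if` here would inspect only buffered(start) and miss later matches.
    while (idx < buffered.length && !boundCondition(buffered(idx))) {
      idx += 1
    }
    if (idx < buffered.length) idx else -1
  }
}
```

With a single buffered row per key group the `if` and `while` versions behave the same, so a test needs multiple buffered rows per group to expose the difference.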

yhuai · 2015-08-10T23:31:48Z

It will be helpful to add comments to explain the flow of these two join operators. The flow is not obvious from the code.

SparkQA · 2015-08-11T01:16:46Z

Test build #40345 has finished for PR 7904 at commit 899dce2.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class SortMergeOuterJoin(

SparkQA · 2015-08-11T02:06:20Z

Test build #40361 has finished for PR 7904 at commit e79909e.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class SortMergeOuterJoin(

JoshRosen · 2015-08-11T02:59:04Z

Jenkins, retest this please.

yhuai · 2015-08-11T03:51:39Z

I think the workflow of these join operators is correct. We can merge it once it passes the tests. I will do a post-hoc review then.

SparkQA · 2015-08-11T04:56:15Z

Test build #40383 has finished for PR 7904 at commit eabacca.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class SortMergeOuterJoin(

…t outer join

This patch adds a new `SortMergeOuterJoin` operator that performs left and right outer joins using sort merge join. It also refactors `SortMergeJoin` in order to improve performance and code clarity.

Along the way, I also performed a couple pieces of minor cleanup and optimization:

- Rename the `HashJoin` physical planner rule to `EquiJoinSelection`, since it's also used for non-hash joins.
- Rewrite the comment at the top of `HashJoin` to better explain the precedence for choosing join operators.
- Update `JoinSuite` to use `SqlTestUtils.withConf` for changing SQLConf settings.

This patch incorporates several ideas from adrian-wang's patch, #5717.

Closes #5717.

Author: Josh Rosen <joshrosen@databricks.com>
Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #7904 from JoshRosen/outer-join-smj and squashes 1 commits.

(cherry picked from commit 91e9389)
Signed-off-by: Reynold Xin <rxin@databricks.com>


adrian-wang mentioned this pull request Aug 4, 2015

[SPARK-7165] [SQL] use sort merge join for outer join #5717

Closed

JoshRosen reviewed Aug 4, 2015

JoshRosen force-pushed the outer-join-smj branch from bda0101 to dd8a94e Compare August 5, 2015 06:06

JoshRosen reviewed Aug 5, 2015

JoshRosen and others added 14 commits August 5, 2015 13:43

[SPARK-9054] [SQL] Rename RowOrdering to InterpretedOrdering; use new…

be19a0f

…Ordering in more places.

Import ordering

34b8e0c

Add comment RE: Ascending ordering

e610655

Squash @adrian-wang's changes.

df88548

Remove old TODO

58edb2e

Use withSQLConf in JoinSuite

9faa2ee

Use explicit toScala conversions in ShuffledHashOuterJoin.

8d83e15

Revert changes to SortMergeJoin; add new SortMergeOuterJoin operator

a471a6e

Fix join operator selection for outer join:

cf8c042

Previously, the planner would always choose sort-merge-join for outer joins, even in cases where broadcast outer join could be used.

Rename HashOuterJoin to OuterJoin.

a09d6e3

Clean up non-obvious side-effect in JoinedRow.with[Left|Right]

58b2d1c

Style cleanup in flatMap; use curly braces instead of parens.

07ef478

Move initialize() definition closer to usage.

c3c7ed4

Large refactoring of SMJ internals to improve clarity.

78714dd

JoshRosen added 2 commits August 9, 2015 00:26

It turns out that the copy is unnecessary.

7d3cc5d

Push null check into buffered iterator next().

f83b412

JoshRosen reviewed Aug 10, 2015

JoshRosen force-pushed the outer-join-smj branch from 48e49b9 to f83b412 Compare August 10, 2015 21:57

Improve unit test coverage of join physical operators.

81956b0

Expand test data to cover multiple buffered rows per group.

899dce2

yhuai reviewed Aug 10, 2015

JoshRosen added 3 commits August 10, 2015 17:18

Fix parallelism in join operator unit tests.

e79909e

Add regression test exposing bug with missing while loop

5c34f75

Fix while loops while adding regression tests.

c188a21

comment updates

eabacca

asfgit closed this in 91e9389 Aug 11, 2015

JoshRosen deleted the outer-join-smj branch October 16, 2015 21:48
