-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-2213] [SQL] sort merge join for spark sql #5208
Conversation
Test build #29224 has started for PR 5208 at commit
|
Test build #29224 has finished for PR 5208 at commit
|
Test FAILed. |
9220280
to
cb1e18d
Compare
Test build #29382 has started for PR 5208 at commit |
Test build #29382 has finished for PR 5208 at commit
|
Test FAILed. |
Test build #29383 has started for PR 5208 at commit |
Test build #29383 has finished for PR 5208 at commit
|
Test FAILed. |
Test build #29530 has started for PR 5208 at commit |
Test build #29530 has finished for PR 5208 at commit
|
Test FAILed. |
Test build #29532 has started for PR 5208 at commit |
Test build #29532 has finished for PR 5208 at commit
|
Test FAILed. |
Test build #29533 has started for PR 5208 at commit |
Test build #29533 has finished for PR 5208 at commit
|
Test FAILed. |
I am not getting this error locally... what's wrong? |
This exception only exists on current master, I didn't get this locally because I was working on a March-26 master. This could be a potential bug we introduced during this period. |
From the log, seems the output fields of the
|
yes, after rebase i can see this exception |
Test build #29584 has started for PR 5208 at commit |
Test build #29584 has finished for PR 5208 at commit
|
Test PASSed. |
* By default it will choose sort merge join. | ||
*/ | ||
private[spark] def autoSortMergeJoin: Boolean = | ||
getConf(AUTO_SORTMERGEJOIN, true.toString).toBoolean |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's make it false
as default, the SMJ should be experimental feature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, just use true for Jenkins testing.
|
||
if (meetsRequirements && compatible) { | ||
val withSort = if (needSort) { | ||
Sort(rowOrdering, global = false, withShuffle) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like what we do in SparkStrategies
, use execution.ExternalSort
when sqlContext.conf.externalSortEnabled
is true
.
Test build #30309 has started for PR 5208 at commit |
case (UnspecifiedDistribution, Seq(), child) => | ||
child | ||
case (UnspecifiedDistribution, rowOrdering, child) => | ||
Sort(rowOrdering, global = false, child) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use execution.ExternalSort
when sqlContext.conf.externalSortEnabled
is true.
Test build #30315 has started for PR 5208 at commit |
Test build #30309 has finished for PR 5208 at commit
|
Test FAILed. |
@@ -87,7 +126,12 @@ case class Exchange(newPartitioning: Partitioning, child: SparkPlan) extends Una | |||
implicit val ordering = new RowOrdering(sortingExpressions, child.output) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe this line is redundant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, I see... For RangePartitioner..
Test build #30315 has finished for PR 5208 at commit
|
Test FAILed. |
Test build #30319 has started for PR 5208 at commit |
Test build #30319 has finished for PR 5208 at commit
|
Test FAILed. |
Test build #30321 has started for PR 5208 at commit |
Test build #30304 has finished for PR 5208 at commit
|
Test PASSed. |
Test build #30321 has finished for PR 5208 at commit
|
Test PASSed. |
I manually fixed the conflicts while merging to master. Thanks! I'm excited to test out the performance of this new feature :) |
Thanks! |
Is this feature limited to equi-joins? |
@justmytwospence yes. |
Thanks for the initial work from @Ishiihara in #3173
This PR introduce a new join method of sort merge join, which firstly ensure that keys of same value are in the same partition, and inside each partition the Rows are sorted by key. Then we can run down both sides together, find matched rows using sort merge join. In this way, we don't have to store the whole hash table of one side as hash join, thus we have less memory usage. Also, this PR would benefit from #3438 , making the sorting phrase much more efficient.
We introduced a new configuration of "spark.sql.planner.sortMergeJoin" to switch between this(
true
) and ShuffledHashJoin(false
), probably we want the default value of it befalse
at first.