-
Notifications
You must be signed in to change notification settings - Fork 460
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[GLUTEN-3432][VL] Add support for BroadcastNestedLoopJoinExec #4565
Conversation
Run Gluten Clickhouse CI |
Run Gluten Clickhouse CI |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Surbhi-Vijay for working on this.
And broadcast's validation is now tricky. I'd suggest doing the work based on this ongoing patch.
Thanks @zhztheplayer for mentioning this. I will review this patch and change the current PR accordingly. |
@Surbhi-Vijay #4544 was merged. Do you have a plan to continue this work? Thanks. |
Yes @zhztheplayer, I am working on it and will publish it soon in the current week. |
ef9b75b
to
4049c36
Compare
Run Gluten Clickhouse CI |
*/ | ||
while (rowIterator.hasNext) { | ||
val unsafeRow = rowIterator.next().asInstanceOf[UnsafeRow] | ||
rowArray.append(unsafeRow.copy()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zhztheplayer can you please review this change and suggest if there is any better way to resolve it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Array[InternalRow]
requires all rows to be materialized so I think it's fine to use .copy()
in the initial version of this feature. We can still find a way to optimize the code later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And thank you for adding the detailed explanation in comment :)
Run Gluten Clickhouse CI |
2 similar comments
Run Gluten Clickhouse CI |
Run Gluten Clickhouse CI |
e7c1ce3
to
7dd7699
Compare
Run Gluten Clickhouse CI |
2 similar comments
Run Gluten Clickhouse CI |
Run Gluten Clickhouse CI |
I will fix the testcases and publish again. |
7b25883
to
37e2e33
Compare
Run Gluten Clickhouse CI |
1 similar comment
Run Gluten Clickhouse CI |
4da2eb2
to
7583869
Compare
Run Gluten Clickhouse CI |
2 similar comments
Run Gluten Clickhouse CI |
Run Gluten Clickhouse CI |
4631e7b
to
1fd7294
Compare
Run Gluten Clickhouse CI |
backends-velox/src/main/scala/io/glutenproject/backendsapi/velox/SparkPlanExecApiImpl.scala
Show resolved
Hide resolved
...-core/src/main/scala/io/glutenproject/execution/BroadcastNestedLoopJoinExecTransformer.scala
Outdated
Show resolved
Hide resolved
assertJoinBuildSide("SELECT /*+ MAPJOIN(t1, t2) */ * FROM t1 JOIN t2", blt, BuildLeft) | ||
// FULL JOIN && t1Size < t2Size => BuildLeft | ||
assertJoinBuildSide("SELECT /*+ MAPJOIN(t1, t2) */ * FROM t1 FULL JOIN t2", bl, BuildLeft) | ||
// FULL OUTER && t1Size < t2Size => BuildLeft | ||
assertJoinBuildSide("SELECT * FROM t1 FULL OUTER JOIN t2", bl, BuildLeft) | ||
// LEFT JOIN => BuildRight | ||
assertJoinBuildSide("SELECT /*+ MAPJOIN(t1, t2) */ * FROM t1 LEFT JOIN t2", bl, BuildRight) | ||
assertJoinBuildSide("SELECT /*+ MAPJOIN(t1, t2) */ * FROM t1 LEFT JOIN t2", blt, BuildRight) | ||
// RIGHT JOIN => BuildLeft | ||
assertJoinBuildSide("SELECT /*+ MAPJOIN(t1, t2) */ * FROM t1 RIGHT JOIN t2", bl, BuildLeft) | ||
assertJoinBuildSide("SELECT /*+ MAPJOIN(t1, t2) */ * FROM t1 RIGHT JOIN t2", blt, BuildLeft) | ||
|
||
/* #### test with broadcast hint #### */ | ||
// INNER JOIN && broadcast(t1) => BuildLeft | ||
assertJoinBuildSide("SELECT /*+ MAPJOIN(t1) */ * FROM t1 JOIN t2", bl, BuildLeft) | ||
assertJoinBuildSide("SELECT /*+ MAPJOIN(t1) */ * FROM t1 JOIN t2", blt, BuildLeft) | ||
// INNER JOIN && broadcast(t2) => BuildRight | ||
assertJoinBuildSide("SELECT /*+ MAPJOIN(t2) */ * FROM t1 JOIN t2", bl, BuildRight) | ||
assertJoinBuildSide("SELECT /*+ MAPJOIN(t2) */ * FROM t1 JOIN t2", blt, BuildRight) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do the statements check on physical plan only? Do we need test with result-checking to cover the added code?
I see we had enabled Gluten...JoinSuite
s but I am not sure if the test cases work for Gluten's BNLJ either. Would you like to help check that? Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GlutenInnerJoinSuite
and GlutenOuterJoinSuite
are enabled in the Gluten. Both of these suites check the answer of test cases. In these test cases, explicit joins are selected and since now we have added BNLJTransformer
, BNLJ is getting transformed into Gluten for inner
, leftOuter with BuildRight
and RightOuter with BuildLeft
joins.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here in GlutenBroadcastJoinSuite
, it is not checking the answer and only checking physical plan. This test case is a rewrite of spark test case in Gluten and I had to make changes in to get it passed in pipeline.
please let me know if I should add more test cases.
The few testcases I can think of is to add fallback test cases for other remaining join types i.e. FullOuter
, LeftOuter with BuildLeft
& RightOuter with BuildRight
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your explanation. If GlutenBroadcastJoinSuite
mainly checks against plan then we don't have to add answer-checking code in this suite. I was just worried whether we actually have at least some code to verify the correctness of the patch's code. If GlutenInnerJoinSuite
GlutenOuterJoinSuite
do work for BNLJ transformer, then it's fine enough to me.
The few testcases I can think of is to add fallback test cases for other remaining join types i.e. FullOuter, LeftOuter with BuildLeft & RightOuter with BuildRight.
That's great idea. You can decide whether to put code in this PR or in a new one. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zhztheplayer I will create a follow up PR covering below:
- Updating documentation
- Adding testcases for fallback scenarios
- Renaming broadcast API methods to generic names for readability.
val broadcastRDD = { | ||
val executionId = sparkContext.getLocalProperty(SQLExecution.EXECUTION_ID_KEY) | ||
BackendsApiManager.getBroadcastApiInstance | ||
.collectExecutionBroadcastHashTableId(executionId, buildTableId) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zhztheplayer can I rename methods under BroadcastApi
to make it more generic? Since now this is being used for both the modes i.e. hashedBroadcastMode
& IdentityBroadcastMode
. These method names containing Hashtable
might confuse other developers who may read this file later.
Rename collectExecutionBroadcastHashTableId
=> collectExecutionBroadcastTableId
Rename cleanExecutionBroadcastHashtable
=> cleanExecutionBroadcastTable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename collectExecutionBroadcastHashTableId => collectExecutionBroadcastTableId
Rename cleanExecutionBroadcastHashtable=> cleanExecutionBroadcastTable
Sounds fair to me. Thanks.
Run Gluten Clickhouse CI |
Run Gluten Clickhouse CI |
…BNLJTransformer
Run Gluten Clickhouse CI |
@zhztheplayer @zhouyuan I have rebased this PR again, can you please review and provide feedback if there are any? I will create follow up PR for remaining small task items like documentation, renaming broadcast api common methods for clarity etc. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@zhouyuan @zhztheplayer Can we please merge this PR? |
===== Performance report for TPCH SF2000 with Velox backend, for reference only ====
|
What changes were proposed in this pull request?
Added support for BroadcastNestedLoopJoinTransformer
Supported join types => Inner, Cross, LeftOuter with BuildRight, RightOuter with BuildLeft
(Fixes: #3432)
How was this patch tested?
Tests in GlutenInnerJoin, GlutenOuterJoin and Various other tests already covers it.
Follow Up: