[SPARK-32767][SQL] Bucket join should work if spark.sql.shuffle.partitions larger than bucket number #29612

wangyum · 2020-09-01T16:53:00Z

What changes were proposed in this pull request?

Bucket join should work when spark.sql.shuffle.partitions is larger than the bucket number. For example:

spark.range(1000).write.bucketBy(432, "id").saveAsTable("t1")
spark.range(1000).write.bucketBy(34, "id").saveAsTable("t2")
sql("set spark.sql.shuffle.partitions=600")
sql("set spark.sql.autoBroadcastJoinThreshold=-1")
sql("select * from t1 join t2 on t1.id = t2.id").explain()

Before this PR (both sides are shuffled into 600 partitions, so the bucketing is not used):

== Physical Plan ==
*(5) SortMergeJoin [id#26L], [id#27L], Inner
:- *(2) Sort [id#26L ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(id#26L, 600), true
:     +- *(1) Filter isnotnull(id#26L)
:        +- *(1) ColumnarToRow
:           +- FileScan parquet default.t1[id#26L] Batched: true, DataFilters: [isnotnull(id#26L)], Format: Parquet, PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>, SelectedBucketsCount: 432 out of 432
+- *(4) Sort [id#27L ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(id#27L, 600), true
      +- *(3) Filter isnotnull(id#27L)
         +- *(3) ColumnarToRow
            +- FileScan parquet default.t2[id#27L] Batched: true, DataFilters: [isnotnull(id#27L)], Format: Parquet, PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>, SelectedBucketsCount: 34 out of 34

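After this PR (t1 is read without a shuffle; only t2 is shuffled, into 432 partitions to match t1's bucket count):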

== Physical Plan ==
*(4) SortMergeJoin [id#26L], [id#27L], Inner
:- *(1) Sort [id#26L ASC NULLS FIRST], false, 0
:  +- *(1) Filter isnotnull(id#26L)
:     +- *(1) ColumnarToRow
:        +- FileScan parquet default.t1[id#26L] Batched: true, DataFilters: [isnotnull(id#26L)], Format: Parquet, PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>, SelectedBucketsCount: 432 out of 432
+- *(3) Sort [id#27L ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(id#27L, 432), true
      +- *(2) Filter isnotnull(id#27L)
         +- *(2) ColumnarToRow
            +- FileScan parquet default.t2[id#27L] Batched: true, DataFilters: [isnotnull(id#27L)], Format: Parquet, PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>, SelectedBucketsCount: 34 out of 34
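
The difference between the two plans can be summarized with a small toy model. This is only a sketch inferred from the plans above, not Spark's planner code: the object `BucketJoinSketch`, the `Decision` case class, and the `confIsFloor` flag are hypothetical names introduced for illustration, and the model only covers the case where both join sides are bucketed on the join key.

```scala
// Toy model only: hypothetical names, inferred from the two plans above.
// It is NOT the logic in Spark's planner, just a way to reproduce its visible outcome.
object BucketJoinSketch {

  /** Partition count both join sides must end up with, and which sides need an Exchange. */
  final case class Decision(targetPartitions: Int, shuffledSides: Seq[String])

  /**
   * @param sides             table name -> bucket count (its partition count when read bucketed)
   * @param shufflePartitions value of spark.sql.shuffle.partitions
   * @param confIsFloor       true models the "before" plan, where the config value acts as a
   *                          lower bound on the target; false models the "after" plan
   */
  def decide(sides: Seq[(String, Int)], shufflePartitions: Int, confIsFloor: Boolean): Decision = {
    val maxBuckets = sides.map(_._2).max
    // "Before": the target is pushed up to the config value when that is larger.
    // "After": the largest bucket count wins, so that side can skip the shuffle.
    val target = if (confIsFloor) math.max(maxBuckets, shufflePartitions) else maxBuckets
    // Any side whose bucket count differs from the target needs an Exchange.
    val shuffled = sides.collect { case (name, buckets) if buckets != target => name }
    Decision(target, shuffled)
  }
}
```

With the tables from the example, `BucketJoinSketch.decide(Seq("t1" -> 432, "t2" -> 34), 600, confIsFloor = true)` yields a target of 600 with both `t1` and `t2` shuffled, matching the first plan. With `confIsFloor = false` the target is 432 and only `t2` is shuffled, matching the second plan.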

Why are the changes needed?

Spark 2.4 supports this.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit test.

SparkQA · 2020-09-01T21:33:31Z

Test build #128155 has finished for PR 29612 at commit e59df34.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-09-02T04:16:18Z

thanks, merging to master!

…partitions larger than bucket number

This backports #29612 to branch-3.0. Original PR description:

### What changes were proposed in this pull request?

Bucket join should work if `spark.sql.shuffle.partitions` larger than bucket number, such as:

```scala
spark.range(1000).write.bucketBy(432, "id").saveAsTable("t1")
spark.range(1000).write.bucketBy(34, "id").saveAsTable("t2")
sql("set spark.sql.shuffle.partitions=600")
sql("set spark.sql.autoBroadcastJoinThreshold=-1")
sql("select * from t1 join t2 on t1.id = t2.id").explain()
```

Before this pr:

```
== Physical Plan ==
*(5) SortMergeJoin [id#26L], [id#27L], Inner
:- *(2) Sort [id#26L ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(id#26L, 600), true
:     +- *(1) Filter isnotnull(id#26L)
:        +- *(1) ColumnarToRow
:           +- FileScan parquet default.t1[id#26L] Batched: true, DataFilters: [isnotnull(id#26L)], Format: Parquet, PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>, SelectedBucketsCount: 432 out of 432
+- *(4) Sort [id#27L ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(id#27L, 600), true
      +- *(3) Filter isnotnull(id#27L)
         +- *(3) ColumnarToRow
            +- FileScan parquet default.t2[id#27L] Batched: true, DataFilters: [isnotnull(id#27L)], Format: Parquet, PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>, SelectedBucketsCount: 34 out of 34
```

After this pr:

```
== Physical Plan ==
*(4) SortMergeJoin [id#26L], [id#27L], Inner
:- *(1) Sort [id#26L ASC NULLS FIRST], false, 0
:  +- *(1) Filter isnotnull(id#26L)
:     +- *(1) ColumnarToRow
:        +- FileScan parquet default.t1[id#26L] Batched: true, DataFilters: [isnotnull(id#26L)], Format: Parquet, PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>, SelectedBucketsCount: 432 out of 432
+- *(3) Sort [id#27L ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(id#27L, 432), true
      +- *(2) Filter isnotnull(id#27L)
         +- *(2) ColumnarToRow
            +- FileScan parquet default.t2[id#27L] Batched: true, DataFilters: [isnotnull(id#27L)], Format: Parquet, PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>, SelectedBucketsCount: 34 out of 34
```

### Why are the changes needed?

Spark 2.4 support this.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #29624 from wangyum/SPARK-32767-3.0.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>

Bucket join should work if SHUFFLE_PARTITIONS larger than bucket number

e59df34

cloud-fan approved these changes Sep 1, 2020

cloud-fan closed this in 54348db Sep 2, 2020

wangyum mentioned this pull request Sep 2, 2020

[SPARK-32767][SQL][3.0] Bucket join should work if spark.sql.shuffle.partitions larger than bucket number #29624

Closed

sunchao mentioned this pull request Dec 22, 2021

[SPARK-35703][SQL] Relax constraint for bucket join and remove HashClusteredDistribution #32875

Closed
