[SPARK-32399][SQL] Full outer shuffled hash join
### What changes were proposed in this pull request?

Add support for full outer join inside shuffled hash join. Currently, if the query is a full outer join, we only use sort merge join as the physical operator. However, sort merge join can be CPU and IO intensive when the input tables are large. Shuffled hash join, on the other hand, saves the sort CPU and IO compared to sort merge join, especially when the tables are large.

This PR implements the full outer join as follows:

* Process rows from the stream side by looking up the hash relation, and mark the matched rows from the build side:
  * when joining on a unique key, a `BitSet` is used to record matched rows from the build side (`key index` represents each row)
  * when joining on a non-unique key, a `HashSet[Long]` is used to record matched rows from the build side (`key index` + `value index` represent each row). `key index` is defined as the index into the key addressing array `longArray` in `BytesToBytesMap`. `value index` is defined as the iterator index of values for the same key.
* Process rows from the build side by iterating the hash relation, and filter out build-side rows that have already been looked up (done in `ShuffledHashJoinExec.fullOuterJoin`)

For context, this PR was originally implemented as follows (up to commit e332276):

1. Construct the hash relation from the build side, with an extra boolean value at the end of each row to track lookup information (done in `ShuffledHashJoinExec.buildHashedRelation` and `UnsafeHashedRelation.apply`).
2. Process rows from the stream side by looking up the hash relation, and mark the matched rows from the build side as looked up (done in `ShuffledHashJoinExec.fullOuterJoin`).
3. Process rows from the build side by iterating the hash relation, and filter out build-side rows that have already been looked up (done in `ShuffledHashJoinExec.fullOuterJoin`).

See discussion of pros and cons between these two approaches [here](#29342 (comment)), [here](#29342 (comment)) and [here](#29342 (comment)).

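
To make the two passes concrete, here is a minimal sketch of the non-unique-key case. It is not the code in this PR: plain Scala collections stand in for `HashedRelation` and `BytesToBytesMap`, the object name `FullOuterHashJoinSketch`, the `pack` helper, and its bit layout are hypothetical, and only key equality is checked (no additional join condition).

```scala
import scala.collection.mutable

// Simplified model only: plain collections replace Spark's hash relation.
object FullOuterHashJoinSketch {
  type Row = (Int, String) // (join key, payload)

  // Hypothetical encoding of (key index, value index) into a single Long.
  private def pack(keyIndex: Int, valueIndex: Int): Long =
    (keyIndex.toLong << 32) | (valueIndex.toLong & 0xFFFFFFFFL)

  def fullOuterJoin(
      stream: Seq[Row],
      build: Seq[Row]): Seq[(Option[Row], Option[Row])] = {
    // Build phase: group build rows by key. A key's position in `keys`
    // plays the role of "key index"; a row's position within its group
    // plays the role of "value index".
    val grouped: Map[Int, Seq[Row]] = build.groupBy(_._1)
    val keys: IndexedSeq[Int] = grouped.keys.toIndexedSeq
    val keyIndexOf: Map[Int, Int] = keys.zipWithIndex.toMap
    val values: IndexedSeq[Seq[Row]] = keys.map(grouped)

    val matched = mutable.HashSet.empty[Long]
    val out = mutable.ArrayBuffer.empty[(Option[Row], Option[Row])]

    // Pass 1: probe with each stream row and mark every matched build row.
    stream.foreach { s =>
      keyIndexOf.get(s._1) match {
        case Some(ki) =>
          values(ki).zipWithIndex.foreach { case (b, vi) =>
            matched += pack(ki, vi)
            out += ((Some(s), Some(b)))
          }
        case None =>
          out += ((Some(s), None)) // stream row without a build match
      }
    }

    // Pass 2: iterate the build side and emit rows that were never matched.
    keys.indices.foreach { ki =>
      values(ki).zipWithIndex.foreach { case (b, vi) =>
        if (!matched.contains(pack(ki, vi))) out += ((None, Some(b)))
      }
    }
    out.toSeq
  }
}
```

For a unique key, each key has exactly one build row, so the value index is always 0 and a bit set indexed by key index is enough; that is the motivation for the `BitSet` variant described above.
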
TODO: codegen for full outer shuffled hash join can be implemented in a follow-up PR.

### Why are the changes needed?

With the implementation in this PR, full outer shuffled hash join has the overhead of iterating the build side twice (once to build the hash map, and once to output non-matching rows) and iterating the stream side once. However, full outer sort merge join needs to iterate both sides twice, and sorting a large table can be more CPU and IO intensive. So full outer shuffled hash join can be more efficient than sort merge join when the stream side is much larger than the build side.

For the example query below, full outer SHJ was about 1.3X faster than full outer SMJ (roughly 21% less wall clock time by best time).

```
def shuffleHashJoin(): Unit = {
  val N: Long = 4 << 22
  withSQLConf(
    SQLConf.SHUFFLE_PARTITIONS.key -> "2",
    SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "20000000") {
    codegenBenchmark("shuffle hash join", N) {
      val df1 = spark.range(N).selectExpr(s"cast(id as string) as k1")
      val df2 = spark.range(N / 10).selectExpr(s"cast(id * 10 as string) as k2")
      val df = df1.join(df2, col("k1") === col("k2"), "full_outer")
      df.noop()
    }
  }
}
```

```
Running benchmark: shuffle hash join
  Running case: shuffle hash join off
  Stopped after 2 iterations, 16602 ms
  Running case: shuffle hash join on
  Stopped after 5 iterations, 31911 ms

Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Mac OS X 10.15.4
Intel(R) Core(TM) i9-9980HK CPU 2.40GHz
shuffle hash join:                        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
shuffle hash join off                              7900           8301         567          2.1         470.9       1.0X
shuffle hash join on                               6250           6382          95          2.7         372.5       1.3X
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit tests in `JoinSuite.scala`, `AbstractBytesToBytesMapSuite.java` and `HashedRelationSuite.scala`.

Closes #29342 from c21/full-outer-shj.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>