Consolidate MergeJoin with HashJoin to adaptive join relations according to runtime resources and table sizes #2316

yjshen · 2022-04-22T07:34:59Z

A possible solution I could think of currently:

1. Choose HashJoin by default whenever no statistics indicate that both tables are large.
2. Track memory while building the hash table for the build side.
3. When the hash builder fails to grow its memory:
3.1. Sort the in-memory hash table and spill it to spill0, then free the memory.
3.2. Buffer the remaining incoming records for the build side until its input is exhausted, then sort them.
3.3. Buffer the records for the streaming side until its input is finished, then sort them.
3.4. MergeJoin the two sides.
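
The steps above could be sketched roughly as follows. This is only an illustration in Python, not DataFusion code: `adaptive_join`, `merge_join`, and `max_build_rows` are hypothetical names, the memory budget is approximated by a row count, and "spilling" is simulated with in-memory sorted lists rather than files. Merging `spill0` with the sorted remainder of the build side is also my assumption about how the two sorted runs would be combined before the final MergeJoin.

```python
import heapq
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

KEY = itemgetter(0)  # rows are (join_key, payload) tuples


def _groups(sorted_rows):
    """Yield (key, [payloads]) for consecutive rows sharing a key."""
    for key, rows in groupby(sorted_rows, key=KEY):
        yield key, [payload for _, payload in rows]


def merge_join(left_sorted, right_sorted):
    """Inner-join two row lists that are already sorted by key."""
    left, right = _groups(left_sorted), _groups(right_sorted)
    l, r = next(left, None), next(right, None)
    out = []
    while l is not None and r is not None:
        if l[0] < r[0]:
            l = next(left, None)
        elif l[0] > r[0]:
            r = next(right, None)
        else:
            # Matching key: emit the cross product of both groups.
            out.extend((l[0], lv, rv) for lv in l[1] for rv in r[1])
            l, r = next(left, None), next(right, None)
    return out


def adaptive_join(build_rows, probe_rows, max_build_rows):
    """Start as a hash join; fall back to a merge join if the build
    side exceeds the (hypothetical) row budget. Returns (strategy, rows)."""
    table = defaultdict(list)
    build_iter = iter(build_rows)
    held = 0
    for key, payload in build_iter:
        if held >= max_build_rows:
            # Step 3: the hash builder cannot grow any further.
            # 3.1: sort the hash table contents into "spill0", free memory.
            spill0 = sorted(
                ((k, v) for k, vs in table.items() for v in vs), key=KEY
            )
            table.clear()
            # 3.2: buffer the rest of the build side, then sort it.
            rest = sorted([(key, payload), *build_iter], key=KEY)
            build_sorted = list(heapq.merge(spill0, rest, key=KEY))
            # 3.3: buffer the streaming side, then sort it.
            probe_sorted = sorted(probe_rows, key=KEY)
            # 3.4: merge-join the two sorted sides.
            return "merge", merge_join(build_sorted, probe_sorted)
        table[key].append(payload)
        held += 1
    # The build side fit in the budget: plain hash join (steps 1 and 2).
    joined = [(k, bv, pv) for k, pv in probe_rows for bv in table.get(k, ())]
    return "hash", joined


build = [(2, "b2"), (1, "b1"), (2, "b2x"), (3, "b3")]
probe = [(2, "p2"), (4, "p4"), (1, "p1")]
hash_strategy, hash_rows = adaptive_join(build, probe, max_build_rows=10)
merge_strategy, merge_rows = adaptive_join(build, probe, max_build_rows=2)
```

Both paths produce the same set of joined rows; only the output order differs (probe order for the hash path, key order for the merge path), which a real operator would need to account for if downstream plans rely on ordering.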

yjshen mentioned this issue Apr 22, 2022

Memory Limited Joins (Externalized / Spill) #1599
