Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hash join with direct indexing #816

Open
Dandandan opened this issue Aug 2, 2021 · 1 comment
Open

Hash join with direct indexing #816

Dandandan opened this issue Aug 2, 2021 · 1 comment
Labels
enhancement New feature or request performance Make DataFusion faster

Comments

@Dandandan
Copy link
Contributor

Dandandan commented Aug 2, 2021

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
When the range of the join keys is small and we have integer keys, instead of using a hashtable we can use a array/vec which starts on the minimum value of the values in the build-side key range.

Idea from:
duckdb/duckdb#1959

Describe the solution you'd like
Filter on

  • join constraint should be on single integer (boolean/byte also possible, as long as it is limited in domain) keys.
  • no duplicates (?) - not sure whether its required
  • If no duplicates - > we can see how much unique values we have -> should be less than max limit, e.g. less than 100K values.
  • Get statistics on min/max values. It should be a small enough build-side (e.g. max 100K key values)

If this is all true we can copy the offsets to somethig like a Vec<Option<u64>> (or arrow equivalent) which contains the offsets starting at the minimum offset at index 0. Each element is indexed by x - MIN(x).

Instead of hashing the right-side values and probing, we can compute y - MIN(x) for the right side and index directly into the array. Checking hash collisions is not necessary for this approach.

Describe alternatives you've considered

Additional context

@Dandandan Dandandan added enhancement New feature or request performance Make DataFusion faster labels Aug 2, 2021
@alamb
Copy link
Contributor

alamb commented Aug 4, 2021

this is a cool idea (especially using statistics to switch it on)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request performance Make DataFusion faster
Projects
None yet
Development

No branches or pull requests

2 participants