Reconsider hashing of nulls #822

Open
alamb opened this issue Aug 4, 2021 · 0 comments
Labels
datafusion Changes in the datafusion crate enhancement New feature or request

alamb commented Aug 4, 2021

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

The create_hash function is responsible for hashing values in arrays. At the moment, however, it (effectively) hashes NULL values to 0 for all types, which likely leads to suboptimal behavior. For example, @Dandandan observed in #812 (comment) that the rows (NULL, 1) and (1, NULL) hash to the same value.
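A minimal sketch of why this collision happens, assuming a toy row hash in which a NULL column leaves the running hash untouched. The names `hash_value`, `combine`, and `row_hash_null_as_zero` are hypothetical and are not DataFusion's actual code:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash a single non-null value with the standard library hasher.
fn hash_value(v: i64) -> u64 {
    let mut h = DefaultHasher::new();
    v.hash(&mut h);
    h.finish()
}

/// Hypothetical combiner: fold one column hash into the running row hash.
fn combine(acc: u64, h: u64) -> u64 {
    acc.wrapping_mul(37).wrapping_add(h)
}

/// Toy row hash in which NULL contributes nothing to the result.
fn row_hash_null_as_zero(row: &[Option<i64>]) -> u64 {
    row.iter().fold(0u64, |acc, col| match col {
        Some(v) => combine(acc, hash_value(*v)),
        None => acc, // NULL leaves the running hash unchanged
    })
}
```

In this model, `row_hash_null_as_zero(&[None, Some(1)])` and `row_hash_null_as_zero(&[Some(1), None])` both reduce to `hash_value(1)`, so the position of the NULL is lost.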

Describe the solution you'd like
TBD

Describe alternatives you've considered
@jorgecarleitao's comment from #790 (comment), copied below, offers a few alternatives:

From the hashing side, an unknown to me atm is how to efficiently hash values+validity. I.e. given V = ["a", "", "c"] and N = [true, false, true], I see some options:

  • hash(V) ^ !N + unique * N where unique is a unique sentinel value exclusive for null values. If hash is vectorized, this operation is vectorized.

  • concat(hash(value), is_valid) for value, is_valid in zip(V,N)

  • split the array between nulls and not nulls, i.e. N -> (non-null indices, null indices), perform hashing over valid indices only, and then, at the very end, append all values for the nulls. We do this in the sort kernel, to reduce the number of slots to perform comparisons over.
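The three options above could be sketched roughly as follows. This is an illustration under assumptions, not an implementation from the issue: `N` is read as a validity mask (true means non-null), `hash_value` is a stand-in FNV-1a hash, and `NULL_SENTINEL` is a hypothetical constant. A 64-bit sentinel cannot be strictly exclusive to nulls, since a real value may hash to it.

```rust
/// Hypothetical sentinel reserved for NULL slots (an assumption, not a DataFusion constant).
const NULL_SENTINEL: u64 = 0x517C_C1B7_2722_0A95;

/// Stand-in value hash (FNV-1a); any vectorizable hash could take its place.
fn hash_value(s: &str) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325;
    for b in s.bytes() {
        h ^= b as u64;
        h = h.wrapping_mul(0x0000_0100_0000_01b3);
    }
    h
}

/// Option 1: hash every slot, then select hash or sentinel without branching.
fn hash_with_sentinel(values: &[&str], valid: &[bool]) -> Vec<u64> {
    values
        .iter()
        .zip(valid)
        .map(|(&v, &ok)| {
            let mask = (ok as u64).wrapping_neg(); // all ones if valid, zero if null
            (hash_value(v) & mask) | (NULL_SENTINEL & !mask)
        })
        .collect()
}

/// Option 2: keep the validity flag next to the value hash, so the key is a pair.
fn hash_with_validity(values: &[&str], valid: &[bool]) -> Vec<(u64, bool)> {
    values
        .iter()
        .zip(valid)
        // Null slots may hold arbitrary data, so their value hash is normalized to 0.
        .map(|(&v, &ok)| (if ok { hash_value(v) } else { 0 }, ok))
        .collect()
}

/// Option 3: split indices by validity, hash only valid slots, fill null slots last.
fn hash_partitioned(values: &[&str], valid: &[bool]) -> Vec<u64> {
    let (valid_idx, null_idx): (Vec<usize>, Vec<usize>) =
        (0..values.len()).partition(|&i| valid[i]);
    let mut out = vec![0u64; values.len()];
    for &i in &valid_idx {
        out[i] = hash_value(values[i]);
    }
    for &i in &null_idx {
        out[i] = NULL_SENTINEL;
    }
    out
}
```

With `V = ["a", "", "c"]` and `N = [true, false, true]`, options 1 and 3 yield the same vector, with `NULL_SENTINEL` in the middle slot, while option 2 yields `(0, false)` there. Option 1 hashes null slots needlessly but stays branch-free. Option 3 skips that work at the cost of building index lists.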

If we could write the code in a way that we could "easily" switch between implementations (during dev only, not a conf parameter), we could bench whether one wins over the other, or under which circumstances.
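One way to make the implementations easy to swap during development is to put each null-handling scheme behind a small trait and pick it with a type parameter. The sketch below is only an assumption about how that could look. `NullHashStrategy`, `NullAsZero`, `NullAsSentinel`, `mix`, and `hash_column_timed` are hypothetical names, and a real comparison would more likely use a proper benchmark harness than a single `Instant` measurement.

```rust
use std::time::{Duration, Instant};

/// Hypothetical strategy trait; each null-handling scheme is one implementor.
trait NullHashStrategy {
    fn hash_slot(value: u64, is_valid: bool) -> u64;
}

/// Stand-in value hash (multiplicative mixing), used by both strategies.
fn mix(v: u64) -> u64 {
    v.wrapping_mul(0x9E37_79B9_7F4A_7C15)
}

/// Scheme described in the issue: NULL hashes to 0.
struct NullAsZero;
/// Alternative scheme: NULL hashes to a fixed, made-up sentinel.
struct NullAsSentinel;

impl NullHashStrategy for NullAsZero {
    fn hash_slot(value: u64, is_valid: bool) -> u64 {
        if is_valid { mix(value) } else { 0 }
    }
}

impl NullHashStrategy for NullAsSentinel {
    fn hash_slot(value: u64, is_valid: bool) -> u64 {
        if is_valid { mix(value) } else { 0x517C_C1B7_2722_0A95 }
    }
}

/// Hash a column with strategy `S` and report how long the pass took.
fn hash_column_timed<S: NullHashStrategy>(
    values: &[u64],
    valid: &[bool],
) -> (Vec<u64>, Duration) {
    let start = Instant::now();
    let out: Vec<u64> = values
        .iter()
        .zip(valid)
        .map(|(&v, &ok)| S::hash_slot(v, ok))
        .collect();
    (out, start.elapsed())
}
```

Calling `hash_column_timed::<NullAsZero>(&values, &valid)` and `hash_column_timed::<NullAsSentinel>(&values, &valid)` on the same input then compares the two schemes without a configuration parameter, since the choice is made at compile time.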


@alamb alamb added the enhancement New feature or request label Aug 4, 2021
@alamb alamb added the datafusion Changes in the datafusion crate label Aug 4, 2021