Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hash function improvements #2290

Open
Riolku opened this issue Oct 27, 2023 · 3 comments
Open

Hash function improvements #2290

Riolku opened this issue Oct 27, 2023 · 3 comments

Comments

@Riolku
Copy link
Contributor

Riolku commented Oct 27, 2023

Currently, our hash function is quite naive, and it's very easy to cause very bad collisions in the hash index.

We should replace it. DuckDB seems to be using a very simple but efficient hash function, which seems to be a good choice based on some cursory reading online.

We should also change the name since we aren't actually using murmurhash.

@Riolku Riolku self-assigned this Oct 27, 2023
@Riolku
Copy link
Contributor Author

Riolku commented Oct 30, 2023

Changing the integer hash is straightforward, but there are some major remaining questions:

  1. What do we do about our CombineHash function? DuckDB uses a simple XOR, but this means ab and ba have the same hash, which is generally undesirable, or so I've been told.
  2. Semi-related, our string hash algorithm is std::hash. Should we change this? DuckDB and Postgres are both using something super specialized.

@semihsalihoglu-uw semihsalihoglu-uw changed the title Hash function improvement Hash function improvements Jan 8, 2024
@benjaminwinger
Copy link
Collaborator

Semi-related, our string hash algorithm is std::hash. Should we change this? DuckDB and Postgres are both using something super specialized.

Simple answer is yes: #2952.
DuckDB is using a string function from here, which is actually just MurmurHash2. I did a quick benchmark and the performance is almost identical to the rolling hash implementation I wrote for #2952; though I don't know if it produces better results. Postgres seems to be using a hash function from here; the author says that murmurhash is faster, but weaker, than his hash function (the postgres version has an update from 2006, but I think we'd be better off looking at newer hash function comparisons like smhasher if we want to find something better).

Also, I think this issue was already more or less implemented in 359c09d, so maybe this could be closed after #2952 is merged.

@Riolku
Copy link
Contributor Author

Riolku commented Feb 26, 2024

Yes, integer hashing was changed by me. String hashing, seems like you have a PR for. I think the other hash functions are defined in terms of those two.

@mewim mewim unassigned Riolku Sep 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants