Unicode & Python implementation #12
Someone addressed that issue and published a revised version to NPM here. (EDIT: My bad, the linked NPM package is not based on this project; it also expects you to convert strings to bytes, but provides examples of how to do that.) This package should work and handles the string conversion internally; it is a fork of this project's repo but seems intended for Node.js.

This can be a common issue when porting hash functions, since some languages default to UTF-8 strings while others (like JS) use UTF-16. More precisely, indexing into a string in Python 3, which is treated as a Unicode string, iterates through its individual code points (32-bit values). With Rust, you would iterate through individual UTF-8 bytes via `str::bytes()`, or through the 32-bit code points via `str::chars()`.

With JS and its UTF-16 encoded strings, however, you iterate through the string's sequence of 16-bit (2-byte) values, which may come in pairs for some glyphs/characters; these pairs are known as high and low surrogates. Iterating through the string gives you each 16-bit code unit, not necessarily a full code point.

For your example glyph that's a non-issue, as it's only 2 bytes in size. The problem that remains is that a 16-bit value is returned while the JS hash code expects/treats it as an 8-bit value, like regular ASCII text. The string needs to be converted to UTF-8, which will still be 2 bytes here (sometimes UTF-8 takes more bytes than UTF-16), and then it will work as expected.

Fun fact: for added confusion on this topic, some glyphs (or graphemes, rather) are composed of multiple Unicode characters. This is not uncommon with emoji, e.g. flag emoji built from a pair of regional-indicator code points.
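A quick sketch of the differences described above, runnable in Node.js; the specific strings are just illustrative picks:

```javascript
// UTF-16 code units vs. UTF-8 bytes for the same text.
const s = "é";                                    // U+00E9, one code point
console.log(s.length);                            // 1 UTF-16 code unit
console.log(new TextEncoder().encode(s).length);  // 2 UTF-8 bytes

const emoji = "😀";                               // U+1F600, outside the BMP
console.log(emoji.length);                        // 2 (a high/low surrogate pair)
console.log(emoji.charCodeAt(0).toString(16));    // "d83d" (high surrogate only)
console.log(emoji.codePointAt(0).toString(16));   // "1f600" (the full code point)
console.log(new TextEncoder().encode(emoji).length); // 4 UTF-8 bytes

// A grapheme composed of multiple code points (the "fun fact" case):
const flag = "🇺🇸";                               // two regional-indicator code points
console.log([...flag].length);                    // 2 code points, rendered as one glyph
```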
Python is correct. The lib in Python you used outputs a signed integer (so half the range can be a negative number); if output as an unsigned integer instead, it would always be a positive number (which this project does). There is no difference in the actual bytes (the hash result), just in how the number is interpreted. Thus both of the following are correct:
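The signed/unsigned reinterpretation is a one-liner in both languages; the hash value below is a made-up example, not an actual output of this function:

```javascript
// Same 32 bits, two readings: >>> 0 forces an unsigned 32-bit interpretation.
const signed = -1756908916;     // hypothetical signed hash output
const unsigned = signed >>> 0;
console.log(unsigned);          // 2538058380

// Python equivalent for going the other way or normalizing:
//   unsigned = signed & 0xFFFFFFFF
```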
The reason the JS version fails is, as mentioned, that JS is working with a UTF-16 string but doesn't iterate through bytes as intended, so it gets the wrong values to calculate with: murmurhash-js/murmurhash3_gc.js Line 47 in 0197ce3
Like the other lines there, the values from the string are taken with `charCodeAt()`, which returns 16-bit UTF-16 code units rather than UTF-8 bytes.
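To see why that matters: `charCodeAt()` only coincides with the UTF-8 byte for plain ASCII. These example characters are my own picks:

```javascript
// charCodeAt returns 16-bit code units; the hash loop treats them as bytes.
console.log("a".charCodeAt(0)); // 97   — fits in a byte, ASCII is safe
console.log("é".charCodeAt(0)); // 233  — one code unit, but UTF-8 is two bytes (0xC3 0xA9)
console.log("€".charCodeAt(0)); // 8364 — doesn't fit in a byte at all
```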
You can correctly convert the input string for processing by encoding it to UTF-8 first. Doing so, I was able to get the same value as Python. I could also reproduce the incorrect hash you provided in a Rust implementation of murmur3 by feeding it the 16-bit (UTF-16) encoding instead. |
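A minimal sketch of that conversion, assuming the hash function reads its input with `charCodeAt()` as above; `toByteString` is a hypothetical helper name, and the commented-out call shows where it would plug into this project's API:

```javascript
// Encode to UTF-8, then rebuild a "byte string" whose code units are all
// <= 255, so a charCodeAt-based hash loop sees the real UTF-8 bytes.
function toByteString(str) {
  const bytes = new TextEncoder().encode(str); // Uint8Array of UTF-8 bytes
  let out = "";
  for (const b of bytes) out += String.fromCharCode(b);
  return out;
}

// Hypothetical usage with this project's function:
// const hash = murmurhash3_32_gc(toByteString("é"), 0);
console.log(toByteString("é").length);        // 2
console.log(toByteString("é").charCodeAt(0)); // 195 (0xC3)
```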
I'm wondering how to get the same hash for a Unicode string in my browser using this function as I get from the same function in Python (https://pypi.org/project/murmurhash/, where it operates on either unicode or bytes).
Python:
I'm wondering which implementation is correct, or if there's anything I need to do to either one to get the outputs to match.