
[perf] hashing function improvements #3242

Open
wants to merge 4 commits into base: main

Conversation

@romgrk (Contributor) commented Aug 21, 2024

While I was working on #3241 I noticed a few issues with your existing JavaScript implementation of murmur2; here are a few fixes. I've included a comparison with the WASM alternatives just for context, but the murmur2_original and murmur2_improved entries of this summary are the important ones. As you can see, the changes in this PR more than double the performance.

[screenshot: benchmark summary comparing murmur2_original, murmur2_improved, and the WASM alternatives]

changeset-bot (bot) commented Aug 21, 2024

⚠️ No Changeset found

Latest commit: 646e7ea

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


codesandbox-ci (bot) commented Aug 21, 2024

This pull request is automatically built and testable in CodeSandbox.

To see build info of the built libraries, click here or the icon next to each commit SHA.

packages/hash/src/index.ts: 2 outdated review threads (resolved)
@Andarist (Member) commented:

Have you confirmed that this performs better in all modern engines?

@romgrk (Contributor, Author) commented Aug 21, 2024

Have you confirmed that this performs better in all modern engines?

Here is a comparison. The benchmark code is available here and requires bun (for JSC) and gjs (for SpiderMonkey).

[screenshot: benchmark results across V8, JSC (bun) and SpiderMonkey (gjs)]
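
For context, the general shape of such a cross-engine comparison is roughly the sketch below. This is a hypothetical harness, not the linked benchmark code; bench, murmur2Original, murmur2Improved and styleStrings are illustrative names. Date.now() is used because it is available in V8 (node), JavaScriptCore (bun) and SpiderMonkey (gjs) alike.

// Hypothetical micro-benchmark harness; run the same file under node, bun and gjs.
function bench(name: string, hash: (input: string) => string, inputs: string[]): void {
  const iterations = 1000
  const start = Date.now()
  for (let i = 0; i < iterations; i++) {
    for (const input of inputs) hash(input)
  }
  console.log(`${name}: ${Date.now() - start}ms for ${iterations * inputs.length} hashes`)
}

// bench('murmur2_original', murmur2Original, styleStrings)
// bench('murmur2_improved', murmur2Improved, styleStrings)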

@Andarist (Member) commented:

It looks good, cc @emmatown


if (input.length > bufferLength) {
  // bufferLength must be a multiple of 4 to satisfy Int32Array constraints
  bufferLength = input.length + (4 - input.length % 4)
Member commented:

This isn't allocating enough space for strings whose UTF-8 length differs from their UTF-16 length, and will just ignore some input at the end?

@romgrk (Contributor, Author) replied:

Right, and there's no way to know the byte length until the string has been iterated/encoded. I'll allocate 2x the space to cover the worst case.
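
To make the allocation discussion concrete, here is a rough sketch of the kind of over-allocation being described (hypothetical code, not the PR's final implementation). Note that a single UTF-16 code unit can encode to as many as 3 UTF-8 bytes, so input.length * 3 is the strict worst case; the total is then rounded up to a multiple of 4 so the Int32Array view over the same buffer stays valid.

// Hypothetical sketch of growing a reused encode buffer; not the PR's actual code.
let bufferLength = 2048
let uint8View = new Uint8Array(bufferLength)
let int32View = new Int32Array(uint8View.buffer)

function ensureCapacity(input: string): void {
  // A UTF-16 code unit can expand to at most 3 UTF-8 bytes.
  const worstCase = input.length * 3
  if (worstCase > bufferLength) {
    // Keep the length a multiple of 4 to satisfy the Int32Array constraint.
    bufferLength = worstCase + (4 - (worstCase % 4))
    uint8View = new Uint8Array(bufferLength)
    int32View = new Int32Array(uint8View.buffer)
  }
}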

Comment on lines +16 to +31
if (hasTextEncoder === false) {
  const bytes = []
  for (let i = 0; i < input.length; i++) {
    const codePoint = input.charCodeAt(i)
    if (codePoint > 0xff) {
      bytes.push(codePoint >>> 8)
      bytes.push(codePoint & 0xff)
    } else {
      bytes.push(codePoint)
    }
  }
  uint8View = new Uint8Array(bytes)
  int32View = new Int32Array(uint8View.buffer, 0, Math.floor(bytes.length / 4))

  return bytes.length
}
Member commented:

This needs to encode like TextEncoder, since the hash needs to return the same results if, for example, TextEncoder is available when a page is server-rendered but not on the client.
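
For reference, a fallback that matches TextEncoder would have to emit real UTF-8, including surrogate pairs and the U+FFFD replacement for lone surrogates. A minimal sketch of such an encoder (illustrative only, not the code that ended up in the PR):

// Illustrative UTF-8 fallback matching TextEncoder's output, including
// surrogate pairs and lone surrogates (encoded as U+FFFD, like TextEncoder).
function encodeUtf8(input: string): Uint8Array {
  const bytes: number[] = []
  for (let i = 0; i < input.length; i++) {
    let codePoint = input.codePointAt(i)!
    if (codePoint >= 0xd800 && codePoint <= 0xdfff) {
      // Lone surrogate: TextEncoder emits U+FFFD.
      codePoint = 0xfffd
    }
    if (codePoint <= 0x7f) {
      bytes.push(codePoint)
    } else if (codePoint <= 0x7ff) {
      bytes.push(0xc0 | (codePoint >> 6), 0x80 | (codePoint & 0x3f))
    } else if (codePoint <= 0xffff) {
      bytes.push(
        0xe0 | (codePoint >> 12),
        0x80 | ((codePoint >> 6) & 0x3f),
        0x80 | (codePoint & 0x3f)
      )
    } else {
      bytes.push(
        0xf0 | (codePoint >> 18),
        0x80 | ((codePoint >> 12) & 0x3f),
        0x80 | ((codePoint >> 6) & 0x3f),
        0x80 | (codePoint & 0x3f)
      )
      i++ // skip the low surrogate consumed by codePointAt
    }
  }
  return new Uint8Array(bytes)
}

Running this and TextEncoder over the same string should produce identical byte sequences, which keeps server-generated and client-generated hashes in sync.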

  return encoder!.encodeInto(input, uint8View).written as number;
}

export default function murmur2(input: string): string {
@romgrk (Contributor, Author) commented:

While I was exploring the data a bit more I realized that the hashing function re-hashes the same strings a lot, and that we could speed it up by caching results in a Map.

One note before the results: previously I was testing with a dataset made of a bunch of copy/pasted MUI components. For this benchmark I switched to a new dataset extracted directly from the MUI dashboard template, which I think is more realistic and reflects production setups even better.

Here are the results, which show a substantial improvement for that dataset when caching the murmur2 result. The implementation is very naive; I found that trying to mutate/update the map with more complex rules actually reduced the gains by a lot, probably because the extra bookkeeping on every hash call is too much work. For comparison I also added a case that runs the cached version against a dataset with no duplicates at all, where the caching code is pure overhead.

[screenshot: benchmark results for cached vs. uncached murmur2 on the MUI dashboard dataset]

Last note: for this particular dataset, unlike the previous one, SpiderMonkey shows a small to moderate performance decrease from original to improved, and I'm not sure why. This new dataset seems to have somewhat shorter style strings, which might explain it. I still think original to improved is a net gain, considering that Firefox accounts for less than 3% of traffic right now.

So to sum this up, caching the hash result is a gain if we assume we're going to see a lot of duplicates following each other, which seems to be the case for MUI setups. I would love to have your opinion on this change.
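
To make the proposal concrete, the naive cache described above can be sketched roughly like this (hypothetical wrapper; murmur2 stands for the uncached hash in this module, and the real change may differ):

// Hypothetical memoization wrapper; not necessarily the PR's actual code.
declare function murmur2(input: string): string // the uncached implementation

const hashCache = new Map<string, string>()

export default function hash(input: string): string {
  let result = hashCache.get(input)
  if (result === undefined) {
    result = murmur2(input)
    hashCache.set(input, result)
  }
  return result
}

The hot path stays a single Map lookup; as noted above, adding eviction bookkeeping on every call tends to cost more than it saves for this workload.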
