Efficient trie lookup for boolean Unicode properties #33098
Conversation
Replace binary search of ranges with trie lookup using leaves of 64-bit bitmap chunks. Benchmarks suggest this is approximately 10x faster than the bsearch approach.
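For context, the leaf-level test the description refers to is roughly the following. This is a minimal sketch rather than the PR's actual generated code: each leaf is a 64-bit bitmap covering 64 consecutive code points, so membership becomes a shift and a mask instead of a binary search over ranges.

```rust
// Minimal sketch of a leaf test: one u64 chunk covers code points
// 64*k .. 64*k+63, and membership is a shift plus a mask.
fn leaf_contains(chunk: u64, c: u32) -> bool {
    ((chunk >> (c & 63)) & 1) != 0
}

fn main() {
    // A chunk with only the bit for 'A' (0x41, bit 0x41 & 63 == 1) set.
    let chunk: u64 = 1 << ('A' as u32 & 63);
    assert!(leaf_contains(chunk, 'A' as u32));
    assert!(!leaf_contains(chunk, 'B' as u32));
}
```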
Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @nrc (or someone else) soon. If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes. Please see the contribution instructions for more information.
@SimonSapin, who I discussed this with on IRC.
Somehow got in my head that >> 8 was the right shift for a chunk of 64. Oops, sorry.
Neat! Out of curiosity, do you know what impact this has on the compiled table sizes, if any?
@alexcrichton A very slight uptick, but almost a wash. By my calculations, the old table size is 29288 bytes and the new is 30451. If it were a concern, it'd be possible to get that back by special-casing the sets with a small max value, but honestly I'd say code simplicity is more important.
What is the perf impact of switching to (c >> 6) - 0x20 and (c >> 12) - 0x10? It would reduce the space used a bit, but I don't know if it would lose too much of the perf gain.
@samlh I just measured it; the performance difference is imperceptible. This makes sense: the subtraction probably happens in parallel with getting the address of r2 into a register. It would probably save 500 bytes; I think there are 11 tables. Happy to make the change if people feel it's worthwhile.
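To make the trade-off concrete, here is a small sketch of the two indexing schemes under discussion. The table name r2 and the bounds follow the comments above, but the code is illustrative rather than the generated source.

```rust
// Second-level index for code points in 0x800..0x10000 (illustrative).
// Without the subtraction, the table must span 0x10000 / 64 = 0x400 entries,
// even though the first 0x20 (covering c < 0x800) are handled elsewhere.
fn r2_index_unadjusted(c: u32) -> usize {
    (c >> 6) as usize // 0x20..0x400
}

// With the subtraction, the table shrinks to (0x10000 - 0x800) / 64 = 992
// entries, at the cost of one extra subtract on the lookup path.
fn r2_index_adjusted(c: u32) -> usize {
    ((c >> 6) - 0x20) as usize // 0..0x3e0
}

fn main() {
    assert_eq!(r2_index_unadjusted(0x800), 0x20);
    assert_eq!(r2_index_adjusted(0x800), 0);
}
```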
@@ -307,12 +307,114 @@ def emit_table(f, name, t_data, t_type = "&'static [(char, char)]", is_pub=True,
    format_table_content(f, data, 8)
    f.write("\n ];\n\n")


def emit_trie_lookup_range_table(f):
    f.write("""
pub struct BoolTrie { |
It'd be great to have a high-level overview of what all these fields represent and how the structure actually represents the data.
Agreed; with some comments I think this is good to go, though.
Adds a comment which explains the trie structure, and also does a little arithmetic on lookup (no measurable impact, looks like modern CPUs do this arithmetic in parallel with the memory lookup to find the node) to save a bit of space. As a result, the memory impact of the compiled tables is within a couple hundred bytes of the old bsearch-range structure.
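For readers without the diff open, the structure and lookup being described look roughly like this. It is a sketch of the shape under the assumptions discussed in this thread (three char ranges, u8 child indices, 64-bit leaves), not the exact code the generator emits.

```rust
// Rough sketch of the trie shape; array sizes follow the ranges discussed
// above, but the real arrays are emitted by the Python generator.
pub struct BoolTrie {
    // 0..0x800: leaves stored directly, indexed by c >> 6.
    pub r1: [u64; 0x800 / 64],

    // 0x800..0x10000: one level of u8 child indices into shared leaves.
    pub r2: [u8; (0x10000 - 0x800) / 64],
    pub r3: &'static [u64],

    // 0x10000..0x110000: two levels of u8 indices, then shared leaves.
    pub r4: [u8; (0x110000 - 0x10000) / 0x1000],
    pub r5: &'static [u8],
    pub r6: &'static [u64],
}

fn trie_range_leaf(c: usize, chunk: u64) -> bool {
    ((chunk >> (c & 63)) & 1) != 0
}

pub fn trie_lookup(r: &BoolTrie, c: char) -> bool {
    let c = c as usize;
    if c < 0x800 {
        trie_range_leaf(c, r.r1[c >> 6])
    } else if c < 0x10000 {
        let child = r.r2[(c >> 6) - 0x20];
        trie_range_leaf(c, r.r3[child as usize])
    } else {
        let child = r.r4[(c >> 12) - 0x10];
        let leaf_idx = r.r5[(child as usize) * 64 + ((c >> 6) & 63)];
        trie_range_leaf(c, r.r6[leaf_idx as usize])
    }
}
```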
Some questions / comments:

The BoolTrie and tables are pub, following the pattern of the existing code, but this feels like not-great encapsulation. I'd prefer the impl details to be private and just the query functions to be pub, but I'm not sure whether that would break anything. Thoughts?

Should I aggressively dead-code-eliminate the emit of the bsearch? It was useful to have both around during development; now I'm thinking only one is needed.

What's the feeling on tests? There are some in doc-tests, but coverage is low. For my own work I've been doing some randomly-generated tests, but I'm not sure how well that fits in here.

Should I do similar magic for case mapping? It's trickier because it's not just a bool, but possible. I'm thinking a different PR, though.
Ah no need to worry much about this. If you can get it private that'd be great, but don't sweat it too much. This crate is unstable so we're fine to change the API at any time basically. I thought the tables module was private regardless, but maybe it's not?
I wonder how hard it might be to just exhaustively test everything? In theory we could generate tests as part of the
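A hedged sketch of what such an exhaustive check could look like, assuming both the old range tables and the new tries stay available to the test. bsearch_range_table mirrors the existing range lookup, and the trie lookup is passed in as a closure so the names here don't have to match the generated identifiers.

```rust
use std::cmp::Ordering;

// Reference lookup over sorted, inclusive (start, end) ranges, mirroring
// the existing bsearch approach.
fn bsearch_range_table(c: char, ranges: &[(char, char)]) -> bool {
    ranges
        .binary_search_by(|&(lo, hi)| {
            if hi < c {
                Ordering::Less
            } else if lo > c {
                Ordering::Greater
            } else {
                Ordering::Equal
            }
        })
        .is_ok()
}

// Exhaustive cross-check over every valid scalar value; `trie` is whatever
// trie-based lookup is being validated.
fn check_exhaustive(trie: impl Fn(char) -> bool, ranges: &[(char, char)]) {
    for c in (0u32..=0x10FFFF).filter_map(std::char::from_u32) {
        assert_eq!(
            trie(c),
            bsearch_range_table(c, ranges),
            "mismatch at U+{:04X}",
            c as u32
        );
    }
}
```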
Separate PR sounds good to me, but if it's also 10x faster we'd certainly love to have the change :)
Another silly idea (not sure if worth doing): we could combine all the u64 tables into one, where the first 32 values are as before but can be reused as values for the other sets of leaves. This risks running out of indices for leaf nodes, but I think the current maximum total number of values is 249. The bigger question is whether this would save enough bytes to be at all worth the effort :P. Overall, I like the code as-is, though I agree that marking everything possible as private and removing old code would be nice.
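Loosely sketched, the combined-table idea might look like this. The names and layout are purely hypothetical (nothing like this exists in the PR): the first 32 chunks of the pool double as the old r1 for the low range, while the same entries, and the rest of the pool, are addressable by the u8 indices used at the deeper levels, with 256 as the ceiling those indices impose.

```rust
// Hypothetical sketch of the combined-leaf idea: one deduplicated u64 pool
// whose first 32 chunks play the old r1 role for c < 0x800, with every other
// level storing u8 indices into the same pool so identical chunks are kept once.
struct CombinedBoolTrie {
    leaves: &'static [u64], // first 32 entries cover 0..0x800 directly
    r2: [u8; 992],          // indices into `leaves` for 0x800..0x10000
    r4: [u8; 256],          // first-level indices for 0x10000..0x110000
    r5: &'static [u8],      // second-level indices into `leaves`
}

fn combined_lookup(t: &CombinedBoolTrie, c: char) -> bool {
    let c = c as usize;
    let leaf = if c < 0x800 {
        t.leaves[c >> 6]
    } else if c < 0x10000 {
        t.leaves[t.r2[(c >> 6) - 0x20] as usize]
    } else {
        let child = t.r4[(c >> 12) - 0x10] as usize;
        t.leaves[t.r5[child * 64 + ((c >> 6) & 63)] as usize]
    };
    ((leaf >> (c & 63)) & 1) != 0
}
```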
@raphlinus did you wanna update this with some more tests? Or do you think this is good to go? |
@alexcrichton I'd lean toward "good to go." There isn't an obvious direction the tests should go that doesn't lead to bloat of the code-gen scripts.
Ok, sounds good to me, thanks again @raphlinus!
⌛ Testing commit cfaf66c with merge c80d23d...
💔 Test failed - auto-mac-64-opt-rustbuild
@bors: retry
Umm... what? If that happens again I'll file a spurious issue
💥 Test timed out
@bors: retry |