Optimize outputs accumulating for SegmentTermsEnum and IntersectTermsEnum #12661
Conversation
Like you said in the issue, I'm not sure if this is the root cause or if it is rather due to this method being less hot than readVLong, but this change looks equally readable (if not more) to me, so +1 to merge.
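For reference, the LSB-first readVLong being compared against decodes roughly like the sketch below (simplified; Lucene's actual DataInput.readVLong may differ in details such as loop unrolling, and the helper name here is made up):

import java.io.IOException;
import org.apache.lucene.store.DataInput;

// Simplified sketch of classic LSB-first VLong decoding: each byte carries 7 payload
// bits plus a continuation bit, with the low-order groups stored first.
static long readVLongSketch(DataInput in) throws IOException {
  byte b = in.readByte();
  long value = b & 0x7FL;
  for (int shift = 7; (b & 0x80) != 0; shift += 7) {
    b = in.readByte();
    value |= (b & 0x7FL) << shift; // low-order 7-bit groups come first
  }
  return value;
}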
It seems I can reproduce the slowdown locally with:
I made an experiment that runs luceneutil without
2000 PK task repeats per JVM:
I indeed cannot understand how this version of
Maybe this is the TODO we need to resolve:
I made some effort to speed up the
But it is still far slower than the original patch:
If we're specializing the format anyway, I wonder if we could try different layouts. E.g. another option could be to encode the number of supplementary bytes using unary coding (like UTF-8), which should help reduce the number of (hardly predictable) branches and maybe make decoding faster?
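To make that suggestion concrete, a decoder for such a layout could look roughly like the sketch below. This is only a hypothetical illustration of the idea, not Lucene code; the method name and the exact layout (leading 1-bits of the first byte give the supplementary byte count, the remaining bits hold the most significant payload bits) are assumptions.

import java.io.IOException;
import org.apache.lucene.store.DataInput;

// Hypothetical layout: like UTF-8, the number of leading 1-bits in the first byte is
// the number of supplementary bytes, so the decoder branches once on the first byte
// and then runs a loop whose trip count is known up front.
static long readUnaryPrefixedVLong(DataInput in) throws IOException {
  int first = in.readByte() & 0xFF;
  int numExtra = Integer.numberOfLeadingZeros((~first) & 0xFF) - 24; // 0..8 supplementary bytes
  long value = first & (0xFF >>> (numExtra + 1)); // payload bits left in the first byte
  for (int i = 0; i < numExtra; i++) {
    value = (value << 8) | (in.readByte() & 0xFFL); // most significant bytes first
  }
  return value;
}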
Hi @jpountz, thanks a lot for the suggestion!
This is a great idea that probably makes decoding faster. FYI, the direction I'm considering is that it's not "decoding the MSB VLong" that causes this regression, but "how the MSB VLong changes the FST structure":
So we will need to do more output accumulation per lookup.
Unfortunately, I'm not very sure the optimization of the output decoding can resolve the regression, as it does not look like the bottleneck to me, but I'd like to give it a try if you still think it is worth it :)
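To illustrate the structural point: a term lookup walks the FST arc by arc and has to add every partial output it passes, roughly as in the simplified sketch below (modeled on Util.get for byte-sequence outputs; it is not the block-tree enum code, and the helper name is made up):

import java.io.IOException;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.fst.ByteSequenceOutputs;
import org.apache.lucene.util.fst.FST;

// Every arc on the path may carry a partial output, and each one costs an add() call,
// so spreading the same output bytes over more arcs means more accumulation per lookup.
static BytesRef lookupSketch(FST<BytesRef> fst, BytesRef term) throws IOException {
  ByteSequenceOutputs outputs = ByteSequenceOutputs.getSingleton();
  FST.BytesReader fstReader = fst.getBytesReader();
  FST.Arc<BytesRef> arc = fst.getFirstArc(new FST.Arc<>());
  BytesRef output = outputs.getNoOutput();
  for (int i = 0; i < term.length; i++) {
    int label = term.bytes[term.offset + i] & 0xFF;
    if (fst.findTargetArc(label, arc, arc, fstReader) == null) {
      return null; // term is not in the FST
    }
    output = outputs.add(output, arc.output()); // one accumulation per visited arc
  }
  return arc.isFinal() ? outputs.add(output, arc.nextFinalOutput()) : null;
}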
Ohhhhh, your explanation makes sense, and I agree with you that a more efficient encoding is unlikely to help counterbalance the fact that more arcs need to be read per output. So this looks like a hard search/space trade-off: we either get fast reads or good compression, but we can't get both? I wonder if extending the
Yes, the reuse is exactly what
Hi @mikemccand, it would be great if you could take a look too :)
An idea came to me that maybe we do not really need to combine all these outputs eagerly. With this optimization, I'm seeing PKLookup almost get back to the original speed:
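As a purely hypothetical illustration of that idea (made-up class and method names, not the actual SegmentTermsEnum change): record the partial outputs as frames are pushed, and only concatenate them when the combined output is really needed:

import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.BytesRefBuilder;

// Lazy accumulation sketch: pushing a frame just records its partial output in O(1);
// the concatenation cost is paid once, only if the caller actually needs the combined
// output (e.g. to read the term's metadata).
class LazyOutputAccumulator {
  private final List<BytesRef> pending = new ArrayList<>();

  void push(BytesRef partialOutput) {
    pending.add(partialOutput);
  }

  void pop() {
    pending.remove(pending.size() - 1);
  }

  BytesRef materialize() {
    BytesRefBuilder combined = new BytesRefBuilder();
    for (BytesRef partial : pending) {
      combined.append(partial);
    }
    return combined.get();
  }
}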
IMO, theoretically yes. We ignored some potential optimizations for
This PR optimizes
long l = 0L;
while (true) {
  byte b = in.readByte();
  l = (l << 7) | (b & 0x7FL); // MSB-first: shift previous groups up, append the next 7 bits
  if ((b & 0x80) == 0) {      // continuation bit cleared: this was the last byte
    return l;
  }
}
I keep this change here as I think it is more readable :)
The direction has changed and the conversation is too long to track, so I raised #12699 to summarize these.
ISSUE: #12659