Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize outputs accumulating for SegmentTermsEnum and IntersectTermsEnum #12699

Merged
merged 7 commits into from
Nov 28, 2023

Conversation

gf2121
Copy link
Contributor

@gf2121 gf2121 commented Oct 19, 2023

Description

Previous talk is too long to track so i opened a new PR to make a summery here. More details are available in #12661.

After merging of #12631, we found a regression for term-dictionary-related tasks like PKLookup, Fuzzy ... After some digging I suspect that the regression is caused by more Outputs#add and Outputs#read on reading side, as the 'MSB VLong output format' making FST sharing more output prefix. The difference of calling times of Outputs#add and Outputs#read is shown below:

  LSB VLong MSB VLong diff
Outputs#read times 116097 149803 29.03%
Outputs#add times 144 111568 77377.78%

Solution

This patch tries to do two optimizations to get back the speed:

  • Instead of combining all outputs into single one, this patch collect all BytesRefs into an array and build a DataInput view over it, reducing the cost of objects construction and memory copy.
  • Instead of copying bytes into floor data, this patch directly uses the last output as floor data. Floor data is guaranteed continuous in single arc because we can not have two same fp encoded before floor data.

Benchmark

Both run 30 JVM round and 50 tasks per JVM:

Comparing to after merging of #12631

BaseLine ( After merging of #12631 )

$ git log
Write MSB VLong for better outputs sharing in block tree index (#12631)
DeletedTerms#clear should reset ByteBlockPool (#12630)

Candidate

$ git log
Optimize output accumulating
Write MSB VLong for better outputs sharing in block tree index
DeletedTerms#clear should reset ByteBlockPool

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                   OrHighNotHigh      166.79      (4.5%)      165.12      (4.3%)   -1.0% (  -9% -    8%) 0.376
                    OrHighNotLow      258.47      (4.3%)      256.46      (3.3%)   -0.8% (  -7% -    7%) 0.430
            HighIntervalsOrdered        0.38      (3.3%)        0.38      (3.1%)   -0.8% (  -6% -    5%) 0.359
                 LowSloppyPhrase        1.64      (3.4%)        1.63      (4.9%)   -0.7% (  -8% -    7%) 0.544
                   OrNotHighHigh      150.83      (5.1%)      149.86      (4.4%)   -0.6% (  -9% -    9%) 0.597
                HighSloppyPhrase        3.67      (3.3%)        3.64      (3.6%)   -0.6% (  -7% -    6%) 0.470
                         Prefix3       67.16      (4.3%)       66.78      (2.8%)   -0.6% (  -7% -    6%) 0.543
                    OrNotHighMed      135.65      (4.4%)      134.95      (3.7%)   -0.5% (  -8% -    7%) 0.620
                    OrNotHighLow      257.14      (2.0%)      255.85      (1.7%)   -0.5% (  -4% -    3%) 0.291
                     AndHighHigh       19.33      (2.8%)       19.25      (2.2%)   -0.4% (  -5% -    4%) 0.517
                    OrHighNotMed      128.51      (5.3%)      128.06      (4.2%)   -0.3% (  -9% -    9%) 0.778
                 MedSloppyPhrase        9.56      (3.6%)        9.54      (3.8%)   -0.3% (  -7% -    7%) 0.788
                      OrHighHigh       13.23      (3.3%)       13.20      (3.2%)   -0.2% (  -6% -    6%) 0.801
                       OrHighLow      199.40      (1.9%)      199.18      (1.7%)   -0.1% (  -3% -    3%) 0.812
                        Wildcard       69.22      (2.7%)       69.19      (1.9%)   -0.0% (  -4% -    4%) 0.952
                      AndHighMed       42.14      (3.7%)       42.14      (2.9%)    0.0% (  -6% -    6%) 0.993
                    HighSpanNear        7.00      (3.1%)        7.00      (3.1%)    0.1% (  -5% -    6%) 0.945
                          Fuzzy1       44.05      (1.8%)       44.08      (1.7%)    0.1% (  -3% -    3%) 0.872
                       OrHighMed       38.54      (2.6%)       38.59      (3.2%)    0.1% (  -5% -    6%) 0.871
                         Respell       28.40      (2.2%)       28.45      (2.2%)    0.2% (  -4% -    4%) 0.790
                     LowSpanNear        4.23      (1.7%)        4.24      (1.5%)    0.2% (  -2% -    3%) 0.580
             MedIntervalsOrdered        3.47      (6.6%)        3.49      (7.3%)    0.3% ( -12% -   15%) 0.856
             LowIntervalsOrdered        2.76      (4.3%)        2.77      (5.0%)    0.4% (  -8% -   10%) 0.761
                          Fuzzy2       35.28      (1.6%)       35.42      (1.6%)    0.4% (  -2% -    3%) 0.325
                      HighPhrase       38.91      (2.6%)       39.07      (2.5%)    0.4% (  -4% -    5%) 0.532
                         MedTerm      248.39      (4.2%)      249.48      (4.2%)    0.4% (  -7% -    9%) 0.685
                       LowPhrase       26.44      (2.0%)       26.57      (2.3%)    0.5% (  -3% -    4%) 0.385
                       MedPhrase        9.51      (2.3%)        9.57      (2.6%)    0.6% (  -4% -    5%) 0.365
                      AndHighLow      338.26      (3.0%)      340.61      (2.6%)    0.7% (  -4% -    6%) 0.339
                         LowTerm      242.60      (4.5%)      244.38      (4.8%)    0.7% (  -8% -   10%) 0.545
                     MedSpanNear        1.11      (1.4%)        1.12      (1.5%)    0.8% (  -2% -    3%) 0.042
                        HighTerm      230.63      (3.8%)      233.01      (4.6%)    1.0% (  -7% -    9%) 0.339
                        PKLookup      100.76      (2.9%)      105.17      (3.2%)    4.4% (  -1% -   10%) 0.000

Comparing to before merging of #12631

BaseLine ( Before merging of #12631 )

$ git log
DeletedTerms#clear should reset ByteBlockPool

Candidate

$ git log
Optimize output accumulating
Write MSB VLong for better outputs sharing in block tree index
DeletedTerms#clear should reset ByteBlockPool

(Not sure why seeing tiny speed up for Orxxx tasks. I suspect it is a noise)

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
             MedIntervalsOrdered        3.50      (7.9%)        3.43      (7.4%)   -2.2% ( -16% -   14%) 0.269
                        Wildcard       70.27      (2.9%)       69.22      (2.2%)   -1.5% (  -6% -    3%) 0.023
            HighIntervalsOrdered        0.38      (4.0%)        0.37      (4.0%)   -1.3% (  -8% -    6%) 0.194
             LowIntervalsOrdered        2.78      (5.3%)        2.75      (5.2%)   -1.1% ( -10% -    9%) 0.417
                         Prefix3       66.95      (4.3%)       66.25      (3.1%)   -1.0% (  -8% -    6%) 0.282
                         LowTerm      246.48      (8.3%)      244.15      (6.8%)   -0.9% ( -14% -   15%) 0.631
                       MedPhrase        9.57      (3.6%)        9.50      (3.1%)   -0.8% (  -7% -    6%) 0.366
                HighSloppyPhrase        3.63      (3.5%)        3.62      (3.8%)   -0.3% (  -7% -    7%) 0.761
                      HighPhrase       39.07      (3.5%)       38.96      (3.3%)   -0.3% (  -6% -    6%) 0.749
                        PKLookup      105.27      (2.8%)      105.13      (3.0%)   -0.1% (  -5% -    5%) 0.859
                 MedSloppyPhrase        9.49      (3.2%)        9.48      (3.9%)   -0.1% (  -6% -    7%) 0.920
                         Respell       28.45      (2.6%)       28.46      (2.2%)    0.0% (  -4% -    5%) 0.957
                 LowSloppyPhrase        1.63      (5.0%)        1.63      (5.6%)    0.1% (  -9% -   11%) 0.929
                       LowPhrase       26.31      (3.3%)       26.36      (2.1%)    0.2% (  -5% -    5%) 0.782
                          Fuzzy1       43.71      (1.3%)       43.82      (1.7%)    0.3% (  -2% -    3%) 0.509
                   OrHighNotHigh      164.97      (3.7%)      165.40      (4.4%)    0.3% (  -7% -    8%) 0.806
                     AndHighHigh       19.17      (2.1%)       19.23      (2.3%)    0.3% (  -3% -    4%) 0.569
                     MedSpanNear        1.11      (2.0%)        1.12      (2.0%)    0.6% (  -3% -    4%) 0.266
                      AndHighLow      338.06      (2.1%)      340.30      (1.9%)    0.7% (  -3% -    4%) 0.196
                     LowSpanNear        4.20      (2.4%)        4.23      (2.1%)    0.7% (  -3% -    5%) 0.194
                      OrHighHigh       13.15      (4.0%)       13.26      (4.5%)    0.8% (  -7% -    9%) 0.454
                   OrNotHighHigh      148.59      (3.9%)      149.94      (5.1%)    0.9% (  -7% -   10%) 0.442
                        HighTerm      229.24      (3.7%)      231.60      (4.2%)    1.0% (  -6% -    9%) 0.314
                    HighSpanNear        6.94      (3.3%)        7.01      (3.6%)    1.1% (  -5% -    8%) 0.224
                    OrHighNotMed      126.33      (4.3%)      127.74      (4.9%)    1.1% (  -7% -   10%) 0.343
                    OrHighNotLow      254.19      (3.1%)      257.19      (3.9%)    1.2% (  -5% -    8%) 0.200
                      AndHighMed       41.35      (5.0%)       42.00      (3.2%)    1.6% (  -6% -   10%) 0.148
                          Fuzzy2       34.61      (1.8%)       35.34      (2.1%)    2.1% (  -1% -    6%) 0.000
                       OrHighMed       37.84      (5.8%)       38.65      (4.2%)    2.1% (  -7% -   12%) 0.107
                    OrNotHighLow      248.52      (2.8%)      254.75      (2.5%)    2.5% (  -2% -    7%) 0.000
                       OrHighLow      193.81      (2.3%)      200.07      (2.5%)    3.2% (  -1% -    8%) 0.000
                    OrNotHighMed      130.29      (3.6%)      134.66      (4.7%)    3.4% (  -4% -   12%) 0.002
                         MedTerm      241.15      (3.8%)      249.46      (4.5%)    3.4% (  -4% -   12%) 0.001

Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Net/net I love this idea -- it's basically the same idea as StringBuilder (append small strings to list and concatenate only in the end), better than the O(N^2) "just keep concatenating a single bigger and bigger string" that FST does today.

I love that this change erases a longstanding TODO also!

I wonder if we could make this appendable output more widely accessible (in oal.fst.util) for other use cases that need Outputs<BytesRef>, but let's not do that here.

I left some small comments ... I think this is close. It would be nice to recover the PKLookup perf hit. Thanks @gf2121!

@mikemccand
Copy link
Member

I'll try to review the latest PR soon -- thanks @gf2121.

Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @gf2121 -- I left mostly smallish comments, but some head scratching -- hard to remember how these two crazy iterators work.

}

void pop() {
this.num--;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe assert num > 0 above this line?

Also you don't need the this. prefix when accessing the members in these methods.


void load(SegmentTermsEnum.OutputAccumulator outputAccumulator) throws IOException {
outputAccumulator.prepareRead();
final long code = ite.fr.readVLongOutput(outputAccumulator);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this silly final?

load(code);
}

void load(Long boxCode) throws IOException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm rename to blockCode? Not sure why it's a box :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I called it box because we are using a big Long... blockCode is better :)

}

void load(Long boxCode) throws IOException {
if (boxCode != null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add comment that this means this block is the first one in a possible sequence of floor blocks corresponding to a single seek point from the FST terms index?

@@ -373,9 +380,6 @@ public boolean seekExact(BytesRef target) throws IOException {

int cmp = 0;

// TODO: reverse vLong byte order for better FST
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yay, stale TODO garbage collection!

int index;

void push(BytesRef output) {
if (output != Lucene90BlockTreeTermsReader.NO_OUTPUT) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be better to just check if (output.length > 0) {? It'd handle cases (not sure this happens) where caller is passing an empty BytesRef that is not NO_OUTPUT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This object comparing is what we did before. I believe it is right because we have strict contract for Outputs operations.

Copy link
Member

@benwtrent benwtrent Dec 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While we have strict contracts, I can see that BytesSequenceOutputs#add(BytesRef, BytesRef) has assertions that the length is > 0.

Here in OutputAccumulator readByte() makes a big assumption that a BytesRef has at least length 1. If it had a length of 0, we would read past the ref end and read bytes sitting in a byte[] that we shouldn't.

IMO OutputAccumulator needs to be way more cautious than BytesSequenceOutputs#add(BytesRef, BytesRef) because OutputAccumulator isn't making copies and is relying on the underlying byte arrays not changing.

}

void setFloorData(ByteArrayDataInput floorData) {
assert outputIndex == num - 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we also assert index == 0 in this assert? I.e. we are at the start of the last BytesRef?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the index can be non-zero here. The last arc could contains suffix of the MSBVLong output.

this.current = outputs[0];
}

void setFloorData(ByteArrayDataInput floorData) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a comment that, while this reads/decodes bytes into floorData, it does not alter the read position?

floorDataReader.reset(floorData, 0, numBytes);
public void setFloorData(SegmentTermsEnum.OutputAccumulator outputAccumulator) {
outputAccumulator.setFloorData(floorDataReader);
rewindPos = floorDataReader.getPosition();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move this above line 107? It smells a bit putting this after -- it requires knowing setFloorData doesn't change the read position.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm i'm a bit confused. I think setFloorData changes the position of floorDataReader because it calls
floorDataReader#reset, so we can not move this line above?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Woops you're right! Sorry :)

@@ -247,7 +243,7 @@ void rewind() {
nextEnt = -1;
hasTerms = hasTermsOrig;
if (isFloor) {
floorDataReader.rewind();
floorDataReader.setPosition(rewindPos);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this is still needed? setFloorData didn't change the position? Or might other operations before the reset() have altered the read position? Hard to think about.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need this. The floorDataReader is a ByteArrayDataInput and its position will get changed when we read from it.

@gf2121
Copy link
Contributor Author

gf2121 commented Nov 22, 2023

Thanks for review @mikemccand !

but some head scratching -- hard to remember how these two crazy iterators work.

Agree that this is head scratching... I make a chart to try to explain this better.
 
image

Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great @gf2121 -- you want to merge and backport to 9.x?

@gf2121
Copy link
Contributor Author

gf2121 commented Nov 28, 2023

Thanks for review and great suggestions @mikemccand !

you want to merge and backport to 9.x?

Yes. I'll merge and backport this this.

@gf2121 gf2121 merged commit 9574cbd into apache:main Nov 28, 2023
4 checks passed
@gf2121 gf2121 added this to the 9.9.0 milestone Nov 28, 2023
@gf2121
Copy link
Contributor Author

gf2121 commented Nov 29, 2023

Hi @mikemccand @jpountz !

The newest nightly benchmark shows we get back some speed for PKLookUp, but still not as good as the performance before the merge of #12631.

This result is not quite same as what i saw locally, but reasonable for me as we do more Outputs#read anyway.

Since we are preparing for the release of 9.9.0. I'm not sure if it is an acceptable trade-off now. Should we backport this or revert #12631? I'd appreciate your suggestions :)

@mikemccand
Copy link
Member

I don't think we should revert #12631 -- that one was a nice disk space improvement, and the PKLookup performance is quite noisy -- going up or down by just as much when we upgrade JDKs or switch to Panama MMap impl. Thanks @gf2121.

gf2121 added a commit to gf2121/lucene that referenced this pull request Nov 29, 2023
@gf2121
Copy link
Contributor Author

gf2121 commented Nov 29, 2023

Thanks @mikemccand for feedback! I'll backport this to 9.9.0 then.

@jpountz
Copy link
Contributor

jpountz commented Nov 29, 2023

+1 to not reverting #12631. The fact that our terms index used to not share prefixes properly was a bug, I'm glad it's fixed even though it may make performance slightly slower because more nodes have non-empty outputs to contribute.

FWIW it looks like your change did not only help PKLookup, but possibly also CountTerm, Respell, Fuzzy1 and Fuzzy2.

rjernst added a commit to rjernst/lucene that referenced this pull request Dec 8, 2023
rjernst added a commit to rjernst/lucene that referenced this pull request Dec 8, 2023
rjernst added a commit to rjernst/elasticsearch that referenced this pull request Dec 8, 2023
This commit updates Lucene to a patched version of 9.9.0 without
apache/lucene#12699. See
apache/lucene#12895.
elasticsearchmachine pushed a commit to elastic/elasticsearch that referenced this pull request Dec 9, 2023
This commit updates Lucene to a patched version of 9.9.0 without
apache/lucene#12699. See
apache/lucene#12895.
javanna added a commit to javanna/lucene that referenced this pull request Dec 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants