
Optimize outputs accumulating for SegmentTermsEnum and IntersectTermsEnum #12661

Closed
wants to merge 24 commits

Conversation

@gf2121 (Contributor) commented Oct 12, 2023

ISSUE: #12659

@jpountz (Contributor) left a comment


As you said in the issue, I'm not sure whether this is the root cause or whether it is rather due to this method being less hot than readVLong, but this change looks equally readable (if not more so) to me, so +1 to merge.

@gf2121 (Contributor, Author) commented Oct 13, 2023

It seems I can reproduce the slowdown locally with wikimediumall, and this patch does not help :(
I'll keep digging.

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                        PKLookup       99.06      (3.7%)       92.91      (4.3%)   -6.2% ( -13% -    1%) 0.000
                      AndHighMed       28.86      (4.3%)       27.07      (3.3%)   -6.2% ( -13% -    1%) 0.000
                         LowTerm      188.04      (8.1%)      179.73      (8.4%)   -4.4% ( -19% -   13%) 0.090
                         Respell       20.54      (2.0%)       19.74      (2.0%)   -3.9% (  -7% -    0%) 0.000
                    OrNotHighMed      146.02      (4.1%)      141.02      (4.7%)   -3.4% ( -11% -    5%) 0.014
                     AndHighHigh       25.42      (3.0%)       24.61      (3.1%)   -3.2% (  -8% -    3%) 0.001
                       OrHighMed       24.28      (3.5%)       23.51      (3.8%)   -3.2% ( -10% -    4%) 0.007
                     MedSpanNear       22.21      (2.1%)       21.55      (2.3%)   -3.0% (  -7% -    1%) 0.000
                 LowSloppyPhrase       13.81      (2.4%)       13.41      (3.1%)   -2.9% (  -8% -    2%) 0.001
                     LowSpanNear       29.48      (2.8%)       28.65      (2.8%)   -2.8% (  -8% -    2%) 0.001
                          Fuzzy1       36.25      (2.5%)       35.30      (3.3%)   -2.6% (  -8% -    3%) 0.005
                   OrHighNotHigh      159.11      (3.8%)      156.87      (4.9%)   -1.4% (  -9% -    7%) 0.311
                        Wildcard      131.03      (3.4%)      129.26      (4.6%)   -1.4% (  -9% -    6%) 0.291
                 MedSloppyPhrase        4.88      (2.6%)        4.81      (4.2%)   -1.3% (  -7% -    5%) 0.254
                    HighSpanNear        3.82      (3.6%)        3.78      (4.0%)   -1.2% (  -8% -    6%) 0.326
                         MedTerm      311.24      (4.4%)      307.78      (4.2%)   -1.1% (  -9% -    7%) 0.416
                      OrHighHigh       15.96      (4.1%)       15.79      (5.0%)   -1.1% (  -9% -    8%) 0.449
                          Fuzzy2       38.36      (2.6%)       38.05      (3.6%)   -0.8% (  -6% -    5%) 0.419
                   OrNotHighHigh      135.72      (3.8%)      134.71      (4.5%)   -0.7% (  -8% -    7%) 0.570
                HighSloppyPhrase        1.81      (2.4%)        1.80      (3.3%)   -0.7% (  -6% -    5%) 0.421
                        HighTerm      201.76      (4.9%)      200.45      (5.0%)   -0.7% ( -10% -    9%) 0.679
                       LowPhrase       16.16      (3.9%)       16.06      (3.2%)   -0.6% (  -7% -    6%) 0.599
                       OrHighLow      245.45      (4.6%)      245.42      (4.4%)   -0.0% (  -8% -    9%) 0.994
                       MedPhrase       48.67      (2.2%)       48.75      (4.1%)    0.2% (  -6% -    6%) 0.862
                    OrHighNotLow      213.59      (4.3%)      214.13      (4.1%)    0.3% (  -7% -    9%) 0.848
                    OrHighNotMed      200.45      (5.3%)      201.51      (4.3%)    0.5% (  -8% -   10%) 0.727
                    OrNotHighLow      246.33      (3.9%)      247.84      (3.7%)    0.6% (  -6% -    8%) 0.609
                      AndHighLow      328.47      (4.2%)      331.94      (4.5%)    1.1% (  -7% -   10%) 0.446
            HighIntervalsOrdered        5.31      (6.3%)        5.38      (5.3%)    1.2% (  -9% -   13%) 0.517
             LowIntervalsOrdered       41.61      (7.7%)       42.20      (6.3%)    1.4% ( -11% -   16%) 0.527
             MedIntervalsOrdered        8.25      (5.1%)        8.38      (4.7%)    1.5% (  -7% -   11%) 0.337
                      HighPhrase       65.11      (3.6%)       66.33      (6.1%)    1.9% (  -7% -   11%) 0.236
                         Prefix3      178.17      (4.1%)      181.58      (5.6%)    1.9% (  -7% -   12%) 0.219

@gf2121 (Contributor, Author) commented Oct 13, 2023

I ran an experiment with luceneutil without -r, which means we read the old format with the candidate code. The result looks good, proving that the regression is indeed caused by readMSBVLong.

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                         Prefix3       93.67      (3.8%)       91.90      (3.1%)   -1.9% (  -8% -    5%) 0.085
                       OrHighMed       24.75      (4.7%)       24.44      (4.5%)   -1.3% ( -10% -    8%) 0.391
                      AndHighLow      299.77      (4.3%)      297.03      (3.7%)   -0.9% (  -8% -    7%) 0.474
                         LowTerm      274.06      (8.4%)      271.71      (4.4%)   -0.9% ( -12% -   13%) 0.687
                      OrHighHigh       13.74      (3.6%)       13.63      (5.3%)   -0.8% (  -9% -    8%) 0.561
                      HighPhrase       58.48      (2.8%)       58.05      (5.0%)   -0.7% (  -8% -    7%) 0.568
                    OrNotHighMed      128.42      (4.4%)      127.52      (4.5%)   -0.7% (  -9% -    8%) 0.622
                          Fuzzy1       50.71      (2.3%)       50.37      (2.8%)   -0.7% (  -5% -    4%) 0.402
                       MedPhrase       68.38      (5.1%)       68.08      (4.4%)   -0.4% (  -9% -    9%) 0.769
                     AndHighHigh       10.35      (2.8%)       10.33      (2.8%)   -0.2% (  -5% -    5%) 0.844
                          Fuzzy2       28.57      (2.8%)       28.53      (2.0%)   -0.2% (  -4% -    4%) 0.832
                       LowPhrase        8.81      (3.7%)        8.80      (2.6%)   -0.1% (  -6% -    6%) 0.887
                   OrNotHighHigh      111.28      (5.4%)      111.31      (5.0%)    0.0% (  -9% -   10%) 0.983
                         Respell       25.65      (2.5%)       25.68      (2.3%)    0.1% (  -4% -    5%) 0.893
                    OrHighNotLow      147.62      (5.2%)      147.81      (5.6%)    0.1% ( -10% -   11%) 0.939
                        HighTerm      229.21      (4.5%)      229.52      (4.9%)    0.1% (  -8% -    9%) 0.927
                 MedSloppyPhrase        8.03      (2.3%)        8.04      (3.1%)    0.2% (  -5% -    5%) 0.797
                        PKLookup       97.60      (3.7%)       97.84      (4.1%)    0.2% (  -7% -    8%) 0.845
                   OrHighNotHigh      114.38      (4.4%)      114.78      (5.1%)    0.3% (  -8% -   10%) 0.818
             LowIntervalsOrdered        3.93      (2.2%)        3.95      (2.5%)    0.4% (  -4% -    5%) 0.628
                    OrNotHighLow      330.79      (4.4%)      332.00      (3.9%)    0.4% (  -7% -    8%) 0.778
                 LowSloppyPhrase        6.52      (2.9%)        6.55      (4.6%)    0.5% (  -6% -    8%) 0.690
                        Wildcard       89.99      (3.5%)       90.55      (2.9%)    0.6% (  -5% -    7%) 0.545
                     LowSpanNear       22.40      (2.3%)       22.54      (1.7%)    0.6% (  -3% -    4%) 0.335
                    HighSpanNear        0.78      (3.0%)        0.79      (2.4%)    0.7% (  -4% -    6%) 0.405
                     MedSpanNear        8.11      (3.2%)        8.19      (3.0%)    0.9% (  -5% -    7%) 0.345
                HighSloppyPhrase        5.66      (3.6%)        5.71      (4.6%)    1.0% (  -6% -    9%) 0.437
                      AndHighMed       51.51      (4.1%)       52.04      (3.9%)    1.0% (  -6% -    9%) 0.410
                       OrHighLow      130.83      (3.1%)      132.29      (2.8%)    1.1% (  -4% -    7%) 0.233
                    OrHighNotMed      142.66      (5.4%)      144.34      (5.4%)    1.2% (  -9% -   12%) 0.488
                         MedTerm      340.68      (3.7%)      344.79      (4.1%)    1.2% (  -6% -    9%) 0.329
            HighIntervalsOrdered        6.20      (6.4%)        6.28      (5.6%)    1.2% ( -10% -   14%) 0.520
             MedIntervalsOrdered        6.07      (9.6%)        6.18      (8.4%)    1.9% ( -14% -   21%) 0.505

@gf2121 gf2121 marked this pull request as draft October 13, 2023 04:43
@gf2121 (Contributor, Author) commented Oct 13, 2023

2000 PK task repeats per JVM:

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                        PKLookup      128.75      (3.0%)      122.15      (2.6%)   -5.1% ( -10% -    0%) 0.000

@gf2121 (Contributor, Author) commented Oct 13, 2023

I really cannot understand how this version of readMSBVLong can be slower than readVLong once the JVM is warmed up enough. I wonder if there could be some higher-level reason causing this regression, e.g. we need to do more output adds during search because we share more output prefixes on the writing side?

@gf2121 (Contributor, Author) commented Oct 13, 2023

Maybe this is the TODO we need to resolve.

@gf2121 (Contributor, Author) commented Oct 16, 2023

I made some effort to speed up the add operation for BytesRef outputs and got a tiny improvement:

Baseline: after #12631; Candidate: this patch; Luceneutil Random seed = 0

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                          Fuzzy1       50.96      (1.1%)       51.31      (1.6%)    0.7% (  -1% -    3%) 0.114
                        Wildcard       58.12      (2.2%)       58.55      (2.1%)    0.8% (  -3% -    5%) 0.267
                         Respell       32.20      (1.7%)       32.51      (1.3%)    1.0% (  -2% -    4%) 0.044
                          Fuzzy2       36.56      (1.2%)       37.00      (1.2%)    1.2% (  -1% -    3%) 0.002
                        PKLookup      106.36      (2.0%)      108.68      (2.4%)    2.2% (  -2% -    6%) 0.002

But this is still far slower than before the original patch:

Baseline: before #12631; Candidate: this patch; Luceneutil Random seed = 0

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                          Fuzzy1       53.95      (1.4%)       51.22      (1.3%)   -5.1% (  -7% -   -2%) 0.000
                        PKLookup      114.52      (2.5%)      108.73      (2.7%)   -5.0% (  -9% -    0%) 0.000
                          Fuzzy2       38.19      (1.2%)       36.89      (1.3%)   -3.4% (  -5% -    0%) 0.000
                         Respell       33.41      (2.2%)       32.37      (2.4%)   -3.1% (  -7% -    1%) 0.000
                        Wildcard       59.00      (2.6%)       58.15      (2.5%)   -1.4% (  -6% -    3%) 0.072

@jpountz (Contributor) commented Oct 16, 2023

If we're specializing the format anyway, I wonder if we could try different layouts. E.g. another option could be to encode the number of supplementary bytes using unary coding (like UTF8), which should help reduce the number of (hardly predictable) branches and maybe make decoding faster?
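For illustration, one such layout could look like the sketch below: a UTF-8-style header byte whose leading 1-bits announce the number of supplementary bytes, so the decoder learns the total length from a single byte instead of checking a continuation bit per byte. This is only a sketch of the idea with made-up names, not an encoding Lucene actually uses:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

final class UnaryPrefixedVLong {

  // Writes v with a UTF-8-style header: k leading 1-bits (terminated by a 0-bit)
  // announce k supplementary bytes; the header's remaining bits hold the top bits
  // of the value, and the supplementary bytes follow most-significant first.
  static void write(DataOutput out, long v) throws IOException {
    int bits = 64 - Long.numberOfLeadingZeros(v | 1L); // significant bits, at least 1
    int k = Math.min(8, (bits - 1) / 7);               // supplementary bytes needed
    if (k == 8) {
      out.writeByte(0xFF);                             // 8 leading ones: a full 64-bit value follows
      out.writeLong(v);
      return;
    }
    int header = (0xFF << (8 - k)) & 0xFF;             // k leading 1-bits, then a 0-bit
    header |= (int) (v >>> (8 * k)) & ((1 << (7 - k)) - 1);
    out.writeByte(header);
    for (int i = k - 1; i >= 0; i--) {                 // remaining bytes, most significant first
      out.writeByte((int) (v >>> (8 * i)) & 0xFF);
    }
  }

  // Reads a value written by write(): the header alone tells us how many bytes
  // follow, so there is no per-byte continuation check.
  static long read(DataInput in) throws IOException {
    int header = in.readByte() & 0xFF;
    int k = Integer.numberOfLeadingZeros(~(header << 24)); // count of leading 1-bits
    if (k == 8) {
      return in.readLong();
    }
    long v = header & ((1 << (7 - k)) - 1);             // payload bits kept in the header
    for (int i = 0; i < k; i++) {
      v = (v << 8) | (in.readByte() & 0xFF);
    }
    return v;
  }
}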

@gf2121 (Contributor, Author) commented Oct 17, 2023

Hi @jpountz, thanks a lot for the suggestion!

another option could be to encode the number of supplementary bytes using unary coding (like UTF8).

This is a great idea that would probably make readMSBVLong faster!

FYI, the direction I'm considering is that it's not "decoding the MSB VLong" that causes this regression, but rather "how the MSB VLong changes the FST structure":

  • For LSB VLong outputs, most or all of the bytes are stored in a single arc.
  • For MSB VLong outputs, bytes are split across many arcs for prefix sharing.

So we need to do more Outputs#read and Outputs#add calls for MSB VLong to get the whole output. Here is a comparison of call counts between LSB VLong (before #12631) and MSB VLong (after #12631):

                     LSB VLong    MSB VLong    diff
Outputs#read calls      116097       149803    +29.03%
Outputs#add calls          144       111568    +77377.78%

Unfortunately, ByteSequenceOutputs#add and ByteSequenceOutputs#read always need to construct new BytesRef objects, which is not efficient enough. This patch tried to speed up ByteSequenceOutputs#add a bit, getting the tiny improvement mentioned above. But we are still seeing the regression because add is still needed, while the original code simply ignored the NO_OUTPUT arcs (see the sketch below for roughly what each add has to do).
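For illustration, this is roughly the work a concatenating add has to do for every arc on the path; a simplified sketch, not the exact ByteSequenceOutputs code:

import org.apache.lucene.util.BytesRef;

// Simplified sketch: each add allocates a fresh array and copies both operands,
// so doing this once per matched arc quickly adds up.
static BytesRef addAllocating(BytesRef prefix, BytesRef output) {
  byte[] merged = new byte[prefix.length + output.length];
  System.arraycopy(prefix.bytes, prefix.offset, merged, 0, prefix.length);
  System.arraycopy(output.bytes, output.offset, merged, prefix.length, output.length);
  return new BytesRef(merged);
}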

So I'm not very sure that optimizing the output decoding can resolve the regression, as it does not look like the bottleneck to me, but I'd like to give it a try if you still think it is worthwhile :)

@jpountz (Contributor) commented Oct 17, 2023

Ohhhhh, your explanation makes sense, and I agree with you that a more efficient encoding would be unlikely to counterbalance the fact that more arcs need to be read per output. So this looks like a hard search/space trade-off: we either get fast reads or good compression, but we can't get both?

I wonder if extending the Outputs class directly would help, instead of storing data in an opaque byte[]?

@gf2121 (Contributor, Author) commented Oct 17, 2023

I wonder if extending the Outputs class directly would help, instead of storing data in an opaque byte[]?

Yes, reuse is exactly what Outputs wants to do! (See this TODO.) However, it seems that by contract Outputs#add should not modify the input objects (this is not declared in the javadoc but is assumed by callers). I was thinking about adding a new method to Outputs like the following, WDYT?

/**
 * Default to {@link Outputs#add}. Impls can override for instance reuse.
 */
public Adder<T> adder(T reuse) {
  return new Adder<>() {

    T result;

    @Override
    public void add(T newOne) {
      result = Outputs.this.add(result, newOne);
    }

    @Override
    public T result() {
      return result;
    }
  };
}

public interface Adder<T> {

  void add(T newOne);

  T result();

}
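A hypothetical caller-side sketch of how such an Adder could be used while walking an FST path (arcOutputsOnPath and scratch are made-up names for illustration; the real call sites would be the term enums accumulating arc outputs):

// Accumulate the outputs of the arcs on a matched path, letting an Outputs
// implementation reuse an internal buffer instead of allocating on every add().
Adder<BytesRef> adder = outputs.adder(scratch);
for (BytesRef arcOutput : arcOutputsOnPath) {
  adder.add(arcOutput);
}
BytesRef accumulated = adder.result();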

@gf2121 gf2121 changed the title from "read MSB VLong in new way" to "Optimize outputs accumulating as MSB VLong outputs sharing more output prefix" on Oct 17, 2023
@gf2121 gf2121 marked this pull request as ready for review October 17, 2023 09:35
@gf2121 (Contributor, Author) commented Oct 17, 2023

Hi @mikemccand, it would be great if you could take a look too :)

@gf2121 (Contributor, Author) commented Oct 17, 2023

It occurred to me that maybe we do not really need to combine all these BytesRefs into a single BytesRef; we can just build a DataInput view over a list of BytesRefs and read from that (see the sketch below).
Luckily, only the encoded MSB VLong can end up split across several arcs in the FST, while the floor data, which usually has a larger length, is guaranteed to be stored contiguously in a single arc.
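A minimal sketch of such a view, with illustrative names rather than the actual patch, and assuming the MSB-first, high-bit-continuation layout from #12631:

import java.util.List;
import org.apache.lucene.util.BytesRef;

// DataInput-style view over a list of BytesRef chunks: arc outputs on the path are
// read in sequence without first concatenating them into a single byte[].
final class BytesRefListInput {
  private final List<BytesRef> chunks;
  private int chunkIndex;
  private int posInChunk; // position relative to the current chunk's offset

  BytesRefListInput(List<BytesRef> chunks) {
    this.chunks = chunks;
  }

  byte readByte() {
    while (posInChunk == chunks.get(chunkIndex).length) { // skip exhausted or empty chunks
      chunkIndex++;
      posInChunk = 0;
    }
    BytesRef current = chunks.get(chunkIndex);
    return current.bytes[current.offset + posInChunk++];
  }

  // 7 value bits per byte, most significant group first; a set high bit means
  // another byte follows (assumed layout).
  long readMSBVLong() {
    long l = 0L;
    while (true) {
      byte b = readByte();
      l = (l << 7) | (b & 0x7FL);
      if ((b & 0x80) == 0) {
        return l;
      }
    }
  }
}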

With this optimization, I'm seeing PKLookup almost get back to its original speed:

Baseline: before #12631;
Candidate: this patch;
Luceneutil Random seed = 0

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                          Fuzzy2       39.46      (1.3%)       37.93      (1.0%)   -3.9% (  -6% -   -1%) 0.000
                         Respell       33.92      (2.3%)       33.04      (1.4%)   -2.6% (  -6% -    1%) 0.000
                        Wildcard       60.14      (2.6%)       58.73      (2.3%)   -2.3% (  -7% -    2%) 0.004
                          Fuzzy1       54.73      (1.5%)       53.62      (1.2%)   -2.0% (  -4% -    0%) 0.000
                        PKLookup      111.96      (2.3%)      111.45      (2.3%)   -0.5% (  -5% -    4%) 0.556

@gf2121 (Contributor, Author) commented Oct 18, 2023

So this looks like a hard search/space trade-off: we either get fast reads or good compression but we can't get both?

IMO, theoretically yes. We ignored some potential optimizations for #add and #read in the past, so we're suffering from regressions larger than the trade-off alone would explain. Once we take care of this, things should get better.

This PR optimizes #add and brings PKLookup back to its original speed (locally). I think it is ready for review. I'll raise another PR to try to optimize #read, which should probably reduce the regression for the Fuzzy queries.

long l = 0L;
while (true) {
  byte b = in.readByte();
gf2121 (Contributor, Author) commented on this change:


I kept this change here as I think it is more readable :)

@gf2121 gf2121 requested a review from jpountz October 18, 2023 06:29
@gf2121 gf2121 changed the title from "Optimize outputs accumulating as MSB VLong outputs sharing more output prefix" to "Optimize outputs accumulating for SegmentTermsEnum and IntersectTermsEnum" on Oct 18, 2023
@gf2121 (Contributor, Author) commented Oct 19, 2023

The direction has changed and this conversation has become too long to follow, so I raised #12699 to summarize it.

@gf2121 gf2121 closed this Oct 19, 2023