
Optimize outputs accumulating for SegmentTermsEnum and IntersectTermsEnum #12661

Closed
wants to merge 24 commits

Conversation

@gf2121 (Contributor) commented Oct 12, 2023

ISSUE: #12659

@jpountz (Contributor) left a comment


As you said in the issue, I'm not sure whether this is the root cause or whether it is rather due to this method being less hot than readVLong, but this change looks equally readable (if not more so) to me, so +1 to merge.

@gf2121 (Contributor, Author) commented Oct 13, 2023

It seems I can reproduce the slowdown locally with wikimediumall, and this patch does not help :(
I'll keep digging.

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                        PKLookup       99.06      (3.7%)       92.91      (4.3%)   -6.2% ( -13% -    1%) 0.000
                      AndHighMed       28.86      (4.3%)       27.07      (3.3%)   -6.2% ( -13% -    1%) 0.000
                         LowTerm      188.04      (8.1%)      179.73      (8.4%)   -4.4% ( -19% -   13%) 0.090
                         Respell       20.54      (2.0%)       19.74      (2.0%)   -3.9% (  -7% -    0%) 0.000
                    OrNotHighMed      146.02      (4.1%)      141.02      (4.7%)   -3.4% ( -11% -    5%) 0.014
                     AndHighHigh       25.42      (3.0%)       24.61      (3.1%)   -3.2% (  -8% -    3%) 0.001
                       OrHighMed       24.28      (3.5%)       23.51      (3.8%)   -3.2% ( -10% -    4%) 0.007
                     MedSpanNear       22.21      (2.1%)       21.55      (2.3%)   -3.0% (  -7% -    1%) 0.000
                 LowSloppyPhrase       13.81      (2.4%)       13.41      (3.1%)   -2.9% (  -8% -    2%) 0.001
                     LowSpanNear       29.48      (2.8%)       28.65      (2.8%)   -2.8% (  -8% -    2%) 0.001
                          Fuzzy1       36.25      (2.5%)       35.30      (3.3%)   -2.6% (  -8% -    3%) 0.005
                   OrHighNotHigh      159.11      (3.8%)      156.87      (4.9%)   -1.4% (  -9% -    7%) 0.311
                        Wildcard      131.03      (3.4%)      129.26      (4.6%)   -1.4% (  -9% -    6%) 0.291
                 MedSloppyPhrase        4.88      (2.6%)        4.81      (4.2%)   -1.3% (  -7% -    5%) 0.254
                    HighSpanNear        3.82      (3.6%)        3.78      (4.0%)   -1.2% (  -8% -    6%) 0.326
                         MedTerm      311.24      (4.4%)      307.78      (4.2%)   -1.1% (  -9% -    7%) 0.416
                      OrHighHigh       15.96      (4.1%)       15.79      (5.0%)   -1.1% (  -9% -    8%) 0.449
                          Fuzzy2       38.36      (2.6%)       38.05      (3.6%)   -0.8% (  -6% -    5%) 0.419
                   OrNotHighHigh      135.72      (3.8%)      134.71      (4.5%)   -0.7% (  -8% -    7%) 0.570
                HighSloppyPhrase        1.81      (2.4%)        1.80      (3.3%)   -0.7% (  -6% -    5%) 0.421
                        HighTerm      201.76      (4.9%)      200.45      (5.0%)   -0.7% ( -10% -    9%) 0.679
                       LowPhrase       16.16      (3.9%)       16.06      (3.2%)   -0.6% (  -7% -    6%) 0.599
                       OrHighLow      245.45      (4.6%)      245.42      (4.4%)   -0.0% (  -8% -    9%) 0.994
                       MedPhrase       48.67      (2.2%)       48.75      (4.1%)    0.2% (  -6% -    6%) 0.862
                    OrHighNotLow      213.59      (4.3%)      214.13      (4.1%)    0.3% (  -7% -    9%) 0.848
                    OrHighNotMed      200.45      (5.3%)      201.51      (4.3%)    0.5% (  -8% -   10%) 0.727
                    OrNotHighLow      246.33      (3.9%)      247.84      (3.7%)    0.6% (  -6% -    8%) 0.609
                      AndHighLow      328.47      (4.2%)      331.94      (4.5%)    1.1% (  -7% -   10%) 0.446
            HighIntervalsOrdered        5.31      (6.3%)        5.38      (5.3%)    1.2% (  -9% -   13%) 0.517
             LowIntervalsOrdered       41.61      (7.7%)       42.20      (6.3%)    1.4% ( -11% -   16%) 0.527
             MedIntervalsOrdered        8.25      (5.1%)        8.38      (4.7%)    1.5% (  -7% -   11%) 0.337
                      HighPhrase       65.11      (3.6%)       66.33      (6.1%)    1.9% (  -7% -   11%) 0.236
                         Prefix3      178.17      (4.1%)      181.58      (5.6%)    1.9% (  -7% -   12%) 0.219

@gf2121 (Contributor, Author) commented Oct 13, 2023

I ran an experiment with luceneutil without -r, which means we read the old format with the candidate code. The result looks good, proving that the regression is indeed caused by readMSBVLong.

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                         Prefix3       93.67      (3.8%)       91.90      (3.1%)   -1.9% (  -8% -    5%) 0.085
                       OrHighMed       24.75      (4.7%)       24.44      (4.5%)   -1.3% ( -10% -    8%) 0.391
                      AndHighLow      299.77      (4.3%)      297.03      (3.7%)   -0.9% (  -8% -    7%) 0.474
                         LowTerm      274.06      (8.4%)      271.71      (4.4%)   -0.9% ( -12% -   13%) 0.687
                      OrHighHigh       13.74      (3.6%)       13.63      (5.3%)   -0.8% (  -9% -    8%) 0.561
                      HighPhrase       58.48      (2.8%)       58.05      (5.0%)   -0.7% (  -8% -    7%) 0.568
                    OrNotHighMed      128.42      (4.4%)      127.52      (4.5%)   -0.7% (  -9% -    8%) 0.622
                          Fuzzy1       50.71      (2.3%)       50.37      (2.8%)   -0.7% (  -5% -    4%) 0.402
                       MedPhrase       68.38      (5.1%)       68.08      (4.4%)   -0.4% (  -9% -    9%) 0.769
                     AndHighHigh       10.35      (2.8%)       10.33      (2.8%)   -0.2% (  -5% -    5%) 0.844
                          Fuzzy2       28.57      (2.8%)       28.53      (2.0%)   -0.2% (  -4% -    4%) 0.832
                       LowPhrase        8.81      (3.7%)        8.80      (2.6%)   -0.1% (  -6% -    6%) 0.887
                   OrNotHighHigh      111.28      (5.4%)      111.31      (5.0%)    0.0% (  -9% -   10%) 0.983
                         Respell       25.65      (2.5%)       25.68      (2.3%)    0.1% (  -4% -    5%) 0.893
                    OrHighNotLow      147.62      (5.2%)      147.81      (5.6%)    0.1% ( -10% -   11%) 0.939
                        HighTerm      229.21      (4.5%)      229.52      (4.9%)    0.1% (  -8% -    9%) 0.927
                 MedSloppyPhrase        8.03      (2.3%)        8.04      (3.1%)    0.2% (  -5% -    5%) 0.797
                        PKLookup       97.60      (3.7%)       97.84      (4.1%)    0.2% (  -7% -    8%) 0.845
                   OrHighNotHigh      114.38      (4.4%)      114.78      (5.1%)    0.3% (  -8% -   10%) 0.818
             LowIntervalsOrdered        3.93      (2.2%)        3.95      (2.5%)    0.4% (  -4% -    5%) 0.628
                    OrNotHighLow      330.79      (4.4%)      332.00      (3.9%)    0.4% (  -7% -    8%) 0.778
                 LowSloppyPhrase        6.52      (2.9%)        6.55      (4.6%)    0.5% (  -6% -    8%) 0.690
                        Wildcard       89.99      (3.5%)       90.55      (2.9%)    0.6% (  -5% -    7%) 0.545
                     LowSpanNear       22.40      (2.3%)       22.54      (1.7%)    0.6% (  -3% -    4%) 0.335
                    HighSpanNear        0.78      (3.0%)        0.79      (2.4%)    0.7% (  -4% -    6%) 0.405
                     MedSpanNear        8.11      (3.2%)        8.19      (3.0%)    0.9% (  -5% -    7%) 0.345
                HighSloppyPhrase        5.66      (3.6%)        5.71      (4.6%)    1.0% (  -6% -    9%) 0.437
                      AndHighMed       51.51      (4.1%)       52.04      (3.9%)    1.0% (  -6% -    9%) 0.410
                       OrHighLow      130.83      (3.1%)      132.29      (2.8%)    1.1% (  -4% -    7%) 0.233
                    OrHighNotMed      142.66      (5.4%)      144.34      (5.4%)    1.2% (  -9% -   12%) 0.488
                         MedTerm      340.68      (3.7%)      344.79      (4.1%)    1.2% (  -6% -    9%) 0.329
            HighIntervalsOrdered        6.20      (6.4%)        6.28      (5.6%)    1.2% ( -10% -   14%) 0.520
             MedIntervalsOrdered        6.07      (9.6%)        6.18      (8.4%)    1.9% ( -14% -   21%) 0.505

@gf2121 gf2121 marked this pull request as draft October 13, 2023 04:43
@gf2121 (Contributor, Author) commented Oct 13, 2023

2000 PK task repeats per JVM:

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                        PKLookup      128.75      (3.0%)      122.15      (2.6%)   -5.1% ( -10% -    0%) 0.000

@gf2121 (Contributor, Author) commented Oct 13, 2023

I really cannot understand how this version of readMSBVLong can be slower than readVLong once the JVM is warmed up enough. I wonder if there could be some higher-level reason causing this regression, e.g. we need to do more output adds during search because we share more output prefixes on the writing side?

@gf2121 (Contributor, Author) commented Oct 13, 2023

Maybe this is the TODO we need to resolve.

@gf2121 (Contributor, Author) commented Oct 16, 2023

I made some effort to speed up the add operation for BytesRef outputs and got a tiny improvement:

Baseline: after #12631; Candidate: this patch; Luceneutil Random seed = 0

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                          Fuzzy1       50.96      (1.1%)       51.31      (1.6%)    0.7% (  -1% -    3%) 0.114
                        Wildcard       58.12      (2.2%)       58.55      (2.1%)    0.8% (  -3% -    5%) 0.267
                         Respell       32.20      (1.7%)       32.51      (1.3%)    1.0% (  -2% -    4%) 0.044
                          Fuzzy2       36.56      (1.2%)       37.00      (1.2%)    1.2% (  -1% -    3%) 0.002
                        PKLookup      106.36      (2.0%)      108.68      (2.4%)    2.2% (  -2% -    6%) 0.002

But this is still far slower than before the original patch:

Baseline: before #12631; Candidate: this patch; Luceneutil Random seed = 0

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                          Fuzzy1       53.95      (1.4%)       51.22      (1.3%)   -5.1% (  -7% -   -2%) 0.000
                        PKLookup      114.52      (2.5%)      108.73      (2.7%)   -5.0% (  -9% -    0%) 0.000
                          Fuzzy2       38.19      (1.2%)       36.89      (1.3%)   -3.4% (  -5% -    0%) 0.000
                         Respell       33.41      (2.2%)       32.37      (2.4%)   -3.1% (  -7% -    1%) 0.000
                        Wildcard       59.00      (2.6%)       58.15      (2.5%)   -1.4% (  -6% -    3%) 0.072

@jpountz (Contributor) commented Oct 16, 2023

If we're specializing the format anyway, I wonder if we could try different layouts. E.g. another option could be to encode the number of supplementary bytes using unary coding (like UTF8), which should help reduce the number of (hardly predictable) branches and maybe make decoding faster?
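For illustration, one such layout could look like the sketch below: a UTF-8-style header byte whose leading 1-bits announce the number of supplementary bytes, so the decoder learns the total length from a single byte instead of checking a continuation bit per byte. This is only a sketch of the idea with made-up names, not an encoding Lucene actually uses:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

final class UnaryPrefixedVLong {

  // Writes v with a UTF-8-style header: k leading 1-bits (terminated by a 0-bit)
  // announce k supplementary bytes; the header's remaining bits hold the top bits
  // of the value, and the supplementary bytes follow most-significant first.
  static void write(DataOutput out, long v) throws IOException {
    int bits = 64 - Long.numberOfLeadingZeros(v | 1L); // significant bits, at least 1
    int k = Math.min(8, (bits - 1) / 7);               // supplementary bytes needed
    if (k == 8) {
      out.writeByte(0xFF);                             // 8 leading ones: a full 64-bit value follows
      out.writeLong(v);
      return;
    }
    int header = (0xFF << (8 - k)) & 0xFF;             // k leading 1-bits, then a 0-bit
    header |= (int) (v >>> (8 * k)) & ((1 << (7 - k)) - 1);
    out.writeByte(header);
    for (int i = k - 1; i >= 0; i--) {                 // remaining bytes, most significant first
      out.writeByte((int) (v >>> (8 * i)) & 0xFF);
    }
  }

  // Reads a value written by write(): the header alone tells us how many bytes
  // follow, so there is no per-byte continuation check.
  static long read(DataInput in) throws IOException {
    int header = in.readByte() & 0xFF;
    int k = Integer.numberOfLeadingZeros(~(header << 24)); // count of leading 1-bits
    if (k == 8) {
      return in.readLong();
    }
    long v = header & ((1 << (7 - k)) - 1);             // payload bits kept in the header
    for (int i = 0; i < k; i++) {
      v = (v << 8) | (in.readByte() & 0xFF);
    }
    return v;
  }
}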

@gf2121 (Contributor, Author) commented Oct 17, 2023

Hi @jpountz, thanks a lot for the suggestion!

another option could be to encode the number of supplementary bytes using unary coding (like UTF8).

This is a great idea that would probably make readMSBVLong faster!

FYI, the direction I'm considering is that it's not "decoding the MSB VLong" that causes this regression, but rather "how the MSB VLong changes the FST structure":

  • For LSB VLong outputs, most or all of the bytes are stored in a single arc.
  • For MSB VLong outputs, bytes are split across many arcs for prefix sharing.

So we need to do more Outputs#read and Outputs#add calls for MSB VLong to get the whole output. Here is a comparison of call counts between LSB VLong (before #12631) and MSB VLong (after #12631):

                     LSB VLong    MSB VLong    diff
Outputs#read calls      116097       149803    +29.03%
Outputs#add calls          144       111568    +77377.78%

Unfortunately, ByteSequenceOutputs#add and ByteSequenceOutputs#read always need to construct new BytesRef objects, which is not efficient enough. This patch tried to speed up ByteSequenceOutputs#add a bit, getting the tiny improvement mentioned above. But we are still seeing the regression because add is still needed, while the original code simply ignored the NO_OUTPUT arcs (see the sketch below for roughly what each add has to do).
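For illustration, this is roughly the work a concatenating add has to do for every arc on the path; a simplified sketch, not the exact ByteSequenceOutputs code:

import org.apache.lucene.util.BytesRef;

// Simplified sketch: each add allocates a fresh array and copies both operands,
// so doing this once per matched arc quickly adds up.
static BytesRef addAllocating(BytesRef prefix, BytesRef output) {
  byte[] merged = new byte[prefix.length + output.length];
  System.arraycopy(prefix.bytes, prefix.offset, merged, 0, prefix.length);
  System.arraycopy(output.bytes, output.offset, merged, prefix.length, output.length);
  return new BytesRef(merged);
}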

So I'm not very sure that optimizing the output decoding can resolve the regression, as it does not look like the bottleneck to me, but I'd like to give it a try if you still think it is worthwhile :)

@jpountz (Contributor) commented Oct 17, 2023

Ohhhhh, your explanation makes sense, and I agree with you that a more efficient encoding would be unlikely to counterbalance the fact that more arcs need to be read per output. So this looks like a hard search/space trade-off: we either get fast reads or good compression, but we can't get both?

I wonder if extending the Outputs class directly would help, instead of storing data in an opaque byte[]?

@gf2121 (Contributor, Author) commented Oct 17, 2023

I wonder if extending the Outputs class directly would help, instead of storing data in an opaque byte[]?

Yes, reuse is exactly what Outputs wants to do! (See this TODO.) However, it seems that by contract Outputs#add should not modify the input objects (this is not declared in the javadoc but is assumed by callers). I was thinking about adding a new method to Outputs like the following, WDYT?

/**
 * Default to {@link Outputs#add}. Impls can override for instance reuse.
 */
public Adder<T> adder(T reuse) {
  return new Adder<>() {

    T result;

    @Override
    public void add(T newOne) {
      result = Outputs.this.add(result, newOne);
    }

    @Override
    public T result() {
      return result;
    }
  };
}

public interface Adder<T> {

  void add(T newOne);

  T result();

}
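A hypothetical caller-side sketch of how such an Adder could be used while walking an FST path (arcOutputsOnPath and scratch are made-up names for illustration; the real call sites would be the term enums accumulating arc outputs):

// Accumulate the outputs of the arcs on a matched path, letting an Outputs
// implementation reuse an internal buffer instead of allocating on every add().
Adder<BytesRef> adder = outputs.adder(scratch);
for (BytesRef arcOutput : arcOutputsOnPath) {
  adder.add(arcOutput);
}
BytesRef accumulated = adder.result();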

@gf2121 gf2121 changed the title from "read MSB VLong in new way" to "Optimize outputs accumulating as MSB VLong outputs sharing more output prefix" on Oct 17, 2023
@gf2121 gf2121 marked this pull request as ready for review October 17, 2023 09:35
@gf2121 (Contributor, Author) commented Oct 17, 2023

Hi @mikemccand, it would be great if you could take a look too :)

@gf2121 (Contributor, Author) commented Oct 17, 2023

It occurred to me that maybe we do not really need to combine all these BytesRefs into a single BytesRef; we can just build a DataInput view over a list of BytesRefs and read from that (see the sketch below).
Luckily, only the encoded MSB VLong can end up split across several arcs in the FST, while the floor data, which usually has a larger length, is guaranteed to be stored contiguously in a single arc.
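A minimal sketch of such a view, with illustrative names rather than the actual patch, and assuming the MSB-first, high-bit-continuation layout from #12631:

import java.util.List;
import org.apache.lucene.util.BytesRef;

// DataInput-style view over a list of BytesRef chunks: arc outputs on the path are
// read in sequence without first concatenating them into a single byte[].
final class BytesRefListInput {
  private final List<BytesRef> chunks;
  private int chunkIndex;
  private int posInChunk; // position relative to the current chunk's offset

  BytesRefListInput(List<BytesRef> chunks) {
    this.chunks = chunks;
  }

  byte readByte() {
    while (posInChunk == chunks.get(chunkIndex).length) { // skip exhausted or empty chunks
      chunkIndex++;
      posInChunk = 0;
    }
    BytesRef current = chunks.get(chunkIndex);
    return current.bytes[current.offset + posInChunk++];
  }

  // 7 value bits per byte, most significant group first; a set high bit means
  // another byte follows (assumed layout).
  long readMSBVLong() {
    long l = 0L;
    while (true) {
      byte b = readByte();
      l = (l << 7) | (b & 0x7FL);
      if ((b & 0x80) == 0) {
        return l;
      }
    }
  }
}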

With this optimization, I'm seeing PKLookup almost get back to its original speed:

Baseline: before #12631;
Candidate: this patch;
Luceneutil Random seed = 0

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                          Fuzzy2       39.46      (1.3%)       37.93      (1.0%)   -3.9% (  -6% -   -1%) 0.000
                         Respell       33.92      (2.3%)       33.04      (1.4%)   -2.6% (  -6% -    1%) 0.000
                        Wildcard       60.14      (2.6%)       58.73      (2.3%)   -2.3% (  -7% -    2%) 0.004
                          Fuzzy1       54.73      (1.5%)       53.62      (1.2%)   -2.0% (  -4% -    0%) 0.000
                        PKLookup      111.96      (2.3%)      111.45      (2.3%)   -0.5% (  -5% -    4%) 0.556

@gf2121 (Contributor, Author) commented Oct 18, 2023

So this looks like a hard search/space trade-off: we either get fast reads or good compression but we can't get both?

IMO, theoretically yes. We ignored some potential optimizations for #add and #read in the past, so we're suffering from regressions larger than the trade-off alone would explain. Once we take care of this, things should get better.

This PR optimizes #add and brings PKLookup back to its original speed (locally). I think it is ready for review. I'll raise another PR to try to optimize #read, which should probably reduce the regression for the Fuzzy queries.

long l = 0L;
while (true) {
  byte b = in.readByte();
gf2121 (Contributor, Author) commented on this change:


I kept this change here as I think it is more readable :)

@gf2121 gf2121 requested a review from jpountz October 18, 2023 06:29
@gf2121 gf2121 changed the title from "Optimize outputs accumulating as MSB VLong outputs sharing more output prefix" to "Optimize outputs accumulating for SegmentTermsEnum and IntersectTermsEnum" on Oct 18, 2023
@gf2121 (Contributor, Author) commented Oct 19, 2023

The direction has changed and this conversation has become too long to follow, so I raised #12699 to summarize it.

@gf2121 gf2121 closed this Oct 19, 2023