Nightly benchmark regression for term dict queries #12659

gf2121 · 2023-10-12T05:19:46Z

Description

I'm seeing regressions of term-dict queries on nightly benchmark:
https://home.apache.org/~mikemccand/lucenebench/2023.10.10.18.03.55.html

Diff commits: 33a3af4...05d26ac

IMO the commit that most probably cause these regressions is: #12631. But Mike helped run luceneutil before merge and result looks good ...

The text was updated successfully, but these errors were encountered:

gf2121 · 2023-10-12T08:11:24Z

I write a JMH benchmark to compare reading vint vs msbvint.

Benchmark                         (size)  (valueBit)   Mode  Cnt   Score   Error   Units
ReadVLongBenchmark.readVLong         128           7  thrpt    5  29.406 ± 0.148  ops/us
ReadVLongBenchmark.readVLong         128          14  thrpt    5   3.644 ± 0.013  ops/us
ReadVLongBenchmark.readVLong         128          21  thrpt    5   2.416 ± 0.004  ops/us
ReadVLongBenchmark.readVLong         128          28  thrpt    5   2.074 ± 0.005  ops/us
ReadVLongBenchmark.readVLong         128          35  thrpt    5   1.838 ± 0.001  ops/us
ReadVLongBenchmark.readVLong         128          42  thrpt    5   1.958 ± 0.004  ops/us
ReadVLongBenchmark.readVLong         128          49  thrpt    5   1.511 ± 0.012  ops/us
ReadVLongBenchmark.readVLong         128          56  thrpt    5   1.463 ± 0.005  ops/us
ReadVLongBenchmark.readVLong         128          63  thrpt    5   1.299 ± 0.092  ops/us
ReadVLongBenchmark.readVLongMSB      128           7  thrpt    5  28.622 ± 0.865  ops/us
ReadVLongBenchmark.readVLongMSB      128          14  thrpt    5   4.479 ± 0.013  ops/us
ReadVLongBenchmark.readVLongMSB      128          21  thrpt    5   3.647 ± 0.007  ops/us
ReadVLongBenchmark.readVLongMSB      128          28  thrpt    5   2.434 ± 0.372  ops/us
ReadVLongBenchmark.readVLongMSB      128          35  thrpt    5   1.885 ± 0.006  ops/us
ReadVLongBenchmark.readVLongMSB      128          42  thrpt    5   1.516 ± 0.005  ops/us
ReadVLongBenchmark.readVLongMSB      128          49  thrpt    5   1.271 ± 0.004  ops/us
ReadVLongBenchmark.readVLongMSB      128          56  thrpt    5   1.095 ± 0.003  ops/us
ReadVLongBenchmark.readVLongMSB      128          63  thrpt    5   0.961 ± 0.002  ops/us

Code

package testing;

import org.apache.lucene.store.ByteArrayDataOutput;
import org.apache.lucene.store.DataOutput;
import org.apache.lucene.util.packed.PackedInts;
import org.openjdk.jmh.annotations.*;

import java.io.IOException;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Benchmark)
@Warmup(iterations = 3, time = 3)
@Measurement(iterations = 5, time = 3)
@Fork(value = 1, jvmArgsPrepend = {"--add-modules=jdk.incubator.vector"})
public class ReadVLongBenchmark {

  private byte[] bytes;
  private byte[] msbBytes;
  private int written;

  @Param({"128"})
  int size;

  @Param({"7", "14", "21", "28", "35", "42", "49", "56", "63"})
  int valueBit;

  @Setup(Level.Trial)
  public void init() throws Exception {
    bytes = new byte[size * (valueBit / 7)];
    msbBytes = new byte[size * (valueBit / 7)];
    ByteArrayDataOutput out = new ByteArrayDataOutput(bytes);
    ByteArrayDataOutput msbOut = new ByteArrayDataOutput(msbBytes);
    written = 0;
    final long min = (1L << valueBit - 1);
    final long max = valueBit == 63 ? Long.MAX_VALUE : (1L << valueBit) - 1;
    while (out.getPosition() < bytes.length - 10) {
      written++;
      long l = ThreadLocalRandom.current().nextLong(min, max);
      if (PackedInts.bitsRequired(l) != valueBit) {
        throw new IllegalStateException();
      }
      out.writeVLong(l);
      writeMSBVLong(l, msbOut);
    }
    if (readVLong() != writeVLongMSB()) {
      throw new IllegalStateException();
    }
  }

  @Benchmark
  public long readVLong() {
    int pos = 0;
    long res = 0;
    for (int iter = 0; iter < written; iter++) {
      byte b = bytes[pos++];
      long i = b & 0x7F;
      for (int shift = 7; (b & 0x80) != 0; shift += 7) {
        b = bytes[pos++];
        i |= (b & 0x7FL) << shift;
      }
      res ^= i;
    }
    return res;
  }

  @Benchmark
  public long readVLongMSB() {
    int pos = 0;
    long res = 0;
    for (int iter = 0; iter < written; iter++) {
      long i = 0L;
      while (true) {
        byte b = msbBytes[pos++];
        i = (i << 7) | (b & 0x7FL);
        if ((b & 0x80) == 0) {
          break;
        }
      }
      res ^= i;
    }
    return res;
  }

  static void writeMSBVLong(long l, DataOutput scratchBytes) throws IOException {
    assert l >= 0;
    // Keep zero bits on most significant byte to have more chance to get prefix bytes shared.
    // e.g. we expect 0x7FFF stored as [0x81, 0xFF, 0x7F] but not [0xFF, 0xFF, 0x40]
    final int bytesNeeded = (Long.SIZE - Long.numberOfLeadingZeros(l) - 1) / 7 + 1;
    l <<= Long.SIZE - bytesNeeded * 7;
    for (int i = 1; i < bytesNeeded; i++) {
      scratchBytes.writeByte((byte) (((l >>> 57) & 0x7FL) | 0x80));
      l = l << 7;
    }
    scratchBytes.writeByte((byte) (((l >>> 57) & 0x7FL)));
  }
}

gf2121 · 2023-10-12T09:21:04Z

I tried another reading way like:

byte b = msbBytes[pos++];
long i = b & 0x7FL;
while (b < 0) {
  b = msbBytes[pos++];
  i = (i << 7) | (b & 0x7FL);
}

Following is the benchmark result for this (readVLongMSBNew)

Comparing readVLongMSB and readVLong, some valueBits are better but others worse.
Comparing readMSBVLongNew and readVLong, readVLongMSBNew win for all valueBit.

I'm considering try the new reading way and see what will happen for nightly benchmark.

Benchmark                           (size)  (valueBit)   Mode  Cnt   Score   Error   Units
ReadVLongBenchmark.readVLong           128           7  thrpt    5  29.333 ± 0.626  ops/us
ReadVLongBenchmark.readVLong           128          14  thrpt    5   3.643 ± 0.026  ops/us
ReadVLongBenchmark.readVLong           128          21  thrpt    5   2.411 ± 0.047  ops/us
ReadVLongBenchmark.readVLong           128          28  thrpt    5   2.075 ± 0.004  ops/us
ReadVLongBenchmark.readVLong           128          35  thrpt    5   1.836 ± 0.002  ops/us
ReadVLongBenchmark.readVLong           128          42  thrpt    5   1.943 ± 0.006  ops/us
ReadVLongBenchmark.readVLong           128          49  thrpt    5   1.514 ± 0.006  ops/us
ReadVLongBenchmark.readVLong           128          56  thrpt    5   1.463 ± 0.002  ops/us
ReadVLongBenchmark.readVLong           128          63  thrpt    5   1.338 ± 0.002  ops/us
ReadVLongBenchmark.readVLongMSB        128           7  thrpt    5  29.350 ± 0.140  ops/us
ReadVLongBenchmark.readVLongMSB        128          14  thrpt    5   4.474 ± 0.025  ops/us
ReadVLongBenchmark.readVLongMSB        128          21  thrpt    5   3.640 ± 0.033  ops/us
ReadVLongBenchmark.readVLongMSB        128          28  thrpt    5   2.439 ± 0.324  ops/us
ReadVLongBenchmark.readVLongMSB        128          35  thrpt    5   1.879 ± 0.009  ops/us
ReadVLongBenchmark.readVLongMSB        128          42  thrpt    5   1.508 ± 0.037  ops/us
ReadVLongBenchmark.readVLongMSB        128          49  thrpt    5   1.268 ± 0.012  ops/us
ReadVLongBenchmark.readVLongMSB        128          56  thrpt    5   1.091 ± 0.024  ops/us
ReadVLongBenchmark.readVLongMSB        128          63  thrpt    5   0.960 ± 0.005  ops/us
ReadVLongBenchmark.readVLongMSBNew     128           7  thrpt    5  31.110 ± 0.254  ops/us
ReadVLongBenchmark.readVLongMSBNew     128          14  thrpt    5   4.075 ± 0.010  ops/us
ReadVLongBenchmark.readVLongMSBNew     128          21  thrpt    5   2.600 ± 0.008  ops/us
ReadVLongBenchmark.readVLongMSBNew     128          28  thrpt    5   2.244 ± 0.006  ops/us
ReadVLongBenchmark.readVLongMSBNew     128          35  thrpt    5   1.946 ± 0.005  ops/us
ReadVLongBenchmark.readVLongMSBNew     128          42  thrpt    5   2.102 ± 0.016  ops/us
ReadVLongBenchmark.readVLongMSBNew     128          49  thrpt    5   1.615 ± 0.209  ops/us
ReadVLongBenchmark.readVLongMSBNew     128          56  thrpt    5   1.614 ± 0.032  ops/us
ReadVLongBenchmark.readVLongMSBNew     128          63  thrpt    5   1.465 ± 0.008  ops/us

CODE

package testing;

import org.apache.lucene.store.ByteArrayDataOutput;
import org.apache.lucene.store.DataOutput;
import org.apache.lucene.util.packed.PackedInts;
import org.openjdk.jmh.annotations.*;

import java.io.IOException;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Benchmark)
@Warmup(iterations = 3, time = 3)
@Measurement(iterations = 5, time = 3)
@Fork(value = 1, jvmArgsPrepend = {"--add-modules=jdk.incubator.vector"})
public class ReadVLongBenchmark {

  private byte[] bytes;
  private byte[] msbBytes;

  @Param({"128"})
  int size;

  @Param({"7", "14", "21", "28", "35", "42", "49", "56", "63"})
  int valueBit;

  @Setup(Level.Trial)
  public void init() throws Exception {
    bytes = new byte[size * (valueBit / 7)];
    msbBytes = new byte[size * (valueBit / 7)];
    ByteArrayDataOutput out = new ByteArrayDataOutput(bytes);
    ByteArrayDataOutput msbOut = new ByteArrayDataOutput(msbBytes);
    final long min = 1L << (valueBit - 1);
    final long max = valueBit == 63 ? Long.MAX_VALUE : (1L << valueBit) - 1;
    for (int i = 0; i < size; i++) {
      long l = ThreadLocalRandom.current().nextLong(min, max);
      if (PackedInts.bitsRequired(l) != valueBit) {
        throw new IllegalStateException();
      }
      out.writeVLong(l);
      writeMSBVLong(l, msbOut);
    }
    if (readVLong() != readVLongMSB()) {
      System.out.println("vlong msb wrong");
      throw new IllegalStateException();
    }
    if (readVLong() != readVLongMSBNew()) {
      System.out.println("vlong new msb wrong");
      throw new IllegalStateException();
    }
  }

  @Benchmark
  public long readVLong() {
    int pos = 0;
    long res = 0;
    for (int iter = 0; iter < size; iter++) {
      byte b = bytes[pos++];
      long i = b & 0x7F;
      for (int shift = 7; (b & 0x80) != 0; shift += 7) {
        b = bytes[pos++];
        i |= (b & 0x7FL) << shift;
      }
      res ^= i;
    }
    return res;
  }

  @Benchmark
  public long readVLongMSB() {
    int pos = 0;
    long res = 0;
    for (int iter = 0; iter < size; iter++) {
      long i = 0L;
      while (true) {
        byte b = msbBytes[pos++];
        i = (i << 7) | (b & 0x7FL);
        if ((b & 0x80) == 0) {
          break;
        }
      }
      res ^= i;
    }
    return res;
  }

  @Benchmark
  public long readVLongMSBNew() {
    int pos = 0;
    long res = 0;
    for (int iter = 0; iter < size; iter++) {
      byte b = msbBytes[pos++];
      long i = b & 0x7FL;
      while (b < 0) {
        b = msbBytes[pos++];
        i = (i << 7) | (b & 0x7FL);
      }
      res ^= i;
    }
    return res;
  }

  static void writeMSBVLong(long l, DataOutput scratchBytes) throws IOException {
    assert l >= 0;
    // Keep zero bits on most significant byte to have more chance to get prefix bytes shared.
    // e.g. we expect 0x7FFF stored as [0x81, 0xFF, 0x7F] but not [0xFF, 0xFF, 0x40]
    final int bytesNeeded = (Long.SIZE - Long.numberOfLeadingZeros(l) - 1) / 7 + 1;
    l <<= Long.SIZE - bytesNeeded * 7;
    for (int i = 1; i < bytesNeeded; i++) {
      scratchBytes.writeByte((byte) (((l >>> 57) & 0x7FL) | 0x80));
      l = l << 7;
    }
    scratchBytes.writeByte((byte) (((l >>> 57) & 0x7FL)));
  }
}

gf2121 · 2023-10-12T09:30:39Z

Another possible reason I'm thinking is that maybe readMSBVLong is not as hot as readVLong so compiler is not optimizing it enough, or the readbyte in statistic context may confuse JVM with the abstraction layer?

gf2121 added the type:enhancement label Oct 12, 2023

gf2121 mentioned this issue Oct 12, 2023

Optimize outputs accumulating for SegmentTermsEnum and IntersectTermsEnum #12661

Closed

gf2121 mentioned this issue Nov 13, 2023

Optimize outputs accumulating for SegmentTermsEnum and IntersectTermsEnum #12699

Merged

gf2121 closed this as completed Jan 4, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Nightly benchmark regression for term dict queries #12659

Nightly benchmark regression for term dict queries #12659

gf2121 commented Oct 12, 2023 •

edited

Loading

gf2121 commented Oct 12, 2023 •

edited

Loading

gf2121 commented Oct 12, 2023

gf2121 commented Oct 12, 2023

Nightly benchmark regression for term dict queries #12659

Nightly benchmark regression for term dict queries #12659

Comments

gf2121 commented Oct 12, 2023 • edited Loading

Description

gf2121 commented Oct 12, 2023 • edited Loading

gf2121 commented Oct 12, 2023

gf2121 commented Oct 12, 2023

gf2121 commented Oct 12, 2023 •

edited

Loading

gf2121 commented Oct 12, 2023 •

edited

Loading