Optimize count() for BooleanQuery disjunction #12358

Open
mikemccand opened this issue Jun 8, 2023 · 34 comments

Comments

@mikemccand
Member

Description

Context: we (Amazon's customer-facing product search team, and also AWS) are attempting to understand the amazing performance Tantivy (a Rust search engine) has over Lucene, iterating in this GitHub repo. That repo is sort of a merger of Lucene's benchmarking code (luceneutil), including its tasks and enwiki corpus, and the open source Tantivy benchmark. Tantivy is impressively fast :)

This issue is a spinoff from this fascinating comment by @fulmicoton, creator and maintainer of Tantivy.

Tantivy optimizes count() for BooleanQuery disjunctions much like Lucene's BooleanScorer, by scoring in a windowed bitset of N docs at once, and then pop-counting the set bits in each window. This is not technically a sub-linear implementation: it is still linear, but I suspect with a smaller constant factor than the default count() fallback Lucene implements.

Perhaps, for all cases where BooleanQuery uses the windowed BooleanScorer, we could also implement this count() optimization.
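
To make the windowed pop-count idea concrete, here is a minimal sketch of how such a count could work, assuming a hypothetical countDisjunction helper (illustrative Java, not actual Lucene or Tantivy code):

import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import org.apache.lucene.search.DocIdSetIterator;

// Hypothetical sketch: count a disjunction by OR-ing every clause's matches into a
// per-window bit set and pop-counting it, instead of collecting hits one by one.
static int countDisjunction(List<DocIdSetIterator> clauses, int maxDoc) throws IOException {
  final int WINDOW = 1 << 12;             // 4096 docs per window; size is arbitrary for the sketch
  long[] bits = new long[WINDOW >> 6];    // one bit per doc in the window
  int count = 0;
  for (int windowBase = 0; windowBase < maxDoc; windowBase += WINDOW) {
    Arrays.fill(bits, 0L);
    int windowEnd = windowBase + WINDOW;
    for (DocIdSetIterator it : clauses) {
      int doc = it.docID() < windowBase ? it.advance(windowBase) : it.docID();
      for (; doc < windowEnd; doc = it.nextDoc()) {
        int rel = doc - windowBase;
        bits[rel >> 6] |= 1L << rel;      // record the match in the window's bit set
      }
    }
    for (long word : bits) {
      count += Long.bitCount(word);       // pop-count each 64-bit word
    }
  }
  return count;
}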

From my read of Lucene's BooleanWeight.count, I don't think Lucene has this optimization? Maybe we should port over Tantivy's optimization? It should make disjunctive counting quite a bit faster?

@jpountz
Contributor

jpountz commented Jun 9, 2023

From my read of Lucene's BooleanWeight.count, I don't think Lucene has this optimization? Maybe we should port over Tantivy's optimization? It should make disjunctive counting quite a bit faster?

BooleanWeight#count doesn't allow doing this kind of thing currently, because its contract is that it should run in constant time or so, so that you can compute its result and then ignore it if you can't actually make use of it. E.g. for conjunctions, we call BooleanWeight#count over all clauses one by one and stop when more than one clause returns a result that is different from numDocs, and then we fall back to regular counting with TotalHitCountCollector.

+1 to update the Weight#count API contract or introduce a new API to better optimize counting.
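
For illustration, the calling pattern this contract enables looks roughly like this (a simplified sketch assuming weight, leafContext, searcher, and query are already in scope; not the exact Lucene source):

// Simplified sketch of the constant-time contract from the caller's side.
int count = weight.count(leafContext);    // must be cheap: either the exact count or -1
if (count == -1) {
  // No constant-time answer available: fall back to counting every hit.
  TotalHitCountCollector collector = new TotalHitCountCollector();
  searcher.search(query, collector);
  count = collector.getTotalHits();
}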

@uschindler
Contributor

uschindler commented Jun 9, 2023

I checked the repository. When Tantitvy says it is 2 times faster then this is also unfair: The benchmark does not warm up the JVM. It runs the list of queries in cold, unoptimized state - COLD!

Very unfair, sorry. See code:

So basically it runs the queries on a cold JVM so the first queries in the list are all running in interpreted mode, especially when the new vector features or mmapdir are used.

Did I overlook something?

@uschindler
Contributor

OK, it seems to have some warmup in the python code, but I did not find the number of warmup queries.

@fulmicoton

When Tantitvy says it is 2 times faster then this is also unfair

It's called tantivy. We do have a warmup phase.

Very unfair, sorry.
Did I overlook something?

Yes, you did overlook something.

@uschindler
Contributor

I found the python code (see my 2nd comment), but where's the number of warmup queries?

@fulmicoton

It is given as an environment variable. You can find the default value in the Makefile. It is 1 round.

1 round means run the entire benchmark once.
The bench does not run queries in an AAAAABBBBBCCCC order but in an ABCABCABC order.
So one round means running ABC.

Also, the results shown are always the minimum of the n runs, not the mean.

@uschindler
Contributor

uschindler commented Jun 10, 2023

OK, thanks, so it executes all queries once and then starts over in the same JVM? I hope it does not start a new JVM each round.

The default config of Mike's luceneutil benchmark is also not optimal in that respect and needs tuning (there are some discussions in recent commits about vector support and mmap). Especially with the brand new optimizations in recent Lucene 9.7 commits, it gets more and more important to give Lucene enough time to warm itself up. Hotspot is very slow at applying those optimizations (that is what Rust does better), and due to the incubation state of the vector optimizations, the garbage collector has a lot to do in the early phases. So it is important to let Lucene run for a few minutes and let GC clean up the "optimization garbage" before measuring starts in the same JVM. If you start a new JVM, you have to wait out the warmup time again. Unfortunately, the number of queries is not the best way to measure this; warmup should be based on a mix of query count and execution time.

The problem with many Lucene tests on the web is that they start measuring after only 100 or 200 queries, or restart JVMs. So please excuse my complaint: the pure Java code did not show how it warms up, as that was hidden behind the python code. Still the mix of several languages makes it hard to understand.

@fulmicoton

Yes it is the same JVM.
We can try to increase the warmup iterations to let it run for one hour, and see if the results change?

Also, do you have an idea of what is the best JVM and the best settings we should use?

@mikemccand
Member Author

mikemccand commented Jun 10, 2023

@uschindler I think the benchmark is quite fair at this point, from what I can tell -- we've been iterating on that repo, working to pull over luceneutil's enwiki corpus + tasks, ensuring the returned top N hits are very nearly identical, ensuring warmup + N iterations with the fastest run taken, measuring time on the server not the client, running both on a single segment index (makes results more comparable but not necessarily realistic for production cases), sprinkling 2% deletes on both indices, statically sorting both indices by the same id field, making max token length 256 (not 255) across both engines, ensuring tantivy is also splitting on full Unicode whitespace not just ASCII, etc.

Many thanks to @Tony-X for pushing so hard on this.

It does indeed run a single JVM, but I am of the opposite opinion -- we need to run multiple JVM iterations to see how much hotspot noise is hurting. Darned java/hotspot coddling! The Rust results look quite consistent from one run to the next.

It's also simple to run -- fork/clone the repo and follow the README steps. I'd love to see results from more diverse hardware than just my home beast boxes lol.

Still the mix of several languages makes it hard to understand.

Ha! Java and Python and Rust! (Edit: oh and also Makefile!!)

Start at client.py -- I think it's quite understandable if you launch off from there. More so than luceneutil!!

When Tantitvy says it is 2 times faster then this is also unfair

It's called tantivy. We do have a warmup phase.

Oh -- sorry -- you mean no capitalization? So tantivy not Tantivy? I'll try to spell it correctly going forward.

@mikemccand
Member Author

BooleanWeight#count doesn't allow doing this kind of thing currently, because its contract is that it should run in constant time or so, so that you can compute its result and then ignore it if you can't actually make use of it. E.g. for conjunctions, we call BooleanWeight#count over all clauses one by one and stop when more than one clause returns a result that is different from numDocs, and then we fall back to regular counting with TotalHitCountCollector.

Ahh -- gotcha -- the impl is only allowed to run if it will be constant time; else it returns -1 (also in constant time).

OK so it's a bit tricky to add this opto to Lucene ... we need an API change of some sort.

It's as if, when IndexSearcher.count realizes it must fall back to counting each hit individually, it should then maybe call another API, which BQ would implement, to try to do that count more efficiently. Hmm.

@mikemccand
Member Author

Also, do you have an idea of what is the best JVM and the best settings we should use?

I think @uschindler is referring to the recent changes to let Lucene use Panama (rough SIMD access from way up in javaland) if the user opts in at runtime. We could test this (add --add-modules jdk.incubator.vector) but I suspect it's unlikely to help since so far Lucene only uses Panama for KNN search, I think?

We should fix the JVM heap size I suppose -- I'll try that to see if it alters results.

@uschindler
Contributor

uschindler commented Jun 10, 2023

Hi. It looks like the DoQuery.java code does not do a parallel throughput measurement; instead it runs all queries in a single thread, one after the other, with nanoTime before and after (thanks for the fix, Mike). So we measure exactly the duration of each query, and we should therefore also use ParallelGC. The default G1GC works better when you hammer a server multithreaded, but if there's only one thread doing queries, ParallelGC is better.

Of course a real world benchmark should also measure throughput by hammering a server with hundreds of parallel queries (many more than there are CPU cores) to saturate all CPU cores. Of course, in such throughput scenarios I have sometimes seen single queries taking a long time, but then you need to also look at percentiles.

I know that Lucene is very good at throughput measurements.

I know this comment goes beyond this issue, but we should really look at other scenarios than just measuring the duration a single query takes.

About vectors: this does not apply here because there's no vector search involved. Still, with a modern Java version like JDK 20, the warmup time in combination with ParallelGC is higher due to tiered compilation.

Please use Java 20 for benchmarks to also see the benefits from mmap, especially with indexes optimized down to one segment. Also enable ParallelGC; it's not real world, but neither is the benchmark.

Please do not pass any extra JVM args, except GC and heap size.
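
For example, an invocation along those lines could look like this (the heap size and classpath are placeholders, not recommendations):

java -XX:+UseParallelGC -Xms8g -Xmx8g -cp <benchmark-classpath> DoQuery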

@mikemccand
Member Author

The default G1GC works better when you hammer a server multithreaded, but if there's only one thread doing queries, ParallelGC is better.

Ahh great catch @uschindler! I just made that change in the benchy.

Hi. It looks like the DoQuery.java code does not do a parallel throughput measurement; instead it runs all queries in a single thread, one after the other, with nanoTime before and after (thanks for the fix, Mike)

Right -- we are only measuring single threaded query latency (just like luceneutil).

I agree a red-line test (saturate CPU) is also really important -- we have an issue open for this. It's a bit trickier to do...

Note that neither tantivy nor Lucene is using more than one thread for each query, nor any OS level concurrency like async IO (I think?), so the "lightly loaded single query latency" metric should be fair to compare across both engines.

@uschindler
Contributor

Here it is also missing, that's the code I looked at: https://github.com/Tony-X/search-benchmark-game/blob/4402d42c906830e85d8d79a30ae776f204ade770/engines/lucene-9.5.0/query.sh

@fulmicoton

Oh -- sorry -- you mean no capitalization? So tantivy not Tantivy? I'll try to spell it correctly going forward.

@mikemccand No no. There is no specific capitalization :). I was correcting tantivity -> tantivy.

@uschindler
Contributor

Oh -- sorry -- you mean no capitalization? So tantivy not Tantivy? I'll try to spell it correctly going forward.

@mikemccand No no. There is no specific capitalization :). I was correcting tantivity -> tantivy.

Too late in the evening and on a smartphone. Whatever Google decided to be the correct term :-)

@mikemccand
Member Author

Oh -- sorry -- you mean no capitalization? So tantivy not Tantivy? I'll try to spell it correctly going forward.

@mikemccand No no. There is no specific capitalization :). I was correcting tantivity -> tantivy.

OH! Ha, OK, phew :)

@mikemccand
Member Author

Here it is also missing, that's the code I looked at: https://github.com/Tony-X/search-benchmark-game/blob/4402d42c906830e85d8d79a30ae776f204ade770/engines/lucene-9.5.0/query.sh

Oh yeah that does look scary at first, but that is the sh script to start up the "server" that handles many queries through pipes. It is how client.py launches Lucene's server to run its benchmarks (many queries in one exe invocation).

It's invoked from Lucene's Makefile in that same directory for the make serve target.

@uschindler
Contributor

Please use Java 20 for benchmarks to also see the benefits from mmap, especially with indexes optimized down to one segment. Also enable ParallelGC; it's not real world, but neither is the benchmark.

Small correction as I noticed you are on Lucene 9.5:

  • for Lucene 9.5, use Java 19
  • for Lucene 9.6, use Java 20

@fulmicoton

fulmicoton commented Jun 10, 2023

Of course a real world benchmark should also measure throughput by hammering a server with hundreds of parallel queries (many more than there are CPU cores) to saturate all CPU cores.

The goal of the original benchmark is to identify performance headroom in tantivy.
That's why it measures different kinds of queries, does not deal with multithreading, works over a single segment, and compares "min" and not "mean".
"min" removes a lot of the noise and is for this reason a bit generous to Java...
but it is much better at giving us the right insights.

BTW I suspect changing the GC won't have any effect on the min statistics, but it does not hurt to try.

I'd love to see a fair bench of multithreaded search throughput / indexing of course.
Making it fair is however much trickier.

@uschindler
Contributor

uschindler commented Jun 10, 2023

BTW I suspect changing the GC won't have any effect on the min statistics, but it does not hurt to try.

The problem is that Java's default GC, G1, brings up to a 10% decrease in speed due to the addition of barriers, but it cleans up garbage much faster with shorter pauses. This is fine for multithreaded apps, but not for single threaded apps doing wall clock measurements.

@Tony-X
Contributor

Tony-X commented Jun 10, 2023

Just caught up on this thread -- the design tenet of the current benchmark game is to measure time taken to do the same work in a contention-free environment.

As of now I'm still trying to build trust in the benchmarks, so thank you for your evaluation and feedback @uschindler!

So far I believe they are doing the "same" work, as I have chased down a few tokenization issues. Right now the indexes on both sides have --

  • almost "same" tokenization -- split by whitespace and remove tokens with length >= 256
  • same index sort
  • same set of deleted docs (2% in total)
  • single segment

Regarding the JVM, here is what we do now:

  • warm up the JVM with 6.1k queries for each of COUNT and TOP_10_COUNT. We could easily increase the warmup iterations here. As I was typing, I already changed the warmup iterations to 3 and kicked off a run.

Admittedly we haven't looked into playing with different JVM arguments. @mikemccand thanks for creating Tony-X/search-benchmark-game#37 to explore the heap sizes :)

IMO, GC here is less of an issue since we measure the best latency (min) across 10 runs for each query (a slight favor for the JVM). The probability that all 10 of the 10 runs of the same query hit a GC pause is very small.

It would be great if you could share your insights about an optimal JVM setting for this case.

@Tony-X
Contributor

Tony-X commented Jun 10, 2023

https://github.com/Tony-X/search-benchmark-game/blob/a6f71237e84aeb47ee8f5b923f16b760b032e1fc/engines/lucene-9.5.0/query.sh

This is not used in benchmarks. This is more like a debugging tool, e.g. you can run that script and type some queries.

@msokolov
Contributor

No no. There is no specific capitalization :). I was correcting tantivity -> tantivy.

Just to keep everyone accurate: the original misspelling was "tantitvy", not "tantivity"! :)

I checked the repository. When Tantitvy says it is 2 times faster ...

@fulmicoton

fulmicoton commented Jun 14, 2023

@uschindler I updated the bench of the search benchmarks with:

@uschindler
Contributor

Hi, thanks for cross-checking. A one-hour warmup therefore doesn't change anything.

Anyways, I'd use a newer JDK like 20.

@zhaih
Contributor

zhaih commented Jun 16, 2023

After roughly catching up with the thread, I would like to go back to:

+1 to update the Weight#count API contract or introduce a new API to better optimize counting.

So we could probably have two count APIs, constantCount and fastCount?
Then the contract could become: IS (IndexSearcher) will call constantCount first, and then:

  • if it returns -2, there is no possible constant-time count, but a faster-than-normal count is possible, so IS will call fastCount to retrieve the count;
  • if it returns -1, no faster count is possible, so go back to the query and count normally;
  • if it returns a natural number, that is the count.

Does that make sense?
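
A rough sketch of that contract from the caller's side (constantCount and fastCount are the hypothetical methods proposed above; the fallback helper name is made up):

// Illustrative only: the proposed two-method contract as IndexSearcher might use it.
int c = weight.constantCount(context);
if (c >= 0) {
  return c;                                        // exact count, obtained in ~constant time
} else if (c == -2) {
  return weight.fastCount(context);                // linear, but cheaper than per-hit collection
} else {                                           // -1: no shortcut available at all
  return countWithTotalHitCountCollector(query);   // hypothetical fallback helper
}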

@jpountz
Contributor

jpountz commented Jun 19, 2023

I was wondering about a possible alternative approach that would consist of handling this optimization more at the collector level. E.g. we could add a new method to LeafCollector to take a set of doc IDs (pseudo code):

void collect(DocIdSet set) {
  // default implementation
  for (int docID : set) {
    collect(docID);
  }
}

Then BooleanScorer could pass a DocIdSet to the collector for each window, and TotalHitCountCollector could override the above method to instead do something like this:

void collect(DocIdSet set) {
  // total hit count implementation
  count += set.size(); // internally uses Long#bitCount rather than iterating bits one by one
}

@mikemccand
Member Author

E.g. we could add a new method to LeafCollector to take a set of doc IDs (pseudo code):

I like this idea! Sort of like BulkScorer, which can score/collect N docs at a time.

@zhaih
Contributor

zhaih commented Jun 20, 2023

we could add a new method to LeafCollector to take a set of doc IDs (pseudo code)

+1, I think this is a better/cleaner approach

@jpountz
Contributor

jpountz commented Jun 20, 2023

I like it better too but I think it's going to need more work, e.g. we will need to remove the Scorable#docID() API to avoid running into cases when the doc ID passed to LeafCollector#collect is not the same as the doc ID exposed by Scorable#docID()?

@zhaih
Contributor

zhaih commented Jun 28, 2023

Maybe we need a BulkScorable or something that holds multiple Scorables (or just an array of scores), and set the contract that collect(DocIdSet) should use BulkScorable rather than a normal Scorable? (This sounds a little bit hacky though...)

jpountz added a commit to jpountz/lucene that referenced this issue Jun 30, 2023
`Scorable#docID()` exposes the document that is being collected, which makes it
impossible to bulk-collect multiple documents at once.

Relates apache#12358
@jpountz
Contributor

jpountz commented Jun 30, 2023

It felt to me like it would be doable to remove Scorable#docID without introducing a replacement, so I gave it a try at #12407.

jpountz added a commit that referenced this issue Jul 5, 2023
`Scorable#docID()` exposes the document that is being collected, which makes it
impossible to bulk-collect multiple documents at once.

Relates #12358
jpountz added a commit to jpountz/lucene that referenced this issue Jul 5, 2023
This introduces `LeafCollector#collect(DocIdStream)` to enable collectors to
collect batches of doc IDs at once. `BooleanScorer` takes advantage of this by
creating a `DocIdStream` whose `count()` method counts the number of bits that
are set in the bit set of matches in the current window, instead of naively
iterating over all matches.

On wikimedium10m, this yields a ~20% speedup when counting hits for the `title
OR 12` query (2.9M hits).

Relates apache#12358
@jpountz
Contributor

jpountz commented Jul 5, 2023

I opened a proof of concept for the idea that I suggested above at #12415.
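
For reference, here is a hedged sketch of how a counting collector can take advantage of that bulk hook, following the `LeafCollector#collect(DocIdStream)` / `DocIdStream#count()` shape described in the commit message above (treat the exact signatures as illustrative):

import java.io.IOException;
import org.apache.lucene.search.DocIdStream;
import org.apache.lucene.search.LeafCollector;
import org.apache.lucene.search.Scorable;

// Illustrative sketch: a counting LeafCollector that overrides the bulk hook so each
// window of matches is counted via DocIdStream#count() instead of per-doc collect calls.
final class CountingLeafCollector implements LeafCollector {
  int count;

  @Override
  public void setScorer(Scorable scorer) {
    // scores are not needed for counting
  }

  @Override
  public void collect(int doc) {
    count++;                         // one-by-one fallback path
  }

  @Override
  public void collect(DocIdStream stream) throws IOException {
    count += stream.count();         // counts the set bits in the window's bit set
  }
}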

jpountz added a commit that referenced this issue Aug 11, 2023
This introduces `LeafCollector#collect(DocIdStream)` to enable collectors to
collect batches of doc IDs at once. `BooleanScorer` takes advantage of this by
creating a `DocIdStream` whose `count()` method counts the number of bits that
are set in the bit set of matches in the current window, instead of naively
iterating over all matches.

On wikimedium10m, this yields a ~20% speedup when counting hits for the `title
OR 12` query (2.9M hits).

Relates #12358
jpountz added a commit that referenced this issue Aug 11, 2023
This introduces `LeafCollector#collect(DocIdStream)` to enable collectors to
collect batches of doc IDs at once. `BooleanScorer` takes advantage of this by
creating a `DocIdStream` whose `count()` method counts the number of bits that
are set in the bit set of matches in the current window, instead of naively
iterating over all matches.

On wikimedium10m, this yields a ~20% speedup when counting hits for the `title
OR 12` query (2.9M hits).

Relates #12358