Optimize count() for BooleanQuery disjunction #12358

Open
mikemccand opened this issue Jun 8, 2023 · 34 comments

Comments

@mikemccand
Member

Description

Context: we (Amazon's customer-facing product search team, and also AWS) are attempting to understand the amazing performance Tantivy (a Rust search engine) has over Lucene, iterating in this GitHub repo. That repo is sort of a merger of Lucene's benchmarking code (luceneutil), including its tasks and enwiki corpus, and the open source Tantivy benchmark. Tantivy is impressively fast :)

This issue is a spinoff from this fascinating comment by @fulmicoton, creator and maintainer of Tantivy.

Tantivy optimizes count() for BooleanQuery disjunctions much like Lucene's BooleanScorer, by scoring in a windowed bitset of N docs at once, and then pop-counting the set bits in each window. This is not technically a sub-linear implementation: it is still linear, but I suspect with a smaller constant factor than the default count() fallback Lucene implements.

Perhaps, for all cases where BooleanQuery uses the windowed BooleanScorer, we could also implement this count() optimization.
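
To make the windowed pop-count idea concrete, here is a minimal sketch of how such a count could work, assuming a hypothetical countDisjunction helper (illustrative Java, not actual Lucene or Tantivy code):

import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import org.apache.lucene.search.DocIdSetIterator;

// Hypothetical sketch: count a disjunction by OR-ing every clause's matches into a
// per-window bit set and pop-counting it, instead of collecting hits one by one.
static int countDisjunction(List<DocIdSetIterator> clauses, int maxDoc) throws IOException {
  final int WINDOW = 1 << 12;             // 4096 docs per window; size is arbitrary for the sketch
  long[] bits = new long[WINDOW >> 6];    // one bit per doc in the window
  int count = 0;
  for (int windowBase = 0; windowBase < maxDoc; windowBase += WINDOW) {
    Arrays.fill(bits, 0L);
    int windowEnd = windowBase + WINDOW;
    for (DocIdSetIterator it : clauses) {
      int doc = it.docID() < windowBase ? it.advance(windowBase) : it.docID();
      for (; doc < windowEnd; doc = it.nextDoc()) {
        int rel = doc - windowBase;
        bits[rel >> 6] |= 1L << rel;      // record the match in the window's bit set
      }
    }
    for (long word : bits) {
      count += Long.bitCount(word);       // pop-count each 64-bit word
    }
  }
  return count;
}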

From my read of Lucene's BooleanWeight.count, I don't think Lucene has this optimization? Maybe we should port over Tantivy's optimization? It should make disjunctive counting quite a bit faster?

@jpountz
Contributor

jpountz commented Jun 9, 2023

From my read of Lucene's BooleanWeight.count, I don't think Lucene has this optimization? Maybe we should port over Tantivy's optimization? It should make disjunctive counting quite a bit faster?

BooleanWeight#count doesn't allow doing this kind of thing currently, because its contract is that it should run in constant time or so, so that you can compute its result and then ignore it if you can't actually make use of it. E.g. for conjunctions, we call BooleanWeight#count over all clauses one by one and stop when more than one clause returns a result that is different from numDocs, and then we fall back to regular counting with TotalHitCountCollector.

+1 to update the Weight#count API contract or introduce a new API to better optimize counting.
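
For illustration, the calling pattern this contract enables looks roughly like this (a simplified sketch assuming weight, leafContext, searcher, and query are already in scope; not the exact Lucene source):

// Simplified sketch of the constant-time contract from the caller's side.
int count = weight.count(leafContext);    // must be cheap: either the exact count or -1
if (count == -1) {
  // No constant-time answer available: fall back to counting every hit.
  TotalHitCountCollector collector = new TotalHitCountCollector();
  searcher.search(query, collector);
  count = collector.getTotalHits();
}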

@uschindler
Contributor

uschindler commented Jun 9, 2023

I checked the repository. When Tantitvy says it is 2 times faster then this is also unfair: The benchmark does not warm up the JVM. It runs the list of queries in cold, unoptimized state - COLD!

Very unfair, sorry. See code:

So basically it runs the queries on a cold JVM so the first queries in the list are all running in interpreted mode, especially when the new vector features or mmapdir are used.

Did I overlook something?

@uschindler
Contributor

OK, it seems to have some warmup in the python code, but I did not find the number of warmup queries.

@fulmicoton

When Tantitvy says it is 2 times faster then this is also unfair

It's called tantivy. We do have a warmup phase.

Very unfair, sorry.
Did I overlook something?

Yes, you did overlook something.

@uschindler
Contributor

I found the python code (see my 2nd comment), but where's the number of warmup queries?

@fulmicoton

It is given as an environment variable. You can find the default value in the Makefile. It is 1 round.

1 round means run the entire benchmark once.
The bench does not run queries in an AAAAABBBBBCCCC order but in an ABCABCABC order.
So one round means running ABC.

Also, the results shown are always the minimum of the n runs, not the mean.

@uschindler
Contributor

uschindler commented Jun 10, 2023

OK, thanks, so it executes all queries once and then starts over in the same JVM? I hope it does not start a new JVM each round.

The default config of Mike's luceneutil benchmark is also not optimal in that respect and needs tuning (there are some discussions in recent commits about vector support and mmap). Especially with the brand new optimizations in recent Lucene 9.7 commits, it gets more and more important to give Lucene enough time to warm itself up. Hotspot is very slow at applying those optimizations (that is what Rust does better), and due to the incubation state of the vector optimizations, the garbage collector has a lot to do in the early phases. So it is important to let Lucene run for a few minutes and let GC clean up the "optimization garbage" before measuring starts in the same JVM. If you start a new JVM, you have to wait out the warmup time again. Unfortunately, the number of queries is not the best way to measure this; warmup should be based on a mix of query count and execution time.

The problem with many Lucene tests on the web is that they start measuring after only 100 or 200 queries, or restart JVMs. So please excuse my complaint: the pure Java code did not show how it warms up, as that was hidden behind the python code. Still the mix of several languages makes it hard to understand.

@fulmicoton

Yes it is the same JVM.
We can try to increase the warmup iterations to let it run for one hour, and see if the results change?

Also, do you have an idea of what is the best JVM and the best settings we should use?

@mikemccand
Member Author

mikemccand commented Jun 10, 2023

@uschindler I think the benchmark is quite fair at this point, from what I can tell -- we've been iterating on that repo, working to pull over luceneutil's enwiki corpus + tasks, ensuring the returned top N hits are very nearly identical, ensuring warmup + N iterations with the fastest run taken, measuring time on the server not the client, running both on a single segment index (makes results more comparable but not necessarily realistic for production cases), sprinkling 2% deletes on both indices, statically sorting both indices by the same id field, making max token length 256 (not 255) across both engines, ensuring tantivy is also splitting on full Unicode whitespace not just ASCII, etc.

Many thanks to @Tony-X for pushing so hard on this.

It does indeed run a single JVM, but I am of the opposite opinion -- we need to run multiple JVM iterations to see how much hotspot noise is hurting. Darned java/hotspot coddling! The Rust results look quite consistent from one run to the next.

It's also simple to run -- fork/clone the repo and follow the README steps. I'd love to see results from more diverse hardware than just my home beast boxes lol.

Still the mix of several languages makes it hard to understand.

Ha! Java and Python and Rust! (Edit: oh and also Makefile!!)

Start at client.py -- I think it's quite understandable if you launch off from there. More so than luceneutil!!

When Tantitvy says it is 2 times faster then this is also unfair

It's called tantivy. We do have a warmup phase.

Oh -- sorry -- you mean no capitalization? So tantivy not Tantivy? I'll try to spell it correctly going forward.

@mikemccand
Member Author

BooleanWeight#count doesn't allow doing this kind of thing currently, because its contract is that it should run in constant time or so, so that you can compute its result and then ignore it if you can't actually make use of it. E.g. for conjunctions, we call BooleanWeight#count over all clauses one by one and stop when more than one clause returns a result that is different from numDocs, and then we fall back to regular counting with TotalHitCountCollector.

Ahh -- gotcha -- the impl is only allowed to run if it will be constant time; else it returns -1 (also in constant time).

OK so it's a bit tricky to add this opto to Lucene ... we need an API change of some sort.

It's as if, when IndexSearcher.count realizes it must fall back to counting each hit individually, it should then maybe call another API, which BQ would implement, to try to do that count more efficiently. Hmm.

@mikemccand
Member Author

Also, do you have an idea of what is the best JVM and the best settings we should use?

I think @uschindler is referring to the recent changes to let Lucene use Panama (rough SIMD access from way up in javaland) if the user opts in at runtime. We could test this (add --add-modules jdk.incubator.vector) but I suspect it's unlikely to help since so far Lucene only uses Panama for KNN search, I think?

We should fix the JVM heap size I suppose -- I'll try that to see if it alters results.

@uschindler
Contributor

uschindler commented Jun 10, 2023

Hi. It looks like the DoQuery.java code does not do a parallel throughput measurement; instead it runs all queries in a single thread, one after the other, with nanoTime before and after (thanks for the fix, Mike). So we measure exactly the duration of each query, and we should therefore also use ParallelGC. The default G1GC works better when you hammer a server multithreaded, but if there's only one thread doing queries, ParallelGC is better.

Of course a real world benchmark should also measure throughput by hammering a server with hundreds of parallel queries (many more than there are CPU cores) to saturate all CPU cores. Of course, in such throughput scenarios I have sometimes seen single queries taking a long time, but then you need to also look at percentiles.

I know that Lucene is very good at throughput measurements.

I know this comment goes beyond this issue, but we should really look at other scenarios than just measuring the duration a single query takes.

About vectors: this does not apply here because there's no vector search involved. Still, with a modern Java version like JDK 20, the warmup time in combination with ParallelGC is higher due to tiered compilation.

Please use Java 20 for benchmarks to also see the benefits from mmap, especially with indexes optimized down to one segment. Also enable ParallelGC; it's not real world, but neither is the benchmark.

Please do not pass any extra JVM args, except GC and heap size.
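
For example, an invocation along those lines could look like this (the heap size and classpath are placeholders, not recommendations):

java -XX:+UseParallelGC -Xms8g -Xmx8g -cp <benchmark-classpath> DoQuery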

@mikemccand
Member Author

The default G1GC works better when you hammer a server multithreaded, but if there's only one thread doing queries, ParallelGC is better.

Ahh great catch @uschindler! I just made that change in the benchy.

Hi. It looks like the DoQuery.java code does not do a parallel throughput measurement; instead it runs all queries in a single thread, one after the other, with nanoTime before and after (thanks for the fix, Mike)

Right -- we are only measuring single threaded query latency (just like luceneutil).

I agree a red-line test (saturate CPU) is also really important -- we have an issue open for this. It's a bit trickier to do...

Note that neither tantivy nor Lucene is using more than one thread for each query, nor any OS level concurrency like async IO (I think?), so the "lightly loaded single query latency" metric should be fair to compare across both engines.

@uschindler
Contributor

Here it is also missing, that's the code I looked at: https://github.com/Tony-X/search-benchmark-game/blob/4402d42c906830e85d8d79a30ae776f204ade770/engines/lucene-9.5.0/query.sh

@fulmicoton

Oh -- sorry -- you mean no capitalization? So tantivy not Tantivy? I'll try to spell it correctly going forward.

@mikemccand No no. There is no specific capitalization :). I was correcting tantivity -> tantivy.

@uschindler
Contributor

Oh -- sorry -- you mean no capitalization? So tantivy not Tantivy? I'll try to spell it correctly going forward.

@mikemccand No no. There is no specific capitalization :). I was correcting tantivity -> tantivy.

Too late in the evening and on a smartphone. Whatever Google decided to be the correct term :-)

@mikemccand
Member Author

Oh -- sorry -- you mean no capitalization? So tantivy not Tantivy? I'll try to spell it correctly going forward.

@mikemccand No no. There is no specific capitalization :). I was correcting tantivity -> tantivy.

OH! Ha, OK, phew :)

@mikemccand
Member Author

Here it is also missing, that's the code I looked at: https://github.com/Tony-X/search-benchmark-game/blob/4402d42c906830e85d8d79a30ae776f204ade770/engines/lucene-9.5.0/query.sh

Oh yeah that does look scary at first, but that is the sh script to start up the "server" that handles many queries through pipes. It is how client.py launches Lucene's server to run its benchmarks (many queries in one exe invocation).

It's invoked from Lucene's Makefile in that same directory for the make serve target.

@uschindler
Contributor

Please use Java 20 for benchmarks to also see the benefits from mmap, especially with indexes optimized down to one segment. Also enable ParallelGC; it's not real world, but neither is the benchmark.

Small correction as I noticed you are on Lucene 9.5:

  • for Lucene 9.5, use Java 19
  • for Lucene 9.6, use Java 20

@fulmicoton

fulmicoton commented Jun 10, 2023

Of course a real world benchmark should also measure throughput by hammering a server with hundreds of parallel queries (many more than there are CPU cores) to saturate all CPU cores.

The goal of the original benchmark is to identify performance headroom in tantivy.
That's why it measures different kinds of queries, does not deal with multithreading, works over a single segment, and compares "min" and not "mean".
"min" removes a lot of the noise and is for this reason a bit generous to Java...
but it is much better at giving us the right insights.

BTW I suspect changing the GC won't have any effect on the min statistics, but it does not hurt to try.

I'd love to see a fair bench of multithreaded search throughput / indexing of course.
Making it fair is however much trickier.

@uschindler
Contributor

uschindler commented Jun 10, 2023

BTW I suspect changing the GC won't have any effect on the min statistics, but it does not hurt to try.

The problem is that Java's default GC, G1, brings up to a 10% decrease in speed due to the addition of barriers, but it cleans up garbage much faster with shorter pauses. This is fine for multithreaded apps, but not for single threaded apps doing wall clock measurements.

@Tony-X
Contributor

Tony-X commented Jun 10, 2023

Just caught up on this thread -- the design tenet of the current benchmark game is to measure time taken to do the same work in a contention-free environment.

As of now I'm still trying to build trust in the benchmarks, so thank you for your evaluation and feedback @uschindler!

So far I believe they are doing the "same" work, as I have chased down a few tokenization issues. Right now the indexes on both sides have --

  • almost "same" tokenization -- split by whitespace and remove tokens with length >= 256
  • same index sort
  • same set of deleted docs (2% in total)
  • single segment

Regarding the JVM, here is what we do now:

  • warm up the JVM with 6.1k queries for each of COUNT and TOP_10_COUNT. We could easily increase the warmup iterations here. As I was typing, I already changed the warmup iterations to 3 and kicked off a run.

Admittedly we haven't looked into playing with different JVM arguments. @mikemccand thanks for creating Tony-X/search-benchmark-game#37 to explore the heap sizes :)

IMO, GC here is less of an issue since we measure the best latency (min) across 10 runs for each query (a slight favor for the JVM). The probability that all 10 of the 10 runs of the same query hit a GC pause is very small.

It would be great if you could share your insights about an optimal JVM setting for this case.

@Tony-X
Contributor

Tony-X commented Jun 10, 2023

https://github.com/Tony-X/search-benchmark-game/blob/a6f71237e84aeb47ee8f5b923f16b760b032e1fc/engines/lucene-9.5.0/query.sh

This is not used in benchmarks. This is more like a debugging tool, e.g. you can run that script and type some queries.

@msokolov
Contributor

No no. There is no specific capitalization :). I was correcting tantivity -> tantivy.

Just to keep everyone accurate: the original misspelling was "tantitvy", not "tantivity"! :)

I checked the repository. When Tantitvy says it is 2 times faster ...

@fulmicoton

fulmicoton commented Jun 14, 2023

@uschindler I updated the bench of the search benchmarks with:

@uschindler
Contributor

Hi, thanks for cross-checking. A one-hour warmup therefore doesn't change anything.

Anyways, I'd use a newer JDK like 20.

@zhaih
Contributor

zhaih commented Jun 16, 2023

After roughly catching up with the thread, I would like to go back to:

+1 to update the Weight#count API contract or introduce a new API to better optimize counting.

So we could probably have two count APIs, constantCount and fastCount?
Then the contract could become: IS (IndexSearcher) will call constantCount first, and then:

  • if it returns -2, there is no possible constant-time count, but a faster-than-normal count is possible, so IS will call fastCount to retrieve the count;
  • if it returns -1, no faster count is possible, so go back to the query and count normally;
  • if it returns a natural number, that is the count.

Does that make sense?
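
A rough sketch of that contract from the caller's side (constantCount and fastCount are the hypothetical methods proposed above; the fallback helper name is made up):

// Illustrative only: the proposed two-method contract as IndexSearcher might use it.
int c = weight.constantCount(context);
if (c >= 0) {
  return c;                                        // exact count, obtained in ~constant time
} else if (c == -2) {
  return weight.fastCount(context);                // linear, but cheaper than per-hit collection
} else {                                           // -1: no shortcut available at all
  return countWithTotalHitCountCollector(query);   // hypothetical fallback helper
}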

@jpountz
Contributor

jpountz commented Jun 19, 2023

I was wondering about a possible alternative approach that would consist of handling this optimization more at the collector level. E.g. we could add a new method to LeafCollector to take a set of doc IDs (pseudo code):

void collect(DocIdSet set) {
  // default implementation
  for (int docID : set) {
    collect(docID);
  }
}

Then BooleanScorer could pass a DocIdSet to the collector for each window, and TotalHitCountCollector could override the above method to instead do something like this:

void collect(DocIdSet set) {
  // total hit count implementation
  count += set.size(); // internally uses Long#bitCount rather than iterating bits one by one
}

@mikemccand
Member Author

E.g. we could add a new method to LeafCollector to take a set of doc IDs (pseudo code):

I like this idea! Sort of like BulkScorer, which can score/collect N docs at a time.

@zhaih
Contributor

zhaih commented Jun 20, 2023

we could add a new method to LeafCollector to take a set of doc IDs (pseudo code)

+1, I think this is a better/cleaner approach

@jpountz
Contributor

jpountz commented Jun 20, 2023

I like it better too but I think it's going to need more work, e.g. we will need to remove the Scorable#docID() API to avoid running into cases when the doc ID passed to LeafCollector#collect is not the same as the doc ID exposed by Scorable#docID()?

@zhaih
Contributor

zhaih commented Jun 28, 2023

Maybe we need a BulkScorable or something that holds multiple Scorables (or just an array of scores), and set the contract that collect(DocIdSet) should use BulkScorable rather than a normal Scorable? (This sounds a little bit hacky though...)

jpountz added a commit to jpountz/lucene that referenced this issue Jun 30, 2023
`Scorable#docID()` exposes the document that is being collected, which makes it
impossible to bulk-collect multiple documents at once.

Relates apache#12358
@jpountz
Contributor

jpountz commented Jun 30, 2023

It felt to me like it would be doable to remove Scorable#docID without introducing a replacement, so I gave it a try at #12407.

jpountz added a commit that referenced this issue Jul 5, 2023
`Scorable#docID()` exposes the document that is being collected, which makes it
impossible to bulk-collect multiple documents at once.

Relates #12358
jpountz added a commit to jpountz/lucene that referenced this issue Jul 5, 2023
This introduces `LeafCollector#collect(DocIdStream)` to enable collectors to
collect batches of doc IDs at once. `BooleanScorer` takes advantage of this by
creating a `DocIdStream` whose `count()` method counts the number of bits that
are set in the bit set of matches in the current window, instead of naively
iterating over all matches.

On wikimedium10m, this yields a ~20% speedup when counting hits for the `title
OR 12` query (2.9M hits).

Relates apache#12358
@jpountz
Contributor

jpountz commented Jul 5, 2023

I opened a proof of concept for the idea that I suggested above at #12415.
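
For reference, here is a hedged sketch of how a counting collector can take advantage of that bulk hook, following the `LeafCollector#collect(DocIdStream)` / `DocIdStream#count()` shape described in the commit message above (treat the exact signatures as illustrative):

import java.io.IOException;
import org.apache.lucene.search.DocIdStream;
import org.apache.lucene.search.LeafCollector;
import org.apache.lucene.search.Scorable;

// Illustrative sketch: a counting LeafCollector that overrides the bulk hook so each
// window of matches is counted via DocIdStream#count() instead of per-doc collect calls.
final class CountingLeafCollector implements LeafCollector {
  int count;

  @Override
  public void setScorer(Scorable scorer) {
    // scores are not needed for counting
  }

  @Override
  public void collect(int doc) {
    count++;                         // one-by-one fallback path
  }

  @Override
  public void collect(DocIdStream stream) throws IOException {
    count += stream.count();         // counts the set bits in the window's bit set
  }
}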

jpountz added a commit that referenced this issue Aug 11, 2023
This introduces `LeafCollector#collect(DocIdStream)` to enable collectors to
collect batches of doc IDs at once. `BooleanScorer` takes advantage of this by
creating a `DocIdStream` whose `count()` method counts the number of bits that
are set in the bit set of matches in the current window, instead of naively
iterating over all matches.

On wikimedium10m, this yields a ~20% speedup when counting hits for the `title
OR 12` query (2.9M hits).

Relates #12358
jpountz added a commit that referenced this issue Aug 11, 2023
This introduces `LeafCollector#collect(DocIdStream)` to enable collectors to
collect batches of doc IDs at once. `BooleanScorer` takes advantage of this by
creating a `DocIdStream` whose `count()` method counts the number of bits that
are set in the bit set of matches in the current window, instead of naively
iterating over all matches.

On wikimedium10m, this yields a ~20% speedup when counting hits for the `title
OR 12` query (2.9M hits).

Relates #12358