Store docIds as bitset when doc IDs are strictly sorted and dense [LUCENE-10233] #11269

asfimport · 2021-11-14T18:13:40Z

In low cardinality points cases, id blocks will usually store doc ids that have the same point value, and intersect will get into addAll logic. If we store ids as bitset, and give the IntersectVisitor bulk visiting ability, we can speed up addAll because we can just execute the 'or' logic between the result and the block ids.

Optimization will be triggered when the following conditions are met at the same time:

doc IDs are sorted strictly
max(docId) - min(docId) <= 16 * pointCount (in order to avoid expanding too much storage)

I mocked a field that has 10,000,000 docs per value and search it with a 1 term PointInSetQuery, the build scorer time decreased from 151ms to 5ms.

Migrated from LUCENE-10233 by Feng Guo (@gf2121), resolved Dec 03 2021
Attachments: SparseFixedBitSet.png
Pull requests: #438, #499

The text was updated successfully, but these errors were encountered:

asfimport · 2021-11-15T15:36:06Z

Adrien Grand (@jpountz) (migrated from JIRA)

This is an interesting idea!

One drawback of this approach is that we're trying to keep the number of classes that implement oal.util.BitSet at 2, and this would be a 3rd one. I wonder if we could use SparseFixedBitSet instead of this new OffsetBitSet class?

asfimport · 2021-11-15T19:48:21Z

Feng Guo (@gf2121) (migrated from JIRA)

@jpountz Thanks! +1 to remove the OffsetBitSet class, but i find that there has not been a faster implementation for FixedBitSet.or(SparseFixedBitSet), do you mean we should implement it?

In this case, Bitset is clearly in this parttern: [nothing][dense]. So compared with SparseFixedBitSet, I tend to use a more customized way to store if possible . Another way to replace the OffsetBitSet class i can think of is to support a docBase in BitSetIterator, i implement this in the newest commit: #438, i wonder if this approach makes sense to you.

asfimport · 2021-11-16T08:08:54Z

Adrien Grand (@jpountz) (migrated from JIRA)

Indeed, I was thinking of implementing FixedBitSet#or(SparseFixedBitSet).

Your other approach works but I worry that it might be fragile since it requires that callers that extract a BitSet from a BitSetIterator always check the docBase?

asfimport · 2021-11-16T08:34:16Z

Feng Guo (@gf2121) (migrated from JIRA)

Thanks! I'll try SparseFixedBitSet :)

asfimport · 2021-11-16T13:18:18Z

Feng Guo (@gf2121) (migrated from JIRA)

@jpountz Sadly, I found that the implementation of SparseFixedBitSet is much more complicated than I thought, and it seems that the or operation of FixedBitSet and SparseFixedBitSet can not be simplely solved by | . I need to spend some more time to consider how to implement this algorithm, and I'm not sure if that can be as efficient as before.

Considering the new challenges, I made some new [changes|[https://github.com//pull/438/commits/292e4ed43119832d626506336ed61152f733a431]] in my original approach: I added extra 0 words for the bitset in the original method. And i created a new expert interface to get the bitset regardless of the docBase. Generally, users can simply use the old interface to get a bitset because the content represented by the bitset is consistent with the idSetIterator. I wonder if this apporach can solve your worries?

asfimport · 2021-11-16T13:50:32Z

Adrien Grand (@jpountz) (migrated from JIRA)

I like how it makes contracts simpler, but I now worry about the memory usage.

I suspect that the proposal I made wouldn't be as efficient as your initial implementation indeed, but hopefully it can be good enough that it would still be a significant speedup compared to today. The way I'm seeing it, we could add SparseFixedBitSet#fromWords(int maxDoc, int base, long[] words) and FixedBitSet#or(SparseFixedBitSet).

asfimport · 2021-11-17T09:21:02Z

Feng Guo (@gf2121) (migrated from JIRA)

@jpountz Thanks for the guide! Actually, there can be no case run into the 'add extra words ' logic so far since the old usages are all docBase = 0 and the new usage (this issue) is using the expert method and do 'or' with an offset. So what you are worried is about the potential risk that cause too much memory usage right? What if we directly throw an Assert Error if users want to extract the bitset but docBase != 0? This may remove the potential risk.

Anyway, i agree with you that a SparseFixedBitSet seems more suitable for the existing framework and less intrusive, so I implemented in the newest commit. This approach looks good to me too.

asfimport · 2021-11-17T12:39:20Z

Feng Guo (@gf2121) (migrated from JIRA)

@jpountz I execute a low cardinality terms query on baseline/ docbase / SparseFixedBitSet:

Origin: 158ms
Docbase: 6ms
SparseFixedBitSet: 71ms

And this is the script: https://github.com/apache/lucene/blob/2586014ca8575d55ae25fa4b35d8df6d2f8f40f2/lucene/core/src/java/org/apache/lucene/util/bkd/Run.java

I'm trying to find some more optimization for SparseFixedBitSet so that it can perform better.

asfimport · 2021-11-17T14:13:23Z

Feng Guo (@gf2121) (migrated from JIRA)

In addition, here are the commit hashes of the codes:

Docbase : 1e0e10a
SparseFixedBitSet: e016f4d

asfimport · 2021-11-18T09:49:39Z

Feng Guo (@gf2121) (migrated from JIRA)

Hi @jpountz, Would you like to give me some suggestions on how to speed up SparseFixedBitSet?

Here are some of my current thoughts:

This is the cpu profile of SparseFixedBitSet: SparseFixedBitSet.png, most of the cpu is used to constructing SparseFixedBitSet, i think this is because we always new big arrays there. Constructing a single global SparseFixedBitSet and reusing it for each block may help, but global SparseFixedBitSet needs the sgment's maxdoc and maxDoc is not available in the BKDReader. I'm not sure if it is worth to changing the BKDReader constrctor signature for this since BKDReader is a public class.

asfimport · 2021-11-23T04:54:31Z

Feng Guo (@gf2121) (migrated from JIRA)

Hi @jpountz. I'm still trying some optimization for the SparseFixedBitSet way, but i wonder if this approach still makes sense to you? Should i move on?

asfimport · 2021-11-25T17:42:37Z

Adrien Grand (@jpountz) (migrated from JIRA)

Sorry for my lagging response and thanks for trying hard to make things work with SparseFixedBitSet. The performance is indeed a bit disappointing compared to your initial implementation and I can't think of an easy way to make it faster given that the bottleneck seems to be on allocations and that there is no clean way of reusing the SparseFixedBitSet across leaves.

Maybe we should go back to your initial approach (sorry for the back-and-forth!) using an approach that doesn't introduce a new BitSet implementation and introduces a new DocIdSetIterator class instead of reusing BitSetIterator in order to avoid the trap of extracting a BitSet from a BitSetIterator and ignoring the docBase?

asfimport · 2021-11-26T06:48:22Z

Feng Guo (@gf2121) (migrated from JIRA)

@jpountz Thanks a lot for your great suggestion! I really like this idea and have implemented this in the newest commit.

asfimport · 2021-12-01T07:05:38Z

Feng Guo (@gf2121) (migrated from JIRA)

Hi @jpountz! Is there anything else that worries you about this approach? Just tell me and I will try my best to solve it :)

asfimport · 2021-12-01T14:31:18Z

ASF subversion and git services (migrated from JIRA)

Commit 5eb575f in lucene's branch refs/heads/main from gf2121
https://gitbox.apache.org/repos/asf?p=lucene.git;h=5eb575f

LUCENE-10233: Store docIds as bitset to speed up addAll (#438)

asfimport · 2021-12-01T18:09:35Z

Feng Guo (@gf2121) (migrated from JIRA)

Hi @jpountz I find that intersector has some 'inverse' usage. I think we can use AND NOT to speed up these situation, i raised a new PR for this: #499.

asfimport · 2021-12-02T07:34:01Z

ASF subversion and git services (migrated from JIRA)

Commit ffb58f6 in lucene's branch refs/heads/main from Adrien Grand
https://gitbox.apache.org/repos/asf?p=lucene.git;h=ffb58f6

Revert "LUCENE-10233: Store docIds as bitset to speed up addAll (#438)"

This reverts commit 5eb575f.

asfimport · 2021-12-02T07:37:34Z

Adrien Grand (@jpountz) (migrated from JIRA)

This change causes issues with backward-compatibility tests: we have emulation for writing indices with old formats which don't work well with this change due to the change of endianness that happened in 9.0, the longs of the bitset get read in the wrong order.

I reverted to have some time to think about the best way to move forward.

asfimport · 2021-12-02T08:21:42Z

Adrien Grand (@jpountz) (migrated from JIRA)

I opened a PR that changes the way we handle backward compatibility for readLELongs and readLEFloats, which makes this change pass backward compatibility tests: #502.

asfimport · 2021-12-02T14:52:40Z

ASF subversion and git services (migrated from JIRA)

Commit 3d61ff2 in lucene's branch refs/heads/main from gf2121
https://gitbox.apache.org/repos/asf?p=lucene.git;h=3d61ff2

LUCENE-10233: Store docIds as bitset to speed up addAll (#438)

asfimport · 2021-12-02T15:34:03Z

ASF subversion and git services (migrated from JIRA)

Commit dd14424 in lucene's branch refs/heads/branch_9x from gf2121
https://gitbox.apache.org/repos/asf?p=lucene.git;h=dd14424

LUCENE-10233: Store docIds as bitset to speed up addAll (#438)

asfimport · 2021-12-03T08:23:23Z

ASF subversion and git services (migrated from JIRA)

Commit 5f09bb3 in lucene's branch refs/heads/main from gf2121
https://gitbox.apache.org/repos/asf?p=lucene.git;h=5f09bb3

LUCENE-10233: Use AND NOT for inverse intersector (#499)

When docIds are stored as a BitSet, use andNot to speed up collecting them.

asfimport · 2021-12-03T08:27:27Z

ASF subversion and git services (migrated from JIRA)

Commit eadc146 in lucene's branch refs/heads/branch_9x from gf2121
https://gitbox.apache.org/repos/asf?p=lucene.git;h=eadc146

LUCENE-10233: Use AND NOT for inverse intersector (#499)

When docIds are stored as a BitSet, use andNot to speed up collecting them.

asfimport · 2021-12-03T11:32:05Z

Ignacio Vera (@iverase) (migrated from JIRA)

@gf2121 It seems the last change is making CI angry:

https://ci-builds.apache.org/job/Lucene/job/Lucene-Check-9.x/379/

Could you have a look?

asfimport · 2021-12-03T12:07:30Z

Feng Guo (@gf2121) (migrated from JIRA)

@iverase Sorry! I raised a PR to fix this: https://github.com/apache/lucene/pull/512.

I run this 100k times and looks good now.

asfimport · 2021-12-06T06:33:41Z

ASF subversion and git services (migrated from JIRA)

Commit 459388c in lucene's branch refs/heads/main from gf2121
https://gitbox.apache.org/repos/asf?p=lucene.git;h=459388c

LUCENE-10233: fix Unit Test TestFixedBitSet#testAndNot (#512)

Co-authored-by: guofeng.my <guofeng.my@bytedance.com>

asfimport · 2021-12-06T06:35:03Z

ASF subversion and git services (migrated from JIRA)

Commit ebee531 in lucene's branch refs/heads/branch_9x from gf2121
https://gitbox.apache.org/repos/asf?p=lucene.git;h=ebee531

LUCENE-10233: fix Unit Test TestFixedBitSet#testAndNot (#512)

Co-authored-by: guofeng.my <guofeng.my@bytedance.com>

asfimport · 2022-03-22T15:16:30Z

Adrien Grand (@jpountz) (migrated from JIRA)

Closing after the 9.1.0 release.

asfimport closed this as completed Dec 3, 2021

asfimport mentioned this issue Aug 24, 2022

Speed up medium cardinality fields with readLongs and SIMD [LUCENE-10297] #11333

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Store docIds as bitset when doc IDs are strictly sorted and dense [LUCENE-10233] #11269

Store docIds as bitset when doc IDs are strictly sorted and dense [LUCENE-10233] #11269

asfimport commented Nov 14, 2021

asfimport commented Nov 15, 2021

asfimport commented Nov 15, 2021

asfimport commented Nov 16, 2021

asfimport commented Nov 16, 2021

asfimport commented Nov 16, 2021

asfimport commented Nov 16, 2021

asfimport commented Nov 17, 2021

asfimport commented Nov 17, 2021

asfimport commented Nov 17, 2021

asfimport commented Nov 18, 2021

asfimport commented Nov 23, 2021

asfimport commented Nov 25, 2021

asfimport commented Nov 26, 2021

asfimport commented Dec 1, 2021

asfimport commented Dec 1, 2021

asfimport commented Dec 1, 2021

asfimport commented Dec 2, 2021

asfimport commented Dec 2, 2021

asfimport commented Dec 2, 2021

asfimport commented Dec 2, 2021

asfimport commented Dec 2, 2021

asfimport commented Dec 3, 2021

asfimport commented Dec 3, 2021

asfimport commented Dec 3, 2021

asfimport commented Dec 3, 2021

asfimport commented Dec 6, 2021

asfimport commented Dec 6, 2021

asfimport commented Mar 22, 2022

Store docIds as bitset when doc IDs are strictly sorted and dense [LUCENE-10233] #11269

Store docIds as bitset when doc IDs are strictly sorted and dense [LUCENE-10233] #11269

Comments

asfimport commented Nov 14, 2021

asfimport commented Nov 15, 2021

asfimport commented Nov 15, 2021

asfimport commented Nov 16, 2021

asfimport commented Nov 16, 2021

asfimport commented Nov 16, 2021

asfimport commented Nov 16, 2021

asfimport commented Nov 17, 2021

asfimport commented Nov 17, 2021

asfimport commented Nov 17, 2021

asfimport commented Nov 18, 2021

asfimport commented Nov 23, 2021

asfimport commented Nov 25, 2021

asfimport commented Nov 26, 2021

asfimport commented Dec 1, 2021

asfimport commented Dec 1, 2021

asfimport commented Dec 1, 2021

asfimport commented Dec 2, 2021

asfimport commented Dec 2, 2021

asfimport commented Dec 2, 2021

asfimport commented Dec 2, 2021

asfimport commented Dec 2, 2021

asfimport commented Dec 3, 2021

asfimport commented Dec 3, 2021

asfimport commented Dec 3, 2021

asfimport commented Dec 3, 2021

asfimport commented Dec 6, 2021

asfimport commented Dec 6, 2021

asfimport commented Mar 22, 2022