-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Store docIds as bitset when doc IDs are strictly sorted and dense [LUCENE-10233] #11269
Comments
Adrien Grand (@jpountz) (migrated from JIRA) This is an interesting idea! One drawback of this approach is that we're trying to keep the number of classes that implement oal.util.BitSet at 2, and this would be a 3rd one. I wonder if we could use SparseFixedBitSet instead of this new OffsetBitSet class? |
Feng Guo (@gf2121) (migrated from JIRA) @jpountz Thanks! +1 to remove the OffsetBitSet class, but i find that there has not been a faster implementation for FixedBitSet.or(SparseFixedBitSet), do you mean we should implement it? In this case, Bitset is clearly in this parttern: [nothing][dense]. So compared with SparseFixedBitSet, I tend to use a more customized way to store if possible . Another way to replace the OffsetBitSet class i can think of is to support a docBase in BitSetIterator, i implement this in the newest commit: #438, i wonder if this approach makes sense to you. |
Adrien Grand (@jpountz) (migrated from JIRA) Indeed, I was thinking of implementing Your other approach works but I worry that it might be fragile since it requires that callers that extract a |
Feng Guo (@gf2121) (migrated from JIRA) Thanks! I'll try SparseFixedBitSet :) |
Feng Guo (@gf2121) (migrated from JIRA) @jpountz Sadly, I found that the implementation of Considering the new challenges, I made some new [changes|[https://github.com//pull/438/commits/292e4ed43119832d626506336ed61152f733a431]] in my original approach: I added extra 0 words for the bitset in the original method. And i created a new expert interface to get the bitset regardless of the docBase. Generally, users can simply use the old interface to get a bitset because the content represented by the bitset is consistent with the idSetIterator. I wonder if this apporach can solve your worries? |
Adrien Grand (@jpountz) (migrated from JIRA) I like how it makes contracts simpler, but I now worry about the memory usage. I suspect that the proposal I made wouldn't be as efficient as your initial implementation indeed, but hopefully it can be good enough that it would still be a significant speedup compared to today. The way I'm seeing it, we could add |
Feng Guo (@gf2121) (migrated from JIRA) @jpountz Thanks for the guide! Actually, there can be no case run into the 'add extra words ' logic so far since the old usages are all docBase = 0 and the new usage (this issue) is using the expert method and do 'or' with an offset. So what you are worried is about the potential risk that cause too much memory usage right? What if we directly throw an Assert Error if users want to extract the bitset but docBase != 0? This may remove the potential risk. Anyway, i agree with you that a SparseFixedBitSet seems more suitable for the existing framework and less intrusive, so I implemented in the newest commit. This approach looks good to me too. |
Feng Guo (@gf2121) (migrated from JIRA) @jpountz I execute a low cardinality terms query on baseline/ docbase / SparseFixedBitSet: Origin: 158ms And this is the script: https://github.com/apache/lucene/blob/2586014ca8575d55ae25fa4b35d8df6d2f8f40f2/lucene/core/src/java/org/apache/lucene/util/bkd/Run.java
I'm trying to find some more optimization for SparseFixedBitSet so that it can perform better. |
Feng Guo (@gf2121) (migrated from JIRA) In addition, here are the commit hashes of the codes: |
Feng Guo (@gf2121) (migrated from JIRA) Hi @jpountz, Would you like to give me some suggestions on how to speed up SparseFixedBitSet? Here are some of my current thoughts: This is the cpu profile of SparseFixedBitSet: SparseFixedBitSet.png, most of the cpu is used to constructing SparseFixedBitSet, i think this is because we always new big arrays there. Constructing a single global SparseFixedBitSet and reusing it for each block may help, but global SparseFixedBitSet needs the sgment's maxdoc and maxDoc is not available in the BKDReader. I'm not sure if it is worth to changing the BKDReader constrctor signature for this since BKDReader is a public class. |
Feng Guo (@gf2121) (migrated from JIRA) Hi @jpountz. I'm still trying some optimization for the SparseFixedBitSet way, but i wonder if this approach still makes sense to you? Should i move on? |
Adrien Grand (@jpountz) (migrated from JIRA) Sorry for my lagging response and thanks for trying hard to make things work with SparseFixedBitSet. The performance is indeed a bit disappointing compared to your initial implementation and I can't think of an easy way to make it faster given that the bottleneck seems to be on allocations and that there is no clean way of reusing the SparseFixedBitSet across leaves. Maybe we should go back to your initial approach (sorry for the back-and-forth!) using an approach that doesn't introduce a new BitSet implementation and introduces a new DocIdSetIterator class instead of reusing BitSetIterator in order to avoid the trap of extracting a BitSet from a BitSetIterator and ignoring the docBase? |
Feng Guo (@gf2121) (migrated from JIRA) @jpountz Thanks a lot for your great suggestion! I really like this idea and have implemented this in the newest commit. |
Feng Guo (@gf2121) (migrated from JIRA) Hi @jpountz! Is there anything else that worries you about this approach? Just tell me and I will try my best to solve it :) |
ASF subversion and git services (migrated from JIRA) Commit 5eb575f in lucene's branch refs/heads/main from gf2121 LUCENE-10233: Store docIds as bitset to speed up addAll (#438) |
Feng Guo (@gf2121) (migrated from JIRA) Hi @jpountz I find that |
ASF subversion and git services (migrated from JIRA) Commit ffb58f6 in lucene's branch refs/heads/main from Adrien Grand Revert "LUCENE-10233: Store docIds as bitset to speed up addAll (#438)" This reverts commit 5eb575f. |
Adrien Grand (@jpountz) (migrated from JIRA) This change causes issues with backward-compatibility tests: we have emulation for writing indices with old formats which don't work well with this change due to the change of endianness that happened in 9.0, the longs of the bitset get read in the wrong order. I reverted to have some time to think about the best way to move forward. |
Adrien Grand (@jpountz) (migrated from JIRA) I opened a PR that changes the way we handle backward compatibility for readLELongs and readLEFloats, which makes this change pass backward compatibility tests: #502. |
ASF subversion and git services (migrated from JIRA) Commit 3d61ff2 in lucene's branch refs/heads/main from gf2121 LUCENE-10233: Store docIds as bitset to speed up addAll (#438) |
ASF subversion and git services (migrated from JIRA) Commit dd14424 in lucene's branch refs/heads/branch_9x from gf2121 LUCENE-10233: Store docIds as bitset to speed up addAll (#438) |
ASF subversion and git services (migrated from JIRA) Commit 5f09bb3 in lucene's branch refs/heads/main from gf2121 LUCENE-10233: Use AND NOT for inverse intersector (#499) When docIds are stored as a BitSet, use andNot to speed up collecting them. |
ASF subversion and git services (migrated from JIRA) Commit eadc146 in lucene's branch refs/heads/branch_9x from gf2121 LUCENE-10233: Use AND NOT for inverse intersector (#499) When docIds are stored as a BitSet, use andNot to speed up collecting them. |
Ignacio Vera (@iverase) (migrated from JIRA) @gf2121 It seems the last change is making CI angry: https://ci-builds.apache.org/job/Lucene/job/Lucene-Check-9.x/379/
Could you have a look? |
Feng Guo (@gf2121) (migrated from JIRA) @iverase Sorry! I raised a PR to fix this: https://github.com/apache/lucene/pull/512. I run this 100k times and looks good now. |
ASF subversion and git services (migrated from JIRA) Commit 459388c in lucene's branch refs/heads/main from gf2121 LUCENE-10233: fix Unit Test TestFixedBitSet#testAndNot (#512) Co-authored-by: guofeng.my <guofeng.my@bytedance.com> |
ASF subversion and git services (migrated from JIRA) Commit ebee531 in lucene's branch refs/heads/branch_9x from gf2121 LUCENE-10233: fix Unit Test TestFixedBitSet#testAndNot (#512) Co-authored-by: guofeng.my <guofeng.my@bytedance.com> |
Adrien Grand (@jpountz) (migrated from JIRA) Closing after the 9.1.0 release. |
In low cardinality points cases, id blocks will usually store doc ids that have the same point value, and
intersect
will get intoaddAll
logic. If we store ids as bitset, and give the IntersectVisitor bulk visiting ability, we can speed up addAll because we can just execute the 'or' logic between the result and the block ids.Optimization will be triggered when the following conditions are met at the same time:
I mocked a field that has 10,000,000 docs per value and search it with a 1 term PointInSetQuery, the build scorer time decreased from 151ms to 5ms.
Migrated from LUCENE-10233 by Feng Guo (@gf2121), resolved Dec 03 2021
Attachments: SparseFixedBitSet.png
Pull requests: #438, #499
The text was updated successfully, but these errors were encountered: