-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Corruption read on term dictionaries in Lucene 9.9 #12895
Comments
//cc @gf2121 && @mikemccand |
Here are some exceptions ran into when trying to do multi-term queries with Lucene 9.9 against an index created in 9.8 or before:
Still digging into this to figure out why reading is now corrupted. #12699 has many subtle changes, including a very big one where we don't copy the byte arrays any longer. |
Possibly related: #12631 NOTE: the read corruption doesn't occur when reading from an index created in 9.9. |
Git bisect has confirmed the read corruption occurs with: #12699 |
Ugh -- I'll try to look at this later today. Disappointing that our back compat test specifically for reading 9.8 indices failed to catch this. |
It's also curious that it's not happening w/ 9.9 created indices. #12699 is about optimizing how we accumulate the long output while traversing (reading) the FST block tree terms index. |
@mikemccand I have to use at a minimum: Couple of weird things I noticed in that optimization PR:
This makes me think that each frame should have its own |
This commit updates Lucene to a patched version of 9.9.0 without apache/lucene#12699. See apache/lucene#12895.
This commit updates Lucene to a patched version of 9.9.0 without apache/lucene#12699. See apache/lucene#12895.
I think if a fix for this isn't found early next week, we should consider reverting it. No user should upgrade to Lucene 9.9.0 with this bug. |
I am travelling this weekend and unlikely to make much progress on this until early next week. Maybe we just revert and release 9.9.1 now? |
I opened #12899 to revert. |
Thanks for all the discussion, and so sorry for introducing this bug!
This is actually one of the motivation of this PR. This works because we can not have two same fp encoded before floor data. I think this assumption works well for now, otherwise we will break this assert before meeting these exceptions.
This change is not actually intended. I suspect the bug is caused by this. I'll dig more to confirm.
+1 to revert it. Thanks @javanna for raising the revert PR! |
I've managed to repro (thanks @benwtrent!) and indeed the bug is happening because this assumption is (very, very rarely?) invalid:
I see a case where a single byte prefix I don't think this assumption is valid @gf2121? Because that floor data first contains the file pointer of the on-disk block that this prefix points to (in MSB order as of 9.9, where lots of prefix sharing should happen), so, internal arcs before the final arc are in fact expected to output shared prefix bytes? One thing I am curious about: it's possible that turning off suffix sharing (a separate change: #12722) sidesteps this bug and maybe that's why we are not seeing it with newly created (9.9.0) indices? We could test this by backporting #12722 to 9.8.x SNAPSHOT build and re-build the |
Doh! Nevermind: 9.9.0 does NOT have this change (it is still sharing suffixes in the FST) -- so that is not the reason why a 9.9.0 written index seems to not show this corruption... still digging to understand why! Edit: perhaps it's the LSB -> MSB change in how we encode |
OK I turned back on the "old" way (inefficiently concatenating
Here's the new vs old differ patch I used: |
In each case it looks like the accumulator is missing a single prefix byte, which was set somewhere earlier in the arcs leading to the final arc. |
I have to stop working on this for today ... earliest I can look again is tomorrow AM! I'll try to mull in the background. I still can't explain why 9.9 written indices sidestep the read-time bug. We also need some certainty that this bug is not producing corrupt (somehow) 9.9.0 indices? Note that |
OK I used 9.9.0 to write the |
I thought the 'assumption' here means that we assert the floor data are all stored in the last arc. The whole FST output encoded as As @benwtrent pointed out, we should accumulate from the |
BTW, please do not beat yourself up over this ;) We of course try our best not to introduce bugs when making exciting optimizations, but, we are only human, this block tree code is insanely hairy / complex, especially Remember: in life and software, if you are not making mistakes, you are not trying hard enough! |
OK I think I understand why we see the bug on 9.8.x indices but NOT 9.9.x. indices. And I'm quite sure blast radius of the bug is contained solely to read-time, i.e. newly written 9.9.0 indices are not corrupt. Though I'd really love to see some stronger testing -- I'm hoping a randomly written large enough index might reveal the bug. I'll open a spinoff issue so we can create that. Details: We write FST with <term-prefix,byte[]> pairs, where term-prefix is the prefix of a set of terms sharing one on-disk block. E.g. Now, with 9.8.x, this is LSB encoded: those two flags are in the leading byte, not the last byte. FST will share output byte prefixes when it can, and it does in fact happen (rarely!) that the LSB (6 bits + 2 bit flags) is the same across N term blocks. It can happen if the last 6 bits of their block file pointers happen to be the same. In this case that leading byte (or possibly, even more rarely, bytes) are not all on the last arc, and this PR loses that leading flag-containing byte/bytes in that case, and computes the wrong flag value for @gf2121 indeed the leading fp + 2-bit flags will never be the same across blocks, so this means even if FST is able to share leading prefix byte or two, it will never share that entire Importantly, In the 9.9.x written index, where we instead encode that leading long fp + 2 bits in MSB order, we share tons of prefixes of course (this is why FST got so much smaller), but, we will never share the entire thing, and that last byte contains our bit flags. So at read-time we will always have the right (flag containing) byte. The 9.9.x index is intact, and reading it is buggy yet buggy in a harmless way (throwing away bytes it does not try to use), and also because of how |
Sorry @gf2121 -- that is indeed correct: except for the leading vLong-encoded "fp + 2 bits", the remainder of floor data will always be on the last arc. But that leading vLong has those important flags that we were losing in the LSB encoded case.
+1 -- this is the right fix (to not lose any leading bytes for the FST's output in |
I am with @mikemccand on this @gf2121! The optimization proved valuable in benchmarks and all testing indicated it was good to go. Mistakes happen. I think the bigger thing is that our testing around this part of the code is lacking! It's awesome to see that a solution was found so quickly! |
…he block tree IntersectTermsEnum bug apache#12895
This commit adds coverage to `Terms#intersect` to `CheckIndex` and indexes `LineFileDocs` in `BasePostingsFormatTestCase` to get some coverage with real-world data. With this change, `TestLucene90PostingsFormat` now exhibits apache#12895.
This commit adds coverage to `Terms#intersect` to `CheckIndex` and indexes `LineFileDocs` in `BasePostingsFormatTestCase` to get some coverage with real-world data. With this change, `TestLucene90PostingsFormat` now exhibits #12895.
This commit adds coverage to `Terms#intersect` to `CheckIndex` and indexes `LineFileDocs` in `BasePostingsFormatTestCase` to get some coverage with real-world data. With this change, `TestLucene90PostingsFormat` now exhibits #12895.
This commit adds coverage to `Terms#intersect` to `CheckIndex` and indexes `LineFileDocs` in `BasePostingsFormatTestCase` to get some coverage with real-world data. With this change, `TestLucene90PostingsFormat` now exhibits #12895.
Closing as all work has been done. |
Description
It seems that #12699 has inadvertantly broken reading term dictionaries created in Lucene 9.8<=.
To replicate a bug, one can index wikibigall with LuceneUtil & Lucene 9.8 & force-merge.
Then attempt to read the created index using a wildcard query:
This will result in a trace similar to below:
We are currently not sure if this effects Lucene 9.9 created indices & reading via Lucene 9.9.
EDIT: This failure does NOT occur for indices created by 9.9 and read by 9.9.
NOTE: This also fails with just a prefix wildcard query. It seems to be all multi-term queries could be affected.
Will provide more example stack traces in issue comments.
Version and environment details
Lucene 9.9 reading Lucene 9.8 indices.
The text was updated successfully, but these errors were encountered: