Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Corruption read on term dictionaries in Lucene 9.9 #12895

Closed
benwtrent opened this issue Dec 8, 2023 · 24 comments
Closed

Corruption read on term dictionaries in Lucene 9.9 #12895

benwtrent opened this issue Dec 8, 2023 · 24 comments
Labels
Milestone

Comments

@benwtrent
Copy link
Member

benwtrent commented Dec 8, 2023

Description

It seems that #12699 has inadvertantly broken reading term dictionaries created in Lucene 9.8<=.

To replicate a bug, one can index wikibigall with LuceneUtil & Lucene 9.8 & force-merge.

Then attempt to read the created index using a wildcard query:

    Path path = Paths.get("/data/local/lucene/indices/wikibigall.lucene-main.opt.Lucene90.dvfields.nd6.72652M/index");
    try (FSDirectory dir = FSDirectory.open(path);
        DirectoryReader reader = DirectoryReader.open(dir)) {
      IndexSearcher searcher = new IndexSearcher(reader); 
      searcher.count(new WildcardQuery(new Term("body", "*fo*")));
    }

This will result in a trace similar to below:

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Index 3 out of bounds for length 3
	at org.apache.lucene.store.ByteArrayDataInput.readByte(ByteArrayDataInput.java:136)
	at org.apache.lucene.store.DataInput.readVInt(DataInput.java:110)
	at org.apache.lucene.store.ByteArrayDataInput.readVInt(ByteArrayDataInput.java:114)
	at org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnumFrame.load(IntersectTermsEnumFrame.java:158)
	at org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnumFrame.load(IntersectTermsEnumFrame.java:149)
	at org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnum.pushFrame(IntersectTermsEnum.java:203)
	at org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnum._next(IntersectTermsEnum.java:531)
	at org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnum.next(IntersectTermsEnum.java:373)
	at org.apache.lucene.search.MultiTermQueryConstantScoreBlendedWrapper$1.rewriteInner(MultiTermQueryConstantScoreBlendedWrapper.java:111)
	at org.apache.lucene.search.AbstractMultiTermQueryConstantScoreWrapper$RewritingWeight.rewrite(AbstractMultiTermQueryConstantScoreWrapper.java:179)
	at org.apache.lucene.search.AbstractMultiTermQueryConstantScoreWrapper$RewritingWeight.bulkScorer(AbstractMultiTermQueryConstantScoreWrapper.java:220)
	at org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.bulkScorer(LRUQueryCache.java:930)
	at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:678)
	at org.apache.lucene.search.IndexSearcher.lambda$4(IndexSearcher.java:636)
	at org.apache.lucene.search.TaskExecutor$TaskGroup.lambda$0(TaskExecutor.java:118)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
	at org.apache.lucene.search.TaskExecutor$TaskGroup.invokeAll(TaskExecutor.java:153)
	at org.apache.lucene.search.TaskExecutor.invokeAll(TaskExecutor.java:76)
	at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:640)
	at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:607)
	at org.apache.lucene.search.IndexSearcher.count(IndexSearcher.java:423)
	at Corruption.main(Corruption.java:18)

We are currently not sure if this effects Lucene 9.9 created indices & reading via Lucene 9.9.

EDIT: This failure does NOT occur for indices created by 9.9 and read by 9.9.

NOTE: This also fails with just a prefix wildcard query. It seems to be all multi-term queries could be affected.

Will provide more example stack traces in issue comments.

Version and environment details

Lucene 9.9 reading Lucene 9.8 indices.

@benwtrent
Copy link
Member Author

//cc @gf2121 && @mikemccand

@benwtrent
Copy link
Member Author

Here are some exceptions ran into when trying to do multi-term queries with Lucene 9.9 against an index created in 9.8 or before:

Caused by: java.lang.ArrayIndexOutOfBoundsException: Index 4 out of bounds for length 4
	at org.apache.lucene.store.ByteArrayDataInput.readVInt(ByteArrayDataInput.java:112)
	at org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnumFrame.load(IntersectTermsEnumFrame.java:158)
	at org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnumFrame.load(IntersectTermsEnumFrame.java:149)
	at org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnum.pushFrame(IntersectTermsEnum.java:203)
	at org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnum._next(IntersectTermsEnum.java:531)
	at org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnum.next(IntersectTermsEnum.java:373)
	at org.apache.lucene.index.FilterLeafReader$FilterTermsEnum.next(FilterLeafReader.java:201)
Caused by: org.apache.lucene.index.CorruptIndexException: Illegal code for a compression algorithm: 3 (resource=SearchIndexInput{[(clone of) _3e4i_Lucene90_0.tim], context=IOContext [context=READ, mergeInfo=null, flushInfo=null, readOnce=false], cacheFile=SharedCacheFile{cacheKey=FileCacheKey[shardId=[.ds-logs-endpoint.events.file-default-2023.11.22-000002][0], primaryTerm=3, fileName=stateless_commit_61353], length=3174808501}, length=308358826, offset=181596})
	at org.apache.lucene.core@9.9.0/org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnumFrame.load(IntersectTermsEnumFrame.java:196)
	at org.apache.lucene.core@9.9.0/org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnumFrame.loadNextFloorBlock(IntersectTermsEnumFrame.java:119)
	at org.apache.lucene.core@9.9.0/org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnum.popPushNext(IntersectTermsEnum.java:339)
	at org.apache.lucene.core@9.9.0/org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnum._next(IntersectTermsEnum.java:543)
	at org.apache.lucene.core@9.9.0/org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnum.next(IntersectTermsEnum.java:373)
	at org.apache.lucene.core@9.9.0/org.apache.lucene.index.FilterLeafReader$FilterTermsEnum.next(FilterLeafReader.java:201)
Caused by: java.lang.NullPointerException: Cannot invoke "org.apache.lucene.util.fst.FST$Arc.output()" because "arc" is null
	at org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnum.pushFrame(IntersectTermsEnum.java:196)
	at org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnum._next(IntersectTermsEnum.java:531)
	at org.apache.lucene.codecs.lucene90.blocktree.IntersectTermsEnum.next(IntersectTermsEnum.java:373)
	at org.apache.lucene.index.FilterLeafReader$FilterTermsEnum.next(FilterLeafReader.java:201)

Still digging into this to figure out why reading is now corrupted. #12699 has many subtle changes, including a very big one where we don't copy the byte arrays any longer.

@benwtrent
Copy link
Member Author

Possibly related: #12631

NOTE: the read corruption doesn't occur when reading from an index created in 9.9.

@benwtrent
Copy link
Member Author

Git bisect has confirmed the read corruption occurs with: #12699

@mikemccand
Copy link
Member

Ugh -- I'll try to look at this later today. Disappointing that our back compat test specifically for reading 9.8 indices failed to catch this.

@mikemccand
Copy link
Member

It's also curious that it's not happening w/ 9.9 created indices. #12699 is about optimizing how we accumulate the long output while traversing (reading) the FST block tree terms index.

@benwtrent
Copy link
Member Author

@mikemccand I have to use at a minimum: wikibig1m for it to replicate.

Couple of weird things I noticed in that optimization PR:

This makes me think that each frame should have its own OutputAccumulator But all this code seems incredibly subtle and opaque to me. Been staring at it for hours now :)

rjernst added a commit to rjernst/elasticsearch that referenced this issue Dec 8, 2023
This commit updates Lucene to a patched version of 9.9.0 without
apache/lucene#12699. See
apache/lucene#12895.
elasticsearchmachine pushed a commit to elastic/elasticsearch that referenced this issue Dec 9, 2023
This commit updates Lucene to a patched version of 9.9.0 without
apache/lucene#12699. See
apache/lucene#12895.
@benwtrent
Copy link
Member Author

I think if a fix for this isn't found early next week, we should consider reverting it.

No user should upgrade to Lucene 9.9.0 with this bug.

@mikemccand
Copy link
Member

I am travelling this weekend and unlikely to make much progress on this until early next week.

Maybe we just revert and release 9.9.1 now?

@javanna
Copy link
Contributor

javanna commented Dec 9, 2023

I opened #12899 to revert.

@gf2121
Copy link
Contributor

gf2121 commented Dec 9, 2023

Thanks for all the discussion, and so sorry for introducing this bug!

https://github.com/apache/lucene/pull/12699/files#diff-c30add12e4170e7905cdb0912fd869d447b5b1772d74c9fcaf0afe41cd58866aL200-R204 Now instead of giving a complete view of all appended arcs, there is this weird assumption that we only give one to floorDataReader

This is actually one of the motivation of this PR. This works because we can not have two same fp encoded before floor data. I think this assumption works well for now, otherwise we will break this assert before meeting these exceptions.

https://github.com/apache/lucene/pull/12699/files#diff-c30add12e4170e7905cdb0912fd869d447b5b1772d74c9fcaf0afe41cd58866aR200-L202. The frame outputPrefix is then set as the accumulation of all the arc outputs. I am not sure if this is relevant as for it to be important getFrame would have to return a frame already seen before, I am not sure if that is valid or not.

This change is not actually intended. I suspect the bug is caused by this. I'll dig more to confirm.

Maybe we just revert and release 9.9.1 now?

+1 to revert it. Thanks @javanna for raising the revert PR!
BTW I wonder if we can only revert the part of IntersectTermsEnum and keep the change in SegmentTermsEnum, as it seems all exceptions are found in IntersectTermsEnum.

@mikemccand
Copy link
Member

I've managed to repro (thanks @benwtrent!) and indeed the bug is happening because this assumption is (very, very rarely?) invalid:

This works because we can not have two same fp encoded before floor data. I think this assumption works well for now, otherwise we will break this assert before meeting these exceptions.

I see a case where a single byte prefix [0xfe] appears before the final arc that has the rest of the floor/frame data. So the accumulator fails to put this 0xfe first and readVLong then fails. This is on a 9.8.0 written index where the vLong is still LSB encoded.

I don't think this assumption is valid @gf2121? Because that floor data first contains the file pointer of the on-disk block that this prefix points to (in MSB order as of 9.9, where lots of prefix sharing should happen), so, internal arcs before the final arc are in fact expected to output shared prefix bytes?

One thing I am curious about: it's possible that turning off suffix sharing (a separate change: #12722) sidesteps this bug and maybe that's why we are not seeing it with newly created (9.9.0) indices? We could test this by backporting #12722 to 9.8.x SNAPSHOT build and re-build the wikibigall and see if the corruption still happens. I'm not saying this is a workaround or anything but it'd make me more comfortable if we could understand why 9.9.0 written indices are not corrupt. Alternatively, we could revert #12722 in 9.9.x, rebuild wikibigall, and see if the bug then happens? I won't be able to try this likely for a day or two so if someone who can repro the bug could test this that would be awesome :). Thanks!

@mikemccand
Copy link
Member

mikemccand commented Dec 9, 2023

One thing I am curious about: it's possible that turning off suffix sharing (a separate change: #12722) sidesteps this bug and maybe that's why we are not seeing it with newly created (9.9.0) indices?

Doh! Nevermind: 9.9.0 does NOT have this change (it is still sharing suffixes in the FST) -- so that is not the reason why a 9.9.0 written index seems to not show this corruption... still digging to understand why!

Edit: perhaps it's the LSB -> MSB change in how we encode fp to the term block, though, from what I understand so far, that should've made corruption more likely, not less. So confused still...

@mikemccand
Copy link
Member

OK I turned back on the "old" way (inefficiently concatenating BytesRef as we visit arcs), and then diff'd old and new way when pushing the frame, then re-ran the *fo* WildcardQuery. It uncovers a handful of cases where the accumulator is wrong:

IR: StandardDirectoryReader(segments_3:1256 _hc(9.8.1):C3978548:[diagnostics={mergeMaxNumSegments=1, timestamp=1702127242522, os.version=6.4.11-arch2-1, os=Linux, java.vendor=N/A, mergeFactor=4, java.runtime.version=21+35, os.arch=amd64, source=merge, lucene.version=9.8.1}]:[attributes={Lucene90StoredFieldsFormat.mode=BEST_SPEED}] :id=atd8lnurk6fe2ukuaobcgg5l7)
DIFF: oldWay=[9e c3 a8 1c] newWay=[c3 a8 1c]
DIFF: oldWay=[ae a7 ba 1f] newWay=[a7 ba 1f]
DIFF: oldWay=[a6 b3 97 2b] newWay=[b3 97 2b]
DIFF: oldWay=[82 c4 b2 4c] newWay=[c4 b2 4c]
DIFF: oldWay=[af 8b ba 4c 1 6b b7 4] newWay=[8b ba 4c 1 6b b7 4]
DIFF: oldWay=[c6 92 a7 5a] newWay=[92 a7 5a]
DIFF: oldWay=[e2 a0 db 66] newWay=[a0 db 66]
DIFF: oldWay=[a2 a2 ec 7c] newWay=[a2 ec 7c]
DIFF: oldWay=[9e e6 e0 82 1] newWay=[e6 e0 82 1]
DIFF: oldWay=[b6 aa e3 82 1] newWay=[aa e3 82 1]
DIFF: oldWay=[ee ab 92 8b 1] newWay=[ab 92 8b 1]
DIFF: oldWay=[f2 e2 f5 8b 1] newWay=[e2 f5 8b 1]
DIFF: oldWay=[e2 c3 e5 8d 1] newWay=[c3 e5 8d 1]
DIFF: oldWay=[82 d4 e0 8e 1] newWay=[d4 e0 8e 1]
count for body:"*fo*" is 2871441 vs slow=58963 total_term_count=5759190

Here's the new vs old differ patch I used:

12895-debug.patch

@mikemccand
Copy link
Member

In each case it looks like the accumulator is missing a single prefix byte, which was set somewhere earlier in the arcs leading to the final arc.

@mikemccand
Copy link
Member

I have to stop working on this for today ... earliest I can look again is tomorrow AM! I'll try to mull in the background. I still can't explain why 9.9 written indices sidestep the read-time bug.

We also need some certainty that this bug is not producing corrupt (somehow) 9.9.0 indices? Note that SegmentTermsEnum is used during indexing when merging segments ... so if the bug is truly contained to IntersectTermsEnum, then we are fine (that is truly only search time).

@mikemccand
Copy link
Member

OK I used 9.9.0 to write the wikibigall index, and confirmed that WildcardQuery *fo* runs and seems to produce the correct results, yet, there are a great many DIFF outputs, curiously. Maybe the difference between old and new BytesRef floor date are somehow, sometimes, harmless? So confused...

@gf2121
Copy link
Contributor

gf2121 commented Dec 10, 2023

I don't think this assumption is valid @gf2121? Because that floor data first contains the file pointer of the on-disk block that this prefix points to (in MSB order as of 9.9, where lots of prefix sharing should happen), so, internal arcs before the final arc are in fact expected to output shared prefix bytes?

I thought the 'assumption' here means that we assert the floor data are all stored in the last arc. The whole FST output encoded as [ MSBVLong | floordata ]. We may share prefixes in MSBVLong, but we can not have two output having same MSBVLong so floordata will never be splitted into more than one arcs. Did i misunderstand something?

As @benwtrent pointed out, we should accumulate from the outputPrefix instead of arc.output. I raised #12900 for this. This patch seems to fix the exception when searching WildcardQuery(new Term("body", "*fo*")) on Wikibig1m. I'll tryWikibigall as well.

@mikemccand
Copy link
Member

and so sorry for introducing this bug!

BTW, please do not beat yourself up over this ;)

We of course try our best not to introduce bugs when making exciting optimizations, but, we are only human, this block tree code is insanely hairy / complex, especially IntersectTermsEnum (intersecting FST index + on-disk blocks with Automaton), and we have tests (which are inadequate now) that we improve every time we hit situations like this :)

Remember: in life and software, if you are not making mistakes, you are not trying hard enough!

@mikemccand
Copy link
Member

OK I think I understand why we see the bug on 9.8.x indices but NOT 9.9.x. indices. And I'm quite sure blast radius of the bug is contained solely to read-time, i.e. newly written 9.9.0 indices are not corrupt. Though I'd really love to see some stronger testing -- I'm hoping a randomly written large enough index might reveal the bug. I'll open a spinoff issue so we can create that.

Details:

We write FST with <term-prefix,byte[]> pairs, where term-prefix is the prefix of a set of terms sharing one on-disk block. E.g. foo is prefix, and the N terms would be foobar, foobaz, ... The start of that byte[] is a vLong, which is the absolute file pointer of the on-disk block shifted up by 2 bits, and 2 lower bits are two important flags: OUTPUT_FLAG_IS_FLOOR and OUTPUT_FLAG_HAS_TERMS.

Now, with 9.8.x, this is LSB encoded: those two flags are in the leading byte, not the last byte. FST will share output byte prefixes when it can, and it does in fact happen (rarely!) that the LSB (6 bits + 2 bit flags) is the same across N term blocks. It can happen if the last 6 bits of their block file pointers happen to be the same. In this case that leading byte (or possibly, even more rarely, bytes) are not all on the last arc, and this PR loses that leading flag-containing byte/bytes in that case, and computes the wrong flag value for OUTPUT_FLAG_IS_FLOOR, then tries to readVInt when non was written --> exceptions at read time.

@gf2121 indeed the leading fp + 2-bit flags will never be the same across blocks, so this means even if FST is able to share leading prefix byte or two, it will never share that entire vLong: there must be at least one final (different) byte for readVLong to read. And because vLong is self-delineating (each byte knows whether it is the last one by checking 8th bit is clear) losing a byte or two of its leading bytes won't break reading of the vLong, i.e., readVLong will indeed stop at the right byte (but of course produce the wrong value).

Importantly, IntersectTermsEnum does not use this encoded file pointer in the floorData! It gets the file-pointer by traversing prior arcs in the FST and accumulating N incremental diffs contained in the on-disk block entries leading to that final floor block. So IntersectTermsEnum just needs these two big flags.

In the 9.9.x written index, where we instead encode that leading long fp + 2 bits in MSB order, we share tons of prefixes of course (this is why FST got so much smaller), but, we will never share the entire thing, and that last byte contains our bit flags. So at read-time we will always have the right (flag containing) byte. The 9.9.x index is intact, and reading it is buggy yet buggy in a harmless way (throwing away bytes it does not try to use), and also because of how vLong is self delineating. Fun!

@mikemccand
Copy link
Member

I don't think this assumption is valid @gf2121? Because that floor data first contains the file pointer of the on-disk block that this prefix points to (in MSB order as of 9.9, where lots of prefix sharing should happen), so, internal arcs before the final arc are in fact expected to output shared prefix bytes?

I thought the 'assumption' here means that we assert the floor data are all stored in the last arc. The whole FST output encoded as [ MSBVLong | floordata ]. We may share prefixes in MSBVLong, but we can not have two output having same MSBVLong so floordata will never be splitted into more than one arcs. Did i misunderstand something?

Sorry @gf2121 -- that is indeed correct: except for the leading vLong-encoded "fp + 2 bits", the remainder of floor data will always be on the last arc. But that leading vLong has those important flags that we were losing in the LSB encoded case.

As @benwtrent pointed out, we should accumulate from the outputPrefix instead of arc.output. I raised #12900 for this. This patch seems to fix the exception when searching WildcardQuery(new Term("body", "*fo*")) on Wikibig1m. I'll tryWikibigall as well.

+1 -- this is the right fix (to not lose any leading bytes for the FST's output in IntersectTermsEnum). I'll review the PR and open followon issue to somehow expose the bug with stronger BWC test.

@mikemccand
Copy link
Member

OK I opened #12901 to create a test -- can we do that, first, and then confirm #12900 indeed fixes it, then push the test, then push the fix? I can try to work on creating that test case, but won't have much time until early tomorrow AM. If anyone else wants to crack it, feel free!

@benwtrent
Copy link
Member Author

and so sorry for introducing this bug!

I am with @mikemccand on this @gf2121!

The optimization proved valuable in benchmarks and all testing indicated it was good to go. Mistakes happen. I think the bigger thing is that our testing around this part of the code is lacking!

It's awesome to see that a solution was found so quickly!

mikemccand added a commit to mikemccand/lucene that referenced this issue Dec 11, 2023
mikemccand added a commit that referenced this issue Dec 11, 2023
…ck tree IntersectTermsEnum bug #12895 (#12913)

* #12901: add TestBackwardsCompatibility test case that reveals the block tree IntersectTermsEnum bug #12895

* woops, forgot to tidy up

* #12901: Ignore failing test; reflow text to workaround spotless' poor text formatting skills
mikemccand added a commit that referenced this issue Dec 11, 2023
…ck tree IntersectTermsEnum bug #12895 (#12913)

* #12901: add TestBackwardsCompatibility test case that reveals the block tree IntersectTermsEnum bug #12895

* woops, forgot to tidy up

* #12901: Ignore failing test; reflow text to workaround spotless' poor text formatting skills
mikemccand added a commit that referenced this issue Dec 11, 2023
…ck tree IntersectTermsEnum bug #12895 (#12913)

* #12901: add TestBackwardsCompatibility test case that reveals the block tree IntersectTermsEnum bug #12895

* woops, forgot to tidy up

* #12901: Ignore failing test; reflow text to workaround spotless' poor text formatting skills
jpountz added a commit to jpountz/lucene that referenced this issue Dec 12, 2023
This commit adds coverage to `Terms#intersect` to `CheckIndex` and indexes
`LineFileDocs` in `BasePostingsFormatTestCase` to get some coverage with
real-world data.

With this change, `TestLucene90PostingsFormat` now exhibits apache#12895.
jpountz added a commit that referenced this issue Dec 12, 2023
This commit adds coverage to `Terms#intersect` to `CheckIndex` and indexes
`LineFileDocs` in `BasePostingsFormatTestCase` to get some coverage with
real-world data.

With this change, `TestLucene90PostingsFormat` now exhibits #12895.
jpountz added a commit that referenced this issue Dec 12, 2023
This commit adds coverage to `Terms#intersect` to `CheckIndex` and indexes
`LineFileDocs` in `BasePostingsFormatTestCase` to get some coverage with
real-world data.

With this change, `TestLucene90PostingsFormat` now exhibits #12895.
jpountz added a commit that referenced this issue Dec 12, 2023
This commit adds coverage to `Terms#intersect` to `CheckIndex` and indexes
`LineFileDocs` in `BasePostingsFormatTestCase` to get some coverage with
real-world data.

With this change, `TestLucene90PostingsFormat` now exhibits #12895.
@ChrisHegarty
Copy link
Contributor

Closing as all work has been done.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants