-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Push and pop OutputAccumulator as IntersectTermsEnumFrames are pushed and popped #12900
Conversation
OK I opened #12901 to create a better BWC test revealing these read-time exceptions -- can we do that, first, and then confirm #12900 indeed fixes it, then push the test, then push the fix? I can try to work on creating that test case, but won't have much time until early tomorrow AM. If anyone else wants to crack it, feel free! |
I've confirmed the new (failing) BWC test from #12901 now passes with this PR. I'll review ... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @gf2121! I left a few head scratching comments ...
@@ -2245,7 +2244,6 @@ public void testFailOpenOldIndex() throws IOException { | |||
// #12895: test on a carefully crafted 9.8.0 index (from a small contiguous subset | |||
// of wikibigall unique terms) that shows the read-time exception of | |||
// IntersectTermsEnum (used by WildcardQuery) | |||
@Ignore("re-enable once we merge #12900") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yay :)
@@ -89,6 +89,9 @@ final class IntersectTermsEnumFrame { | |||
|
|||
final ByteArrayDataInput bytesReader = new ByteArrayDataInput(); | |||
|
|||
// Cumulative outputs so far | |||
BytesRef[] outputPrefix; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's sort of strange to have a separate array for tracking the accumulated BytesRef
outputs when this is the purpose of the outputAccumulator
? Wouldn't we be able to push/pop as we push/pop the frames, into the single accumulator, so we don't need make extra arrays as we push deeper into the FST?
We could assert when we pop that the output we are popping exactly matches (.equals()
) the FST.Arc
's output from that frame.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't we be able to push/pop as we push/pop the frames, into the single accumulator, so we don't need make extra arrays as we push deeper into the FST?
I'm not sure. Looking at pushFrame
, it seems like one single frame could be related to more than one FST#Arc
. Maybe we need to record the number of outputs pushed into the accumulator when pushing a frame ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this idea to avoid the array copy. Implemented in e387ec9.
@@ -198,6 +204,7 @@ private IntersectTermsEnumFrame pushFrame(int state) throws IOException { | |||
} | |||
|
|||
f.arc = arc; | |||
f.outputPrefix = outputAccumulator.bytesRefs(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm shouldn't we sometimes set this outputPrefix
back to null
? The frames are re-used, so we could push deep into the FST, where there is an outputPrefix
, then pop back out and push into a new part of the FST that does not have an outputPrefix
and illegally share / use the old leftover stale outputPrefix
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
push into a new part of the FST that does not have an outputPrefix and illegally share / use the old leftover stale outputPrefix?
I thought the outputPrefix
will be always overwritten by pushFrame
when we push into a new part of the FST. Did i miss something ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought the
outputPrefix
will be always overwritten bypushFrame
when we push into a new part of the FST. Did i miss something ?
Ahh OK you are right -- and we could simply rewrite to empty BytesRef
array which means the same thing as null
. Good.
@@ -495,7 +495,7 @@ public boolean seekExact(BytesRef target) throws IOException { | |||
targetUpto = 0; | |||
outputAccumulator.push(arc.nextFinalOutput()); | |||
currentFrame = pushFrame(arc, 0); | |||
outputAccumulator.pop(); | |||
outputAccumulator.pop(arc.nextFinalOutput()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like another potential bug : If arc.nextFinalOutput() == NO_OUTPUT
, we will not add the output into the accumulator but we pop one out.
Not sure why we have not seen any bug report for this. Maybe our output format protect it implicitly? This is in SegmentTermsEnum
so it could possibly affect merge. I'll dig to see if it is a real bug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! And the new check you added in pop
will prevent this bug, right?
But I also think this was (luckily!?) not a bug, because this case is when the STE
is not yet seek'd to any term, so we are starting at the FST entry arc / root block, and the empty string FST input must always have a final output (the "root block"), which will never be NO_OUTPUT
. Are there any other places in STE
where we were blindly popping like this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like the other places where we "blindly" pop are fine -- we conditionalize by arc.isFinal()
check, and we know FST for BlockTree terms index will always have a nextFinalOutput
on final nodes since it holds the floor data for that terms block. And there there is one other place in seekCeil
where we also rely on empty string having a final output, which is fine. So net/net I don't think this was a bug, but it smelled close to one!
@@ -198,9 +199,11 @@ private IntersectTermsEnumFrame pushFrame(int state) throws IOException { | |||
} | |||
|
|||
f.arc = arc; | |||
f.outputNum = outputAccumulator.outputCount() - initOutputCount; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK we need this because a single frame might traverse multiple FST arcs, so we need to accumulate possibly / often more than one output.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me! I like the added safety of confirming on pop
that we are in fact pop
ing what we think we are (except when we pop multiple items). And it's cleaner than the first cut by simple using the output accumulator consistently.
I think we should merge this to main
, 9x
and 99x
and let's let CI chew on it for a bit (day or so) before cutting 9.9.1?
Thanks for review and great advice @mikemccand !
+1. I'll merge and backport soon. |
FWIW I confirmed that this change makes the new test in #12925 pass. |
Relate to : #12895