-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Integrate UnifiedHighlighter #21621
Integrate UnifiedHighlighter #21621
Conversation
Documentation is missing (which is why this is a WIP ;) ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some questions/comments but this looks good to me overall.
return snippets; | ||
} | ||
|
||
protected void append(StringBuilder dest, String content, int start, int end) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you write some javadoc as to why a custom impl would want to override this?
if (bq.getQuery() instanceof TermQuery) { | ||
prefixQuery.add(((TermQuery) bq.getQuery()).getTerm()); | ||
return prefixQuery; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this looks like a different change?
highlighters.register("unified", uh); | ||
highlighters.register("unified_plain", uh); | ||
highlighters.register("unified_postings", uh); | ||
highlighters.register("unified_fvh", uh); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need to register the same instance under multiple names?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh I see then we have a switch on the highlighter type in the impl, maybe write a comment about it here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The experimental highlighter used another field if you were going to override. It called it "hit_source"
I think. I like that kind of thing better because it makes it more clear that you are only changing where the hits are calculated from not the actual highlighter implementation.
|
||
if (field.fieldOptions().scoreOrdered()) { | ||
//let's sort the snippets by score if needed | ||
CollectionUtil.introSort(snippets, (o1, o2) -> (int) Math.signum(o2.getScore() - o1.getScore())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe use Double.compare rather than Math.signum?
@@ -22,6 +22,7 @@ | |||
import org.apache.lucene.analysis.Analyzer; | |||
import org.apache.lucene.search.IndexSearcher; | |||
import org.apache.lucene.search.Query; | |||
import org.apache.lucene.search.highlight.Snippet; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
leftover?
1) extract different snippets (instead of a single big string) together with their scores ({@link Snippet}) | ||
2) use the {@link Encoder} implementations that are already used with the other highlighters | ||
*/ | ||
public class CustomPassageFormatter extends PassageFormatter { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there no way to re-use the already existing CustomPassageFormatter class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like they have to extend different classes with the same name....
* under the License. | ||
*/ | ||
|
||
package org.apache.lucene.search.uhighlight; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not a big fan of uhighlight
as a package name, maybe we can simply use "unifiedhighlight", since we already have "postingshighlight"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We had similar thoughts (but were instead considering unifiedhighlighter as a package). Still it seemed too long.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feel free to file an issue to LUCENE to rename the package to "unifiedhighlight". It's labelled "@lucene.experimental" and was just released.
@@ -31,6 +31,7 @@ | |||
import org.apache.lucene.index.Term; | |||
import org.apache.lucene.search.IndexSearcher; | |||
import org.apache.lucene.search.Query; | |||
import org.apache.lucene.search.highlight.Snippet; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
leftover?
@@ -19,6 +19,7 @@ | |||
|
|||
package org.apache.lucene.search.postingshighlight; | |||
|
|||
import org.apache.lucene.search.highlight.Snippet; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
leftover?
I think docs are super important. We already have 3 highlighters and it is hard enough for users to figure out which one to use. Would be nice to come up with reasons to prefer this new implementation over the existing ones etc. |
I expect this one allows you to source the hits from postings while still using fvh-like snippet segmentation. So it works properly on more free form fields. I expect this but I have not confirmed it to be so. |
Well it's the other way around ;). The unified highlighter is a fork of the postings highlighter that can also works with term vectors or plain analysis. The snippet segmentation is exactly the same as the postings one. I only see this highlighter as a generalization of the postings highlighter. After some benchmarking it seems to perform quite well on plain analysis (twice faster than the original plain highlighter on wikipedia text) but the drawbacks are the same than for the original Postings Highlighter. |
Ok then..... |
@jimczi were your benchmarks based on lucene code or elasticsearch code from this PR? |
@javanna elasticsearch code from this PR. Well not exactly a benchmark though, only me playing with some queries ;) |
Does the unified highlighter support fragment length? I tried it out on #9442 and it gives better results than the existing highlighter, but does not appear to limit the length. |
Does this issue apply to the unified highlighter too? #11223 |
The unified highlighter doesn't seem to work with boundary characters but doesn't complain about them not being supported either. See #11777 |
Unless I misunderstand what ES's |
That's correct. What I meant is that LUCENE-7575 is more than |
@elasticmachine retest this please |
I pushed another iteration now that we upgraded to the latest Lucene release. The unified highlighter now supports |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some comments. I'm not sure about the "rewrites with empty reader" thing. The _all query looks like it should be fine. I'm not sure about the other one. I'll have to do some more digging.
import org.elasticsearch.search.fetch.subphase.highlight.HighlightUtils; | ||
|
||
/** | ||
Custom passage formatter that allows us to: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't valid javadoc. I think we should also document why this has to be in the org.apache.lucene
package.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could move it to another package, the only reason to put it here is because we already have the custom postings highlighter in the org.apache.lucene
package.
1) extract different snippets (instead of a single big string) together with their scores ({@link Snippet}) | ||
2) use the {@link Encoder} implementations that are already used with the other highlighters | ||
*/ | ||
public class CustomPassageFormatter extends PassageFormatter { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like they have to extend different classes with the same name....
|
||
/** | ||
* Subclass of the {@link UnifiedHighlighter} that works for a single field in a single document. | ||
* Uses a custom {@link org.apache.lucene.search.uhighlight.PassageFormatter}. Accepts field content as a constructor |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd just import PassageFormatter
and remove the package from the {@link
bit.
* Creates a new instance of {@link CustomUnifiedHighlighter} | ||
* | ||
* @param analyzer the analyzer used for the field at index time, used for multi term queries internally | ||
* @param passageFormatter our own {@link org.apache.lucene.search.uhighlight.CustomPassageFormatter} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another spot I wouldn't include the whole package for brevity.
return new MatchNoDocsQuery(); | ||
} | ||
// if the terms does not exist we could return a MatchNoDocsQuery but this would break the unified highlighter | ||
// which rewrites query with an empty reader. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this just going to make queries again _all
a little less efficient what _all
is disabled? If so I think that is ok.
Thanks @nik9000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. I think it is time we get this in so folks can experiment with it.
This change integrates the Lucene highlighter called "unified" in the list of supported highlighters for ES. This highlighter has multiple modes: * plain: a mode that analyzes the plain text directly * postings: a mode that uses the postings offsets to perform the highlight * fvh: a mode that uses the term vectors to perform the highlighting By default the mode is choosen automatically depending on the type of the field. Currently it supports the following options: * `force_source` * `encoder` * `highlight_query` * `pre_tags and `post_tags`
Thanks @nik9000 ! |
* Integrate UnifiedHighlighter This change integrates the Lucene highlighter called "unified" in the list of supported highlighters for ES. This highlighter can extract offsets from either postings, term vectors, or via re-analyzing text. The best strategy is picked automatically at query time and depends on the field and the query to highlight.
The current integration doesn't seem to support setting the |
That's correct @fariaintrafind , the UnifiedHighlighter uses a sentence break iterator that does not take the |
I have tested the new highlighter and it works great so far. The only setting that does not seem to work (except from the settings that are not supported yet) is |
@lami02 thanks for trying it out! do you mind opening a new issue with the errors you found? a concise reproduction with API calls will be great. |
This change integrates the Lucene highlighter called "unified" in the list of supported highlighters for ES.
This highlighter has multiple modes:
Since this is a "unified" highlighter here is the complete list of highlighting features supported or not by this integration:
term_vectors
,offsets
or simplystored
(or extracted from _source).force_source
pre_tags
andpost_tags
Multiple tags are not allowed
encoder
(html and default)number_of_fragments
no_match_size
require_field_match
Requires http://issues.apache.org/jira/browse/LUCENE-7575. Note that LUCENE-7575 is more than just require_field_match since it can select which fields to highlight
matched_fields
fragment_size
Requires http://issues.apache.org/jira/browse/LUCENE-7565
boundary_chars
boundary_max_scan
<b>a</b><b>b</b><b>c</b>
=><b>abc</b>
Fixes #21376