Integrate UnifiedHighlighter #21621

jimczi · 2016-11-17T12:22:07Z

This change integrates the Lucene highlighter called "unified" in the list of supported highlighters for ES.
This highlighter has multiple modes:

plain: a mode that analyzes the plain text directly
postings: a mode that uses the postings offsets to perform the highlight
fvh: a mode that uses the term vectors to perform the highlighting

Since this is a "unified" highlighter here is the complete list of highlighting features supported or not by this integration:

Fixes #21376

jimczi · 2016-11-17T12:25:01Z

Documentation is missing (which is why this is a WIP ;) )

jpountz

I left some questions/comments but this looks good to me overall.

jpountz · 2016-11-17T14:40:15Z

core/src/main/java/org/apache/lucene/search/uhighlight/CustomPassageFormatter.java

+        return snippets;
+    }
+
+    protected void append(StringBuilder dest, String content, int start, int end) {


Can you write some javadoc as to why a custom impl would want to override this?
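One plausible reason for the hook, sketched below under stated assumptions: a subclass can encode the raw field text (for example HTML-escape it) while leaving the highlight tags untouched. `SketchPassageFormatter` and `HtmlEscapingFormatter` are hypothetical stand-ins written for this illustration; they are not the classes from this PR.

```java
// Hypothetical base formatter, loosely modelled on the method under review.
class SketchPassageFormatter {
    private final String preTag;
    private final String postTag;

    SketchPassageFormatter(String preTag, String postTag) {
        this.preTag = preTag;
        this.postTag = postTag;
    }

    /** Wraps content[matchStart, matchEnd) in the tags; all raw text goes through append. */
    String format(String content, int matchStart, int matchEnd) {
        StringBuilder sb = new StringBuilder();
        append(sb, content, 0, matchStart);
        sb.append(preTag);
        append(sb, content, matchStart, matchEnd);
        sb.append(postTag);
        append(sb, content, matchEnd, content.length());
        return sb.toString();
    }

    /**
     * Appends content[start, end) to dest unchanged.
     * Subclasses may override this to encode the text without touching the tags.
     */
    protected void append(StringBuilder dest, String content, int start, int end) {
        dest.append(content, start, end);
    }
}

// Hypothetical subclass: escapes HTML-sensitive characters in the field text only.
class HtmlEscapingFormatter extends SketchPassageFormatter {
    HtmlEscapingFormatter(String preTag, String postTag) {
        super(preTag, postTag);
    }

    @Override
    protected void append(StringBuilder dest, String content, int start, int end) {
        for (int i = start; i < end; i++) {
            char c = content.charAt(i);
            switch (c) {
                case '<': dest.append("&lt;"); break;
                case '>': dest.append("&gt;"); break;
                case '&': dest.append("&amp;"); break;
                default:  dest.append(c);
            }
        }
    }
}
```

With this split the tags are emitted verbatim and only the document text is escaped, which is the kind of behavior a javadoc on `append` could describe.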

jpountz · 2016-11-17T14:44:01Z

core/src/main/java/org/elasticsearch/index/search/MatchQuery.java

+                if (bq.getQuery() instanceof TermQuery) {
+                    prefixQuery.add(((TermQuery) bq.getQuery()).getTerm());
+                    return prefixQuery;
+                }


this looks like a different change?

jpountz · 2016-11-17T14:44:38Z

core/src/main/java/org/elasticsearch/search/SearchModule.java

+        highlighters.register("unified", uh);
+        highlighters.register("unified_plain", uh);
+        highlighters.register("unified_postings", uh);
+        highlighters.register("unified_fvh", uh);


why do we need to register the same instance under multiple names?

oh I see then we have a switch on the highlighter type in the impl, maybe write a comment about it here?

The experimental highlighter used another field if you were going to override. It called it "hit_source" I think. I like that kind of thing better because it makes it more clear that you are only changing where the hits are calculated from not the actual highlighter implementation.
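The pattern discussed in this thread, one instance registered under several names with a switch on the requested type inside the implementation, can be sketched roughly as follows. The registry, the interface, and the returned strings are all hypothetical and exist only for illustration; this is not the actual `SearchModule` API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified highlighter contract.
interface NamedHighlighter {
    String highlight(String type, String text);
}

// One implementation; the registered name only selects where offsets come from.
final class SketchUnifiedHighlighter implements NamedHighlighter {
    @Override
    public String highlight(String type, String text) {
        switch (type) {
            case "unified":          return "auto:" + text;
            case "unified_plain":    return "analysis:" + text;
            case "unified_postings": return "postings:" + text;
            case "unified_fvh":      return "term_vectors:" + text;
            default:
                throw new IllegalArgumentException("unknown highlighter type [" + type + "]");
        }
    }
}

// Hypothetical registry keyed by highlighter type name.
final class HighlighterRegistry {
    private final Map<String, NamedHighlighter> highlighters = new HashMap<>();

    void register(String name, NamedHighlighter highlighter) {
        if (highlighters.putIfAbsent(name, highlighter) != null) {
            throw new IllegalArgumentException("highlighter [" + name + "] already registered");
        }
    }

    String highlight(String type, String text) {
        NamedHighlighter highlighter = highlighters.get(type);
        if (highlighter == null) {
            throw new IllegalArgumentException("unknown highlighter type [" + type + "]");
        }
        return highlighter.highlight(type, text);
    }

    // Mirrors the four registrations in the diff above: same instance, four names.
    static HighlighterRegistry withUnified() {
        HighlighterRegistry registry = new HighlighterRegistry();
        NamedHighlighter uh = new SketchUnifiedHighlighter();
        registry.register("unified", uh);
        registry.register("unified_plain", uh);
        registry.register("unified_postings", uh);
        registry.register("unified_fvh", uh);
        return registry;
    }
}
```

A separate option for the offset source, as suggested in the comment above, would remove the need for the switch on the type name.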

jpountz · 2016-11-17T14:52:55Z

core/src/main/java/org/elasticsearch/search/fetch/subphase/highlight/UnifiedHighlighter.java

+
+        if (field.fieldOptions().scoreOrdered()) {
+            //let's sort the snippets by score if needed
+            CollectionUtil.introSort(snippets, (o1, o2) -> (int) Math.signum(o2.getScore() - o1.getScore()));


maybe use Double.compare rather than Math.signum?
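A minimal sketch of the suggested comparator, assuming a snippet type that exposes a float score. `ScoredSnippet` and `SnippetSorting` are hypothetical names used only here, and the sketch sorts with `List.sort` rather than `CollectionUtil.introSort` to stay self-contained.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for the Snippet class; only the score matters here.
final class ScoredSnippet {
    private final String text;
    private final float score;

    ScoredSnippet(String text, float score) {
        this.text = text;
        this.score = score;
    }

    String getText() { return text; }
    float getScore() { return score; }
}

final class SnippetSorting {
    // Highest score first. Double.compare avoids the subtraction and the cast,
    // and it orders NaN consistently instead of treating it as equal to everything.
    static final Comparator<ScoredSnippet> BY_SCORE_DESC =
        (o1, o2) -> Double.compare(o2.getScore(), o1.getScore());

    // Returns the snippet texts ordered by descending score; the input list is not modified.
    static List<String> sortedTexts(List<ScoredSnippet> snippets) {
        List<ScoredSnippet> copy = new ArrayList<>(snippets);
        copy.sort(BY_SCORE_DESC);
        List<String> texts = new ArrayList<>();
        for (ScoredSnippet snippet : copy) {
            texts.add(snippet.getText());
        }
        return texts;
    }
}
```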

javanna · 2016-11-17T17:17:27Z

core/src/main/java/org/apache/lucene/search/postingshighlight/CustomPostingsHighlighter.java

@@ -22,6 +22,7 @@
 import org.apache.lucene.analysis.Analyzer;
 import org.apache.lucene.search.IndexSearcher;
 import org.apache.lucene.search.Query;
+import org.apache.lucene.search.highlight.Snippet;


javanna · 2016-11-17T17:18:45Z

core/src/main/java/org/apache/lucene/search/uhighlight/CustomPassageFormatter.java

+1) extract different snippets (instead of a single big string) together with their scores ({@link Snippet})
+2) use the {@link Encoder} implementations that are already used with the other highlighters
+ */
+public class CustomPassageFormatter extends PassageFormatter {


is there no way to re-use the already existing CustomPassageFormatter class?

Looks like they have to extend different classes with the same name....

javanna · 2016-11-17T17:19:33Z

core/src/main/java/org/apache/lucene/search/uhighlight/CustomPassageFormatter.java

+ * under the License.
+ */
+
+package org.apache.lucene.search.uhighlight;


I am not a big fan of uhighlight as a package name, maybe we can simply use "unifiedhighlight", since we already have "postingshighlight"

We had similar thoughts (but were instead considering unifiedhighlighter as a package). Still it seemed too long.

Feel free to file an issue to LUCENE to rename the package to "unifiedhighlight". It's labelled "@lucene.experimental" and was just released.

javanna · 2016-11-17T17:21:15Z

...src/test/java/org/apache/lucene/search/postingshighlight/CustomPostingsHighlighterTests.java

@@ -31,6 +31,7 @@
 import org.apache.lucene.index.Term;
 import org.apache.lucene.search.IndexSearcher;
 import org.apache.lucene.search.Query;
+import org.apache.lucene.search.highlight.Snippet;


javanna · 2016-11-17T17:21:24Z

core/src/test/java/org/apache/lucene/search/postingshighlight/CustomPassageFormatterTests.java

@@ -19,6 +19,7 @@

 package org.apache.lucene.search.postingshighlight;

+import org.apache.lucene.search.highlight.Snippet;


javanna · 2016-11-17T17:23:06Z

I think docs are super important. We already have 3 highlighters and it is hard enough for users to figure out which one to use. Would be nice to come up with reasons to prefer this new implementation over the existing ones etc.

nik9000 · 2016-11-17T17:28:13Z

Would be nice to come up with reasons to prefer this new implementation over the existing ones etc.

I expect this one allows you to source the hits from postings while still using fvh-like snippet segmentation. So it works properly on more free form fields. I expect this but I have not confirmed it to be so.

jimczi · 2016-11-17T17:35:19Z

I expect this one allows you to source the hits from postings while still using fvh-like snippet segmentation. So it works properly on more free form fields. I expect this but I have not confirmed it to be so.

Well it's the other way around ;). The unified highlighter is a fork of the postings highlighter that can also work with term vectors or plain analysis. The snippet segmentation is exactly the same as the postings one. I only see this highlighter as a generalization of the postings highlighter. After some benchmarking it seems to perform quite well on plain analysis (twice as fast as the original plain highlighter on Wikipedia text), but the drawbacks are the same as for the original postings highlighter.

nik9000 · 2016-11-17T17:36:38Z

Well it's the other way around ;). The unified highlighter is a fork of the postings highlighter that can also work with term vectors or plain analysis. The snippet segmentation is exactly the same as the postings one. I only see this highlighter as a generalization of the postings highlighter. After some benchmarking it seems to perform quite well on plain analysis (twice as fast as the original plain highlighter on Wikipedia text), but the drawbacks are the same as for the original postings highlighter.

Ok then.....

javanna · 2016-11-17T17:40:13Z

@jimczi were your benchmarks based on lucene code or elasticsearch code from this PR?

jimczi · 2016-11-17T17:41:57Z

@javanna elasticsearch code from this PR. Well not exactly a benchmark though, only me playing with some queries ;)

clintongormley · 2016-11-24T19:16:46Z

Does the unified highlighter support fragment length? I tried it out on #9442 and it gives better results than the existing highlighter, but does not appear to limit the length.

clintongormley · 2016-11-24T19:23:00Z

Does this issue apply to the unified highlighter too? #11223

clintongormley · 2016-11-24T19:27:05Z

The unified highlighter doesn't seem to work with boundary characters but doesn't complain about them not being supported either. See #11777

dsmiley · 2016-12-12T15:24:44Z

Unless I misunderstand what ES's merge_fields is, I don't believe LUCENE-7575 handles it. AFAIK merge_fields would consume the positions from multiple fields and highlight one stored field. A use-case is analyzing the same text differently, perhaps with stemming in one field but not another, and then only storing the text once.

jimczi · 2016-12-12T16:27:43Z

Unless I misunderstand what ES's merge_fields is, I don't believe LUCENE-7575 handles it.

That's correct. What I meant is that LUCENE-7575 is more than require_field_match and that it allows selecting which field to extract from the query. I realize now that I mixed up merged fields and matched_fields, which does what you describe. I'll update the description, thanks.

jimczi · 2016-12-13T14:21:24Z

@elasticmachine retest this please

jimczi · 2017-01-24T18:49:04Z

I pushed another iteration now that we upgraded to the latest Lucene release. The unified highlighter now supports require_field_match. I think it is enough to merge this in 5.x.
@nik9000 can you take a look?

nik9000

I left some comments. I'm not sure about the "rewrites with empty reader" thing. The _all query looks like it should be fine. I'm not sure about the other one. I'll have to do some more digging.

nik9000 · 2017-01-30T23:33:28Z

core/src/main/java/org/apache/lucene/search/uhighlight/CustomPassageFormatter.java

+import org.elasticsearch.search.fetch.subphase.highlight.HighlightUtils;
+
+/**
+Custom passage formatter that allows us to:


This isn't valid javadoc. I think we should also document why this has to be in the org.apache.lucene package.

We could move it to another package, the only reason to put it here is because we already have the custom postings highlighter in the org.apache.lucene package.

nik9000 · 2017-01-30T23:35:36Z

core/src/main/java/org/apache/lucene/search/uhighlight/CustomPassageFormatter.java

+1) extract different snippets (instead of a single big string) together with their scores ({@link Snippet})
+2) use the {@link Encoder} implementations that are already used with the other highlighters
+ */
+public class CustomPassageFormatter extends PassageFormatter {


Looks like they have to extend different classes with the same name....

nik9000 · 2017-01-30T23:36:11Z

core/src/main/java/org/apache/lucene/search/uhighlight/CustomUnifiedHighlighter.java

+
+/**
+ * Subclass of the {@link UnifiedHighlighter} that works for a single field in a single document.
+ * Uses a custom {@link org.apache.lucene.search.uhighlight.PassageFormatter}. Accepts field content as a constructor


I'd just import PassageFormatter and remove the package from the {@link bit.

nik9000 · 2017-01-30T23:36:58Z

core/src/main/java/org/apache/lucene/search/uhighlight/CustomUnifiedHighlighter.java

+     * Creates a new instance of {@link CustomUnifiedHighlighter}
+     *
+     * @param analyzer the analyzer used for the field at index time, used for multi term queries internally
+     * @param passageFormatter our own {@link org.apache.lucene.search.uhighlight.CustomPassageFormatter}


Another spot I wouldn't include the whole package for brevity.

nik9000 · 2017-01-31T00:18:00Z

core/src/main/java/org/elasticsearch/common/lucene/all/AllTermQuery.java

-            return new MatchNoDocsQuery();
-        }
+        // if the terms does not exist we could return a MatchNoDocsQuery but this would break the unified highlighter
+        // which rewrites query with an empty reader.


Is this just going to make queries against _all a little less efficient when _all is disabled? If so I think that is ok.

jimczi · 2017-01-31T13:53:02Z

Thanks @nik9000
I pushed some changes to address your comments.
Regarding the "rewrites with empty reader" thing, this is required because the unified highlighter uses the rewritten query to build the automaton for the re-analysis strategy.

nik9000

Yeah. I think it is time we get this in so folks can experiment with it.

This change integrates the Lucene highlighter called "unified" in the list of supported highlighters for ES. This highlighter has multiple modes:
* plain: a mode that analyzes the plain text directly
* postings: a mode that uses the postings offsets to perform the highlight
* fvh: a mode that uses the term vectors to perform the highlighting

By default the mode is chosen automatically depending on the type of the field. Currently it supports the following options:
* `force_source`
* `encoder`
* `highlight_query`
* `pre_tags` and `post_tags`

jimczi · 2017-01-31T18:06:19Z

Thanks @nik9000 !
Now merging to 5.x

* Integrate UnifiedHighlighter

This change integrates the Lucene highlighter called "unified" in the list of supported highlighters for ES. This highlighter can extract offsets from either postings, term vectors, or via re-analyzing text. The best strategy is picked automatically at query time and depends on the field and the query to highlight.
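As a rough illustration of what "picked automatically" could mean, the sketch below chooses an offset source from what a field has indexed. The enum, the method, and the preference order (postings, then term vectors, then re-analysis) are assumptions made for this example. The sketch also ignores the query, which the commit message says plays a part in the real decision.

```java
// Hypothetical enum; the names describe the three strategies listed in the commit message.
enum OffsetSourceSketch {
    POSTINGS,      // offsets stored in the postings lists
    TERM_VECTORS,  // offsets stored in term vectors
    ANALYSIS       // re-analyze the stored text at highlight time
}

final class OffsetSourceChooser {
    // Assumed preference order: use whatever is already indexed before falling back
    // to re-analysis. The query is not considered in this sketch.
    static OffsetSourceSketch choose(boolean hasPostingsOffsets, boolean hasTermVectorOffsets) {
        if (hasPostingsOffsets) {
            return OffsetSourceSketch.POSTINGS;
        }
        if (hasTermVectorOffsets) {
            return OffsetSourceSketch.TERM_VECTORS;
        }
        return OffsetSourceSketch.ANALYSIS;
    }
}
```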

intrafindBreno · 2017-02-21T23:18:09Z

The current integration doesn't seem to support setting the maxLength field in UnifiedHighlighter. Is this correct, or did I miss something?

jimczi · 2017-02-22T07:38:18Z

That's correct @fariaintrafind, the UnifiedHighlighter uses a sentence break iterator that does not take the maxFragmentLength into account. We plan to support this setting soon though, it's on my todos ;)
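To illustrate why a sentence break iterator does not bound fragment length, here is a small sketch using the JDK's `java.text.BreakIterator`. It is not the highlighter's code, and `SentencePassages` is a hypothetical helper; it only shows that boundaries follow sentence ends, however long a sentence is.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class SentencePassages {
    // Splits text at sentence boundaries. No length limit is applied anywhere,
    // so a long sentence yields a long passage.
    static List<String> split(String text) {
        BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ROOT);
        sentences.setText(text);
        List<String> passages = new ArrayList<>();
        int start = sentences.first();
        for (int end = sentences.next(); end != BreakIterator.DONE; start = end, end = sentences.next()) {
            String passage = text.substring(start, end).trim();
            if (!passage.isEmpty()) {
                passages.add(passage);
            }
        }
        return passages;
    }
}
```

Supporting a maximum fragment length would need a different boundary strategy, for example one that also breaks near a target length.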

lami02 · 2017-04-04T11:48:48Z

I have tested the new highlighter and it works great so far. The only setting that does not seem to work (apart from the settings that are not supported yet) is no_match_size. The number_of_fragments setting doesn't seem to do anything either, but I suspect that's because it is not possible to set a fragment_size yet.

bleskes · 2017-04-04T11:56:17Z

@lami02 thanks for trying it out! do you mind opening a new issue with the errors you found? a concise reproduction with API calls will be great.

jimczi added the :Search Relevance/Highlighting, >feature, v6.0.0-alpha1, and WIP labels Nov 17, 2016

jpountz approved these changes Nov 17, 2016

javanna reviewed Nov 17, 2016

clintongormley mentioned this pull request Nov 24, 2016

Bug when fvh highlighting a phrase query on an ngrammed field #10071

Closed

clintongormley mentioned this pull request Nov 24, 2016

Match phrase query fvh highlighter issue. #12648

Closed

jimczi added v5.2.0 and removed WIP labels Dec 12, 2016

clintongormley added v5.3.0 and removed v5.2.0 labels Jan 24, 2017

jimczi added the review label Jan 24, 2017

jimczi requested a review from nik9000 January 24, 2017 18:49

nik9000 reviewed Jan 31, 2017

nik9000 approved these changes Jan 31, 2017

jimczi added 4 commits January 31, 2017 19:01

Add require_field_match support

261a137

Address review comments

f416cbe

rest test

491c7fe

jimczi merged commit f6d38d4 into elastic:master Jan 31, 2017

jimczi deleted the unified_highlighter branch January 31, 2017 18:06

clintongormley added the release highlight label Jan 31, 2017

clintongormley mentioned this pull request Feb 6, 2017

In some cases FVH returns StringIndexOutOfBoundsException #22997

Closed

Mpdreamz mentioned this pull request Feb 15, 2017

Unified highlighter support elastic/elasticsearch-net#2595

Closed

chrinor2002 mentioned this pull request Dec 8, 2024

feature: unified highlighting sudo-suhas/elastic-builder#206

Merged
