Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrate UnifiedHighlighter #21621

Merged
merged 4 commits into from
Jan 31, 2017
Merged

Integrate UnifiedHighlighter #21621

merged 4 commits into from
Jan 31, 2017

Conversation

jimczi
Copy link
Contributor

@jimczi jimczi commented Nov 17, 2016

This change integrates the Lucene highlighter called "unified" in the list of supported highlighters for ES.
This highlighter has multiple modes:

  • plain: a mode that analyzes the plain text directly
  • postings: a mode that uses the postings offsets to perform the highlight
  • fvh: a mode that uses the term vectors to perform the highlighting

Since this is a "unified" highlighter here is the complete list of highlighting features supported or not by this integration:

  • works on fields with term_vectors, offsets or simplystored(or extracted from _source).
  • force_source
  • pre_tags and post_tags
    Multiple tags are not allowed
  • encoder (html and default)
  • number_of_fragments
  • no_match_size
  • require_field_match
    Requires http://issues.apache.org/jira/browse/LUCENE-7575. Note that LUCENE-7575 is more than just require_field_match since it can select which fields to highlight
  • matched_fields
  • fragment_size
    Requires http://issues.apache.org/jira/browse/LUCENE-7565
  • boundary_chars
  • boundary_max_scan
  • Queries with rewrite that needs an index reader are ignored (MultiPhrasePrefixQuery, CommonTermsQuery, AllFieldQuery, ...).
  • Collapse contiguous highlights: <b>a</b><b>b</b><b>c</b> => <b>abc</b>

Fixes #21376

@jimczi
Copy link
Contributor Author

jimczi commented Nov 17, 2016

Documentation is missing (which is why this is a WIP ;) )

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some questions/comments but this looks good to me overall.

return snippets;
}

protected void append(StringBuilder dest, String content, int start, int end) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you write some javadoc as to why a custom impl would want to override this?

if (bq.getQuery() instanceof TermQuery) {
prefixQuery.add(((TermQuery) bq.getQuery()).getTerm());
return prefixQuery;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks like a different change?

highlighters.register("unified", uh);
highlighters.register("unified_plain", uh);
highlighters.register("unified_postings", uh);
highlighters.register("unified_fvh", uh);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need to register the same instance under multiple names?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh I see then we have a switch on the highlighter type in the impl, maybe write a comment about it here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The experimental highlighter used another field if you were going to override. It called it "hit_source" I think. I like that kind of thing better because it makes it more clear that you are only changing where the hits are calculated from not the actual highlighter implementation.


if (field.fieldOptions().scoreOrdered()) {
//let's sort the snippets by score if needed
CollectionUtil.introSort(snippets, (o1, o2) -> (int) Math.signum(o2.getScore() - o1.getScore()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe use Double.compare rather than Math.signum?

@@ -22,6 +22,7 @@
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Snippet;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

leftover?

1) extract different snippets (instead of a single big string) together with their scores ({@link Snippet})
2) use the {@link Encoder} implementations that are already used with the other highlighters
*/
public class CustomPassageFormatter extends PassageFormatter {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there no way to re-use the already existing CustomPassageFormatter class?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like they have to extend different classes with the same name....

* under the License.
*/

package org.apache.lucene.search.uhighlight;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not a big fan of uhighlight as a package name, maybe we can simply use "unifiedhighlight", since we already have "postingshighlight"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We had similar thoughts (but were instead considering unifiedhighlighter as a package). Still it seemed too long.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feel free to file an issue to LUCENE to rename the package to "unifiedhighlight". It's labelled "@lucene.experimental" and was just released.

@@ -31,6 +31,7 @@
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Snippet;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

leftover?

@@ -19,6 +19,7 @@

package org.apache.lucene.search.postingshighlight;

import org.apache.lucene.search.highlight.Snippet;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

leftover?

@javanna
Copy link
Member

javanna commented Nov 17, 2016

I think docs are super important. We already have 3 highlighters and it is hard enough for users to figure out which one to use. Would be nice to come up with reasons to prefer this new implementation over the existing ones etc.

@nik9000
Copy link
Member

nik9000 commented Nov 17, 2016

Would be nice to come up with reasons to prefer this new implementation over the existing ones etc.

I expect this one allows you to source the hits from postings while still using fvh-like snippet segmentation. So it works properly on more free form fields. I expect this but I have not confirmed it to be so.

@jimczi
Copy link
Contributor Author

jimczi commented Nov 17, 2016

I expect this one allows you to source the hits from postings while still using fvh-like snippet segmentation. So it works properly on more free form fields. I expect this but I have not confirmed it to be so.

Well it's the other way around ;). The unified highlighter is a fork of the postings highlighter that can also works with term vectors or plain analysis. The snippet segmentation is exactly the same as the postings one. I only see this highlighter as a generalization of the postings highlighter. After some benchmarking it seems to perform quite well on plain analysis (twice faster than the original plain highlighter on wikipedia text) but the drawbacks are the same than for the original Postings Highlighter.

@nik9000
Copy link
Member

nik9000 commented Nov 17, 2016

Well it's the other way around ;). The unified highlighter is a fork of the postings highlighter that can also works with term vectors or plain analysis. The snippet segmentation is exactly the same as the postings one. I only see this highlighter as a generalization of the postings highlighter. After some benchmarking it seems to perform quite well on plain analysis (twice faster than the original plain highlighter on wikipedia text) but the drawbacks are the same than for the original Postings Highlighter.

Ok then.....

@javanna
Copy link
Member

javanna commented Nov 17, 2016

@jimczi were your benchmarks based on lucene code or elasticsearch code from this PR?

@jimczi
Copy link
Contributor Author

jimczi commented Nov 17, 2016

@javanna elasticsearch code from this PR. Well not exactly a benchmark though, only me playing with some queries ;)

@clintongormley
Copy link
Contributor

Does the unified highlighter support fragment length? I tried it out on #9442 and it gives better results than the existing highlighter, but does not appear to limit the length.

@clintongormley
Copy link
Contributor

Does this issue apply to the unified highlighter too? #11223

@clintongormley
Copy link
Contributor

The unified highlighter doesn't seem to work with boundary characters but doesn't complain about them not being supported either. See #11777

@jimczi jimczi added v5.2.0 and removed WIP labels Dec 12, 2016
@dsmiley
Copy link
Contributor

dsmiley commented Dec 12, 2016

Unless I misunderstand what ES's merge_fields is, I don't believe LUCENE-7575 handles it. AFAIK merge_fields would consume the positions from multiple fields and highlight one stored field. A use-case is analyzing the same text differently, perhaps with stemming in one field but not another, and then only storing the text once.

@jimczi
Copy link
Contributor Author

jimczi commented Dec 12, 2016

Unless I misunderstand what ES's merge_fields is, I don't believe LUCENE-7575 handles it.

That's correct. What I meant is that LUCENE-7575 is more than require_field_match and that it allows to select which field to extract from the query. I realized now that I mixed up merged fields and matched_fields that does what you describe. I'll update the description, thanks.

@jimczi
Copy link
Contributor Author

jimczi commented Dec 13, 2016

@elasticmachine retest this please

@jimczi
Copy link
Contributor Author

jimczi commented Jan 24, 2017

I pushed another iteration now that we upgraded to the latest Lucene release. The unified highlighter now supports require_field_match. I think it is enough to merge this in 5.x.
@nik9000 can you take a look ?

@jimczi jimczi added the review label Jan 24, 2017
@jimczi jimczi requested a review from nik9000 January 24, 2017 18:49
Copy link
Member

@nik9000 nik9000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments. I'm not sure about the "rewrites with empty reader" thing. The _all query looks like it should be fine. I'm not sure about the other one. I'll have to do some more digging.

import org.elasticsearch.search.fetch.subphase.highlight.HighlightUtils;

/**
Custom passage formatter that allows us to:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't valid javadoc. I think we should also document why this has to be in the org.apache.lucene package.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could move it to another package, the only reason to put it here is because we already have the custom postings highlighter in the org.apache.lucene package.

1) extract different snippets (instead of a single big string) together with their scores ({@link Snippet})
2) use the {@link Encoder} implementations that are already used with the other highlighters
*/
public class CustomPassageFormatter extends PassageFormatter {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like they have to extend different classes with the same name....


/**
* Subclass of the {@link UnifiedHighlighter} that works for a single field in a single document.
* Uses a custom {@link org.apache.lucene.search.uhighlight.PassageFormatter}. Accepts field content as a constructor
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd just import PassageFormatter and remove the package from the {@link bit.

* Creates a new instance of {@link CustomUnifiedHighlighter}
*
* @param analyzer the analyzer used for the field at index time, used for multi term queries internally
* @param passageFormatter our own {@link org.apache.lucene.search.uhighlight.CustomPassageFormatter}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another spot I wouldn't include the whole package for brevity.

return new MatchNoDocsQuery();
}
// if the terms does not exist we could return a MatchNoDocsQuery but this would break the unified highlighter
// which rewrites query with an empty reader.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this just going to make queries again _all a little less efficient what _all is disabled? If so I think that is ok.

@jimczi
Copy link
Contributor Author

jimczi commented Jan 31, 2017

Thanks @nik9000
I pushed some change to address your comments.
Regarding the "rewrites with empty reader" thing, this is required because the unified highlighter uses the rewritten query to build the automaton for the re-analysis strategy.

Copy link
Member

@nik9000 nik9000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. I think it is time we get this in so folks can experiment with it.

This change integrates the Lucene highlighter called "unified" in the list of supported highlighters for ES.
This highlighter has multiple modes:
  * plain: a mode that analyzes the plain text directly
  * postings: a mode that uses the postings offsets to perform the highlight
  * fvh: a mode that uses the term vectors to perform the highlighting
By default the mode is choosen automatically depending on the type of the field.
Currently it supports the following options:
* `force_source`
* `encoder`
* `highlight_query`
* `pre_tags and `post_tags`
@jimczi jimczi merged commit f6d38d4 into elastic:master Jan 31, 2017
@jimczi jimczi deleted the unified_highlighter branch January 31, 2017 18:06
@jimczi
Copy link
Contributor Author

jimczi commented Jan 31, 2017

Thanks @nik9000 !
Now merging to 5.x

jimczi added a commit that referenced this pull request Jan 31, 2017
* Integrate UnifiedHighlighter

This change integrates the Lucene highlighter called "unified" in the list of supported highlighters for ES.
This highlighter can extract offsets from either postings, term vectors, or via re-analyzing text.
The best strategy is picked automatically at query time and depends on the field and the query to highlight.
@intrafindBreno
Copy link

The current integration doesn't seem to support setting the maxLength field in UnifiedHighlighter. Is this correct, or did I miss something?

@jimczi
Copy link
Contributor Author

jimczi commented Feb 22, 2017

That's correct @fariaintrafind , the UnifiedHighlighter uses a sentence break iterator that does not take the maxFragmentLength in account. Though we plan to support this setting soon, it's on my todos ;)

@lami02
Copy link

lami02 commented Apr 4, 2017

I have tested the new highlighter and it works great so far. The only setting that does not seem to work (except from the settings that are not supported yet) is no_match_size. The number_of_fragments setting doesn't seem to do anything either, but I suspect thats because it is not possible to set a fragment_size yet.

@bleskes
Copy link
Contributor

bleskes commented Apr 4, 2017

@lami02 thanks for trying it out! do you mind opening a new issue with the errors you found? a concise reproduction with API calls will be great.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Integrate UnifiedHighlighter
10 participants