Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Index phrases #30450

Merged
merged 24 commits into from
Jun 4, 2018
Merged

Index phrases #30450

merged 24 commits into from
Jun 4, 2018

Conversation

romseygeek
Copy link
Contributor

This commit adds an index_phrases option to text fields, which if enabled creates a subsidiary field containing two-term shingles from the main field. Phrase match queries are then redirected to this subsidiary field.

The index prefix field is normally indexed as docs-only, given that it cannot
be used in phrases.  However, in the case that the parent field has been indexed
with offsets, or has term-vector offsets, we should also store this in the index
prefix field for highlighting.

Note that this commit does not implement highlighting on prefix fields, but
rather ensures that future work can implement this without a backwards-break
in index data.

Closes #28994
Specifying `index_phrases: true` on a text field mapping will add a subsidiary
[field]._index_phrase field, indexing two-term shingles from the parent field.
The parent analysis chain is re-used, wrapped with a FixedShingleFilter.

At query time, if a phrase match query is executed, the mapping will redirect it
to run against the subsidiary field.

This should trade faster phrase querying for a larger index and longer indexing
times.
@romseygeek romseygeek self-assigned this May 8, 2018
@romseygeek romseygeek requested review from jimczi, colings86 and jpountz May 8, 2018 08:50
@romseygeek romseygeek added the :Search/Search Search-related issues that do not fall into other categories label May 8, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search-aggs

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments. I think we should have a simple function in MappedFieldType that can build phrase queries from a text+analyzer. This would simplify the usage and keep the focus on phrase queries building.


PhraseFieldType(String name) {
setTokenized(true);
setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should index offsets if the parent field activated them.

protected Query analyzePhrase(String field, TokenStream stream, int slop) throws IOException {
assert slop == 0;
Query q = super.analyzePhrase(field + FAST_PHRASE_SUFFIX, new FixedShingleFilter(stream, 2), slop);
logger.info("Phrase query: " + q);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

debug, trace ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happens if you search for a single term in a phrase query ? Shouldn't we fallback to the parent field since it won't match the shingles ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also wondering what happens in case the token stream has gaps because of stop words since we can't use the shingle field in that case?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should work since FixedShingleFilter adds gap in the shingle in the form of a filler token (defaults to _).

@@ -80,7 +80,7 @@ private ICUTokenizerConfig getIcuConfig(Environment env, Settings settings) {
if (tailored.isEmpty()) {
return null;
} else {
final BreakIterator breakers[] = new BreakIterator[UScript.CODE_LIMIT];
final RuleBasedBreakIterator breakers[] = new RuleBasedBreakIterator[UScript.CODE_LIMIT];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is already in master

@@ -105,7 +105,7 @@ public RuleBasedBreakIterator getBreakIterator(int script) {
}

//parse a single RBBi rule file
private BreakIterator parseRules(String filename, Environment env) throws IOException {
private RuleBasedBreakIterator parseRules(String filename, Environment env) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

@@ -360,6 +361,10 @@ public Query nullValueQuery() {

public abstract Query existsQuery(QueryShardContext context);

public MatchQuery matchQuery(QueryShardContext context, String analyzer, int slop) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be changed to analyzePhrase(QueryShardContext context, String text, Analyzer analyzer, int slop) ?
It seems to much to allow this override so I'd prefer something more focus on the real purpose.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or even further: phraseQuery(BytesRef[] terms, int[] positions), ie. only a factory for phrase queries and still leave analysis up to the query parser

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, I think the FixedShingleFilter should be able to produce the shingles from the terms and positions even if there are gaps in the phrase query (produced by stop filter).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jimczi I would be cautious about that. This optimization should return the same hits as a phrase query on the original field. On the regular field, a query for "the fox" would be parsed as a term query on fox, so it feels wrong to me that we run it as a term query on _ fox on the shingle field, as it would have different matches. I'd rather like to document that this option can't optimize queries with stop words.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clarifying. I agree this is trappy so +1 to disable this optim when gaps are detected and to add documentation about this limitation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed the signature to analyzePhrase(String, TokenStream, int), and copied the logic from QueryBuilder#analyzePhrase, which simplifies things a lot. Stop words should be handled automatically, because QueryBuilder itself only calls analyzePhrase if the tokenstream has length > 1. So "the fox" would be analyzed to just fox and parsed as a term query, while "Reynard the fox" would get converted to a phrase query for "reynard _" "_ fox"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might not want the latter, since the phrase query on the regular field would match everything that has fox 2 positions after Reynard, while a phrase query for "reynard _" "_ fox" adds the additional constraint that there is no token between them. For instance "Reynard cunning fox" would match on the regular field but not on the shingle field.

Even though this might better fit the original intention of the user, I'd like this feature to be as transparent as possible and not change the set of matches for a given query?

}
matchQuery.setPhraseSlop(slop);
MappedFieldType fieldType = context.fieldMapper(fieldName);
MatchQuery matchQuery = fieldType.matchQuery(context, analyzer, slop);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will the same work for multi_match, query_string and simple_query_string ? I don't think you need to override match_query, calling MappedFieldType#analyzePhrase should be enough ? There's also an issue for query time synonyms where we use span queries instead of phrases but this can be done in a follow up.

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left comments but I'm very excited about this feature! I'm curious whether it would integrate directly with the match phrase prefix query or whether some work needs to be done there too to make it perform efficient phrases when shingles are indexe?

@@ -150,6 +150,9 @@ option. ({pull}30140[#29658])
A new analysis plugin called `analysis_nori` that exposes the Lucene Korean
analysis module. ({pull}30397[#30397])

Text fields now have an `index_phrases` parameter that allows faster phrase queries
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as a non native speaker it feels weird that index_prefix is singular while index_phrases is plural?

@@ -360,6 +361,10 @@ public Query nullValueQuery() {

public abstract Query existsQuery(QueryShardContext context);

public MatchQuery matchQuery(QueryShardContext context, String analyzer, int slop) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or even further: phraseQuery(BytesRef[] terms, int[] positions), ie. only a factory for phrase queries and still leave analysis up to the query parser


If enabled, two-term word combinations ('shingles') are indexed into a separate
field. This allows phrase queries to run more efficiently, at the expense
of a larger index. Accepts `true` or `false` (default).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should document that this feature works better when stop words are not removed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

protected Query analyzePhrase(String field, TokenStream stream, int slop) throws IOException {
assert slop == 0;
Query q = super.analyzePhrase(field + FAST_PHRASE_SUFFIX, new FixedShingleFilter(stream, 2), slop);
logger.info("Phrase query: " + q);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also wondering what happens in case the token stream has gaps because of stop words since we can't use the shingle field in that case?

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good @romseygeek . I left some comments regarding synonyms and some query option that we need to handle.

if (analyzer != null) {
matchQuery.setAnalyzer(analyzer);
}
matchQuery.setAnalyzer(analyzer);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this check not needed ? setAnalyzer will throw an exception if analyzer is null.


stream.reset();
while (stream.incrementToken()) {
position += posIncrAtt.getPositionIncrement();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is an option in query_string to disallow position increment (enable_position_increments) so I guess that we should apply it here too.


If enabled, two-term word combinations ('shingles') are indexed into a separate
field. This allows phrase queries to run more efficiently, at the expense
of a larger index. Accepts `true` or `false` (default).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1


Query q3 = new MatchPhraseQueryBuilder("field", "two words").slop(1).toQuery(queryShardContext);
assertThat(q3, is(new PhraseQuery(1, "field", "two", "word")));

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a test with a single term ?

@@ -360,7 +360,7 @@ protected Query analyzePhrase(String field, TokenStream stream, int slop) throws
throw exc;
}
}
Query query = super.analyzePhrase(field, stream, slop);
Query query = mapper.analyzePhrase(field, stream, slop);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

analyzePhrase is called when there are no synonyms in the query so I think we should also override analyzeMultiPhrase to handle single term synonyms. Since the fixed size shingle filter doesn't handle multi-terms synonyms there is no need to override analyzeGraphPhrase which should continue to use the main field.

return new TextFieldMapper(
name, fieldType, defaultFieldType, positionIncrementGap, prefixMapper,
name, fieldType, defaultFieldType, positionIncrementGap, prefixMapper, indexPhrases,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we do not need to pass it to TextFieldMapper if it is already in the FieldType?

@@ -420,6 +513,10 @@ void setPrefixFieldType(PrefixFieldType prefixFieldType) {
this.prefixFieldType = prefixFieldType;
}

void indexPhrases(boolean indexPhrases) {
this.indexPhrases = indexPhrases;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you should update TextFieldTypeTests to add a new modifier for this property

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd missed TextFieldTypeTest - I also need to add in checks for index_prefix here. I agree with you that it's odd to have index_prefix and index_phrases, so I think we should change the former to be plural as well. I'll open a separate PR for that change, and to add the TextFieldTypeTest modifications for index_prefix

@@ -455,6 +552,28 @@ public Query nullValueQuery() {
return termQuery(nullValue(), null);
}

@Override
public Query analyzePhrase(String field, TokenStream stream, int slop) throws IOException {
if (indexPhrases && slop == 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should also check that there are no gaps in the stream

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments/questions but I like it a lot and think it's very close now. I didn't see any validation around the index options but I think we should disallow enabling this option if the index options do not index positions?

@@ -360,6 +362,10 @@ public Query nullValueQuery() {

public abstract Query existsQuery(QueryShardContext context);

public Query analyzePhrase(String field, TokenStream stream, int slop) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe phraseQuery would be more consistent with other methods here (existsQuery, termQuery, rangeQuery)?

@@ -360,6 +362,10 @@ public Query nullValueQuery() {

public abstract Query existsQuery(QueryShardContext context);

public Query analyzePhrase(String field, TokenStream stream, int slop) throws IOException {
throw new IllegalArgumentException("Can only use phrase queries on keyword and text fields - not on [" + name + "] which is of type [" + typeName() + "]");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe only mention text fields in the error message, even if we don't fail phrase queries on keywords, I'm not sure this is very useful?

@@ -106,6 +119,11 @@ public Builder fielddata(boolean fielddata) {
return builder;
}

public Builder indexPhrases(boolean indexPhrases) {
this.indexPhrases = indexPhrases;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we directly set on the field type?

setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
}
if (parent.storeTermVectorOffsets()) {
setStoreTermVectorOffsets(true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to copy other storeTermVectorXXX settings?

@@ -420,6 +520,11 @@ void setPrefixFieldType(PrefixFieldType prefixFieldType) {
this.prefixFieldType = prefixFieldType;
}

void indexPhrases(boolean indexPhrases) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefix with set to be consistent with most other setters on this class?

PositionIncrementAttribute posIncAtt = stream.getAttribute(PositionIncrementAttribute.class);
stream.reset();
while (stream.incrementToken()) {
if (posIncAtt.getPositionIncrement() > 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's put brackets to be consistent with the rest of the code base?

if (indexPhrases && slop == 0 && hasGaps(stream) == false) {
stream = new FixedShingleFilter(stream, 2);
field = field + FAST_PHRASE_SUFFIX;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like this to work on all token streams and not only cached token streams, so maybe we should do something like:

if (indexPhrases && slop == 0) {
  CachingTokenFilter cachedStream;
  if (stream instanceof CachingTokenFilter) {
    cachedStream = (CachingTokenFilter) stream;
  } else {
    stream = cachedStream = new CachedTokenStrem(stream);
  }
  if (hasGaps(cachedStream) == false) {
    stream = new FixedShingleFilter(stream, 2);
    field = field + FAST_PHRASE_SUFFIX;
  }
}

and change the signature of hasGaps to take a CachingTokenFilter?

@@ -540,6 +691,10 @@ else if (this.prefixFieldMapper != null || mw.prefixFieldMapper != null) {
throw new IllegalArgumentException("mapper [" + name() + "] has different index_prefix settings, current ["
+ this.prefixFieldMapper + "], merged [" + mw.prefixFieldMapper + "]");
}
else if (this.fieldType().indexPhrases != mw.fieldType().indexPhrases) {
throw new IllegalArgumentException("mapper [" + name() + "] has different index_phrase settings, current ["
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/index_phrase/index_phrases/ ?

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good to me overall, we just need one more check that positions are indexed on the field when index_phrases is set to true?

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks Alan!

@romseygeek
Copy link
Contributor Author

@jimczi are you good with this too?

Copy link
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @romseygeek !

@romseygeek romseygeek merged commit 0427339 into elastic:master Jun 4, 2018
@romseygeek romseygeek deleted the index-phrases branch June 4, 2018 07:50
romseygeek added a commit that referenced this pull request Jun 4, 2018
Specifying `index_phrases: true` on a text field mapping will add a subsidiary
[field]._index_phrase field, indexing two-term shingles from the parent field.
The parent analysis chain is re-used, wrapped with a FixedShingleFilter.

At query time, if a phrase match query is executed, the mapping will redirect it
to run against the subsidiary field.

This should trade faster phrase querying for a larger index and longer indexing
times.

Relates to #27049
jimczi added a commit that referenced this pull request Jun 4, 2018
jimczi added a commit that referenced this pull request Jun 4, 2018
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Jun 4, 2018
* master:
  Remove usage of explicit type in docs (elastic#29667)
  Share common readFrom/writeTo code in AcknowledgeResponse (elastic#30983)
  Adapt bwc versions after backporting elastic#31045 to 6.x
  Mute MatchPhrase*QueryBuilderTests
  [Docs] Fix typo in watcher conditions documentation (elastic#30989)
  Remove wrong link in index phrases doc
  Move pipeline APIs to ingest namespace (elastic#31027)
  [DOCS] Fixes accounting setting names (elastic#30863)
  [DOCS] Rewords _field_names documentation (elastic#31029)
  Index phrases (elastic#30450)
dnhatn added a commit that referenced this pull request Jun 4, 2018
* master:
  Match phrase queries against non-indexed fields should throw an exception (#31060)
  In the internal highlighter APIs, use the field type as opposed to the mapper. (#31039)
  [DOCS] Removes duplicated authentication pages
  Enable customizing REST tests blacklist (#31074)
  Make sure KeywordFieldMapper#clone preserves split_queries_on_whitespace. (#31049)
  [DOCS] Moves machine learning overview to stack-docs
  [ML] Add secondary sort to ML events (#31063)
  [Rollup] Specialize validation exception for easier management (#30339)
  Adapt bwc versions after backporting #31045 to 6.3
  Remove usage of explicit type in docs (#29667)
  Share common readFrom/writeTo code in AcknowledgeResponse (#30983)
  Adapt bwc versions after backporting #31045 to 6.x
  Mute MatchPhrase*QueryBuilderTests
  [Docs] Fix typo in watcher conditions documentation (#30989)
  Remove wrong link in index phrases doc
  Move pipeline APIs to ingest namespace (#31027)
  [DOCS] Fixes accounting setting names (#30863)
  [DOCS] Rewords _field_names documentation (#31029)
  Index phrases (#30450)
  Remove leftover debugging from PTCMDT
  Fix PTCMDT#testMinVersionSerialization
  Make Persistent Tasks implementations version and feature aware (#31045)
dnhatn added a commit that referenced this pull request Jun 4, 2018
* 6.x:
  Add TRACE, CONNECT, and PATCH http methods (#31079)
  Change ObjectParser exception (#31030)
  Make sure KeywordFieldMapper#clone preserves split_queries_on_whitespace. (#31049)
  [DOCS] Removes duplicated authentication pages
  [Rollup] Specialize validation exception for easier management (#30339)
  Enable customizing REST tests blacklist (#31074)
  [DOCS] Moves machine learning overview to stack-docs
  [ML] Add secondary sort to ML events (#31063)
  QA: Check rollup job creation safety (#31036)
  Adapt bwc versions after backporting #31045 to 6.3
  Remove usage of explicit type in docs (#29667)
  Move pipeline APIs to ingest namespace (#31027)
  Adapt bwc versions after backporting #31045 to 6.x
  Make Persistent Tasks implementations version and feature aware (#31045)
  Mute MatchPhrase*QueryBuilderTests
  [Docs] Fix typo in watcher conditions documentation (#30989)
  Remove wrong link in index phrases doc
  Reuse expiration date of trial licenses (#31033)
  Index phrases (#30450)
  [DOCS] Fixes accounting setting names (#30863)
  [DOCS] Rewords _field_names documentation (#31029)
jimczi added a commit that referenced this pull request Jun 5, 2018
The test cannot run when no type is registered.

Relates #30450
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>feature :Search/Search Search-related issues that do not fall into other categories v6.4.0 v7.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants