-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Index phrases #30450
Index phrases #30450
Conversation
The index prefix field is normally indexed as docs-only, given that it cannot be used in phrases. However, in the case that the parent field has been indexed with offsets, or has term-vector offsets, we should also store this in the index prefix field for highlighting. Note that this commit does not implement highlighting on prefix fields, but rather ensures that future work can implement this without a backwards-break in index data. Closes #28994
Specifying `index_phrases: true` on a text field mapping will add a subsidiary [field]._index_phrase field, indexing two-term shingles from the parent field. The parent analysis chain is re-used, wrapped with a FixedShingleFilter. At query time, if a phrase match query is executed, the mapping will redirect it to run against the subsidiary field. This should trade faster phrase querying for a larger index and longer indexing times.
Pinging @elastic/es-search-aggs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some comments. I think we should have a simple function in MappedFieldType
that can build phrase queries from a text+analyzer. This would simplify the usage and keep the focus on phrase queries building.
|
||
PhraseFieldType(String name) { | ||
setTokenized(true); | ||
setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should index offsets if the parent field activated them.
protected Query analyzePhrase(String field, TokenStream stream, int slop) throws IOException { | ||
assert slop == 0; | ||
Query q = super.analyzePhrase(field + FAST_PHRASE_SUFFIX, new FixedShingleFilter(stream, 2), slop); | ||
logger.info("Phrase query: " + q); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
debug, trace ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what happens if you search for a single term in a phrase query ? Shouldn't we fallback to the parent field since it won't match the shingles ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also wondering what happens in case the token stream has gaps because of stop words since we can't use the shingle field in that case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should work since FixedShingleFilter
adds gap in the shingle in the form of a filler token (defaults to _
).
@@ -80,7 +80,7 @@ private ICUTokenizerConfig getIcuConfig(Environment env, Settings settings) { | |||
if (tailored.isEmpty()) { | |||
return null; | |||
} else { | |||
final BreakIterator breakers[] = new BreakIterator[UScript.CODE_LIMIT]; | |||
final RuleBasedBreakIterator breakers[] = new RuleBasedBreakIterator[UScript.CODE_LIMIT]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change is already in master
@@ -105,7 +105,7 @@ public RuleBasedBreakIterator getBreakIterator(int script) { | |||
} | |||
|
|||
//parse a single RBBi rule file | |||
private BreakIterator parseRules(String filename, Environment env) throws IOException { | |||
private RuleBasedBreakIterator parseRules(String filename, Environment env) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here
@@ -360,6 +361,10 @@ public Query nullValueQuery() { | |||
|
|||
public abstract Query existsQuery(QueryShardContext context); | |||
|
|||
public MatchQuery matchQuery(QueryShardContext context, String analyzer, int slop) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be changed to analyzePhrase(QueryShardContext context, String text, Analyzer analyzer, int slop)
?
It seems to much to allow this override so I'd prefer something more focus on the real purpose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or even further: phraseQuery(BytesRef[] terms, int[] positions)
, ie. only a factory for phrase queries and still leave analysis up to the query parser
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, I think the FixedShingleFilter
should be able to produce the shingles from the terms
and positions
even if there are gaps in the phrase query (produced by stop filter).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jimczi I would be cautious about that. This optimization should return the same hits as a phrase query on the original field. On the regular field, a query for "the fox"
would be parsed as a term query on fox
, so it feels wrong to me that we run it as a term query on _ fox
on the shingle field, as it would have different matches. I'd rather like to document that this option can't optimize queries with stop words.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for clarifying. I agree this is trappy so +1 to disable this optim when gaps are detected and to add documentation about this limitation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've changed the signature to analyzePhrase(String, TokenStream, int)
, and copied the logic from QueryBuilder#analyzePhrase
, which simplifies things a lot. Stop words should be handled automatically, because QueryBuilder itself only calls analyzePhrase
if the tokenstream has length > 1. So "the fox"
would be analyzed to just fox
and parsed as a term query, while "Reynard the fox"
would get converted to a phrase query for "reynard _" "_ fox"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might not want the latter, since the phrase query on the regular field would match everything that has fox
2 positions after Reynard
, while a phrase query for "reynard _" "_ fox"
adds the additional constraint that there is no token between them. For instance "Reynard cunning fox" would match on the regular field but not on the shingle field.
Even though this might better fit the original intention of the user, I'd like this feature to be as transparent as possible and not change the set of matches for a given query?
} | ||
matchQuery.setPhraseSlop(slop); | ||
MappedFieldType fieldType = context.fieldMapper(fieldName); | ||
MatchQuery matchQuery = fieldType.matchQuery(context, analyzer, slop); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will the same work for multi_match
, query_string
and simple_query_string
? I don't think you need to override match_query
, calling MappedFieldType#analyzePhrase
should be enough ? There's also an issue for query time synonyms where we use span queries instead of phrases but this can be done in a follow up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left comments but I'm very excited about this feature! I'm curious whether it would integrate directly with the match phrase prefix query or whether some work needs to be done there too to make it perform efficient phrases when shingles are indexe?
docs/CHANGELOG.asciidoc
Outdated
@@ -150,6 +150,9 @@ option. ({pull}30140[#29658]) | |||
A new analysis plugin called `analysis_nori` that exposes the Lucene Korean | |||
analysis module. ({pull}30397[#30397]) | |||
|
|||
Text fields now have an `index_phrases` parameter that allows faster phrase queries |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as a non native speaker it feels weird that index_prefix
is singular while index_phrases
is plural?
@@ -360,6 +361,10 @@ public Query nullValueQuery() { | |||
|
|||
public abstract Query existsQuery(QueryShardContext context); | |||
|
|||
public MatchQuery matchQuery(QueryShardContext context, String analyzer, int slop) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or even further: phraseQuery(BytesRef[] terms, int[] positions)
, ie. only a factory for phrase queries and still leave analysis up to the query parser
|
||
If enabled, two-term word combinations ('shingles') are indexed into a separate | ||
field. This allows phrase queries to run more efficiently, at the expense | ||
of a larger index. Accepts `true` or `false` (default). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should document that this feature works better when stop words are not removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
protected Query analyzePhrase(String field, TokenStream stream, int slop) throws IOException { | ||
assert slop == 0; | ||
Query q = super.analyzePhrase(field + FAST_PHRASE_SUFFIX, new FixedShingleFilter(stream, 2), slop); | ||
logger.info("Phrase query: " + q); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also wondering what happens in case the token stream has gaps because of stop words since we can't use the shingle field in that case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good @romseygeek . I left some comments regarding synonyms and some query option that we need to handle.
if (analyzer != null) { | ||
matchQuery.setAnalyzer(analyzer); | ||
} | ||
matchQuery.setAnalyzer(analyzer); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this check not needed ? setAnalyzer
will throw an exception if analyzer
is null.
|
||
stream.reset(); | ||
while (stream.incrementToken()) { | ||
position += posIncrAtt.getPositionIncrement(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is an option in query_string
to disallow position increment (enable_position_increments
) so I guess that we should apply it here too.
|
||
If enabled, two-term word combinations ('shingles') are indexed into a separate | ||
field. This allows phrase queries to run more efficiently, at the expense | ||
of a larger index. Accepts `true` or `false` (default). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
|
||
Query q3 = new MatchPhraseQueryBuilder("field", "two words").slop(1).toQuery(queryShardContext); | ||
assertThat(q3, is(new PhraseQuery(1, "field", "two", "word"))); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add a test with a single term ?
@@ -360,7 +360,7 @@ protected Query analyzePhrase(String field, TokenStream stream, int slop) throws | |||
throw exc; | |||
} | |||
} | |||
Query query = super.analyzePhrase(field, stream, slop); | |||
Query query = mapper.analyzePhrase(field, stream, slop); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
analyzePhrase
is called when there are no synonyms in the query so I think we should also override analyzeMultiPhrase
to handle single term synonyms. Since the fixed size shingle filter doesn't handle multi-terms synonyms there is no need to override analyzeGraphPhrase
which should continue to use the main field.
return new TextFieldMapper( | ||
name, fieldType, defaultFieldType, positionIncrementGap, prefixMapper, | ||
name, fieldType, defaultFieldType, positionIncrementGap, prefixMapper, indexPhrases, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we do not need to pass it to TextFieldMapper if it is already in the FieldType?
@@ -420,6 +513,10 @@ void setPrefixFieldType(PrefixFieldType prefixFieldType) { | |||
this.prefixFieldType = prefixFieldType; | |||
} | |||
|
|||
void indexPhrases(boolean indexPhrases) { | |||
this.indexPhrases = indexPhrases; | |||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you should update TextFieldTypeTests
to add a new modifier for this property
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd missed TextFieldTypeTest - I also need to add in checks for index_prefix
here. I agree with you that it's odd to have index_prefix
and index_phrases
, so I think we should change the former to be plural as well. I'll open a separate PR for that change, and to add the TextFieldTypeTest modifications for index_prefix
@@ -455,6 +552,28 @@ public Query nullValueQuery() { | |||
return termQuery(nullValue(), null); | |||
} | |||
|
|||
@Override | |||
public Query analyzePhrase(String field, TokenStream stream, int slop) throws IOException { | |||
if (indexPhrases && slop == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should also check that there are no gaps in the stream
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some comments/questions but I like it a lot and think it's very close now. I didn't see any validation around the index options but I think we should disallow enabling this option if the index options do not index positions?
@@ -360,6 +362,10 @@ public Query nullValueQuery() { | |||
|
|||
public abstract Query existsQuery(QueryShardContext context); | |||
|
|||
public Query analyzePhrase(String field, TokenStream stream, int slop) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe phraseQuery
would be more consistent with other methods here (existsQuery
, termQuery
, rangeQuery
)?
@@ -360,6 +362,10 @@ public Query nullValueQuery() { | |||
|
|||
public abstract Query existsQuery(QueryShardContext context); | |||
|
|||
public Query analyzePhrase(String field, TokenStream stream, int slop) throws IOException { | |||
throw new IllegalArgumentException("Can only use phrase queries on keyword and text fields - not on [" + name + "] which is of type [" + typeName() + "]"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe only mention text fields in the error message, even if we don't fail phrase queries on keywords, I'm not sure this is very useful?
@@ -106,6 +119,11 @@ public Builder fielddata(boolean fielddata) { | |||
return builder; | |||
} | |||
|
|||
public Builder indexPhrases(boolean indexPhrases) { | |||
this.indexPhrases = indexPhrases; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we directly set on the field type?
setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS); | ||
} | ||
if (parent.storeTermVectorOffsets()) { | ||
setStoreTermVectorOffsets(true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to copy other storeTermVectorXXX settings?
@@ -420,6 +520,11 @@ void setPrefixFieldType(PrefixFieldType prefixFieldType) { | |||
this.prefixFieldType = prefixFieldType; | |||
} | |||
|
|||
void indexPhrases(boolean indexPhrases) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prefix with set
to be consistent with most other setters on this class?
PositionIncrementAttribute posIncAtt = stream.getAttribute(PositionIncrementAttribute.class); | ||
stream.reset(); | ||
while (stream.incrementToken()) { | ||
if (posIncAtt.getPositionIncrement() > 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's put brackets to be consistent with the rest of the code base?
if (indexPhrases && slop == 0 && hasGaps(stream) == false) { | ||
stream = new FixedShingleFilter(stream, 2); | ||
field = field + FAST_PHRASE_SUFFIX; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like this to work on all token streams and not only cached token streams, so maybe we should do something like:
if (indexPhrases && slop == 0) {
CachingTokenFilter cachedStream;
if (stream instanceof CachingTokenFilter) {
cachedStream = (CachingTokenFilter) stream;
} else {
stream = cachedStream = new CachedTokenStrem(stream);
}
if (hasGaps(cachedStream) == false) {
stream = new FixedShingleFilter(stream, 2);
field = field + FAST_PHRASE_SUFFIX;
}
}
and change the signature of hasGaps
to take a CachingTokenFilter
?
@@ -540,6 +691,10 @@ else if (this.prefixFieldMapper != null || mw.prefixFieldMapper != null) { | |||
throw new IllegalArgumentException("mapper [" + name() + "] has different index_prefix settings, current [" | |||
+ this.prefixFieldMapper + "], merged [" + mw.prefixFieldMapper + "]"); | |||
} | |||
else if (this.fieldType().indexPhrases != mw.fieldType().indexPhrases) { | |||
throw new IllegalArgumentException("mapper [" + name() + "] has different index_phrase settings, current [" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/index_phrase/index_phrases/ ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks good to me overall, we just need one more check that positions are indexed on the field when index_phrases is set to true?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks Alan!
@jimczi are you good with this too? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks @romseygeek !
Specifying `index_phrases: true` on a text field mapping will add a subsidiary [field]._index_phrase field, indexing two-term shingles from the parent field. The parent analysis chain is re-used, wrapped with a FixedShingleFilter. At query time, if a phrase match query is executed, the mapping will redirect it to run against the subsidiary field. This should trade faster phrase querying for a larger index and longer indexing times. Relates to #27049
* master: Remove usage of explicit type in docs (elastic#29667) Share common readFrom/writeTo code in AcknowledgeResponse (elastic#30983) Adapt bwc versions after backporting elastic#31045 to 6.x Mute MatchPhrase*QueryBuilderTests [Docs] Fix typo in watcher conditions documentation (elastic#30989) Remove wrong link in index phrases doc Move pipeline APIs to ingest namespace (elastic#31027) [DOCS] Fixes accounting setting names (elastic#30863) [DOCS] Rewords _field_names documentation (elastic#31029) Index phrases (elastic#30450)
* master: Match phrase queries against non-indexed fields should throw an exception (#31060) In the internal highlighter APIs, use the field type as opposed to the mapper. (#31039) [DOCS] Removes duplicated authentication pages Enable customizing REST tests blacklist (#31074) Make sure KeywordFieldMapper#clone preserves split_queries_on_whitespace. (#31049) [DOCS] Moves machine learning overview to stack-docs [ML] Add secondary sort to ML events (#31063) [Rollup] Specialize validation exception for easier management (#30339) Adapt bwc versions after backporting #31045 to 6.3 Remove usage of explicit type in docs (#29667) Share common readFrom/writeTo code in AcknowledgeResponse (#30983) Adapt bwc versions after backporting #31045 to 6.x Mute MatchPhrase*QueryBuilderTests [Docs] Fix typo in watcher conditions documentation (#30989) Remove wrong link in index phrases doc Move pipeline APIs to ingest namespace (#31027) [DOCS] Fixes accounting setting names (#30863) [DOCS] Rewords _field_names documentation (#31029) Index phrases (#30450) Remove leftover debugging from PTCMDT Fix PTCMDT#testMinVersionSerialization Make Persistent Tasks implementations version and feature aware (#31045)
* 6.x: Add TRACE, CONNECT, and PATCH http methods (#31079) Change ObjectParser exception (#31030) Make sure KeywordFieldMapper#clone preserves split_queries_on_whitespace. (#31049) [DOCS] Removes duplicated authentication pages [Rollup] Specialize validation exception for easier management (#30339) Enable customizing REST tests blacklist (#31074) [DOCS] Moves machine learning overview to stack-docs [ML] Add secondary sort to ML events (#31063) QA: Check rollup job creation safety (#31036) Adapt bwc versions after backporting #31045 to 6.3 Remove usage of explicit type in docs (#29667) Move pipeline APIs to ingest namespace (#31027) Adapt bwc versions after backporting #31045 to 6.x Make Persistent Tasks implementations version and feature aware (#31045) Mute MatchPhrase*QueryBuilderTests [Docs] Fix typo in watcher conditions documentation (#30989) Remove wrong link in index phrases doc Reuse expiration date of trial licenses (#31033) Index phrases (#30450) [DOCS] Fixes accounting setting names (#30863) [DOCS] Rewords _field_names documentation (#31029)
The test cannot run when no type is registered. Relates #30450
This commit adds an
index_phrases
option to text fields, which if enabled creates a subsidiary field containing two-term shingles from the main field. Phrase match queries are then redirected to this subsidiary field.