Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Index phrases #30450

Merged
merged 24 commits into from
Jun 4, 2018
Merged
Show file tree
Hide file tree
Changes from 10 commits
Commits
Show all changes
24 commits
Select commit Hold shift + click to select a range
e39d396
Store offsets in index prefix fields when stored in the parent field
romseygeek Mar 14, 2018
e931107
Merge branch 'master' into index-phrases
romseygeek Apr 12, 2018
3719e30
Merge branch 'master' into index-phrases
romseygeek Apr 24, 2018
4cea300
Upgrade to lucene 7.4.0 snapshot
romseygeek Apr 24, 2018
efd612a
Merge branch 'master' into index-phrases
romseygeek May 2, 2018
b591fc4
Add the ability to index two-term shingles for faster phrase queries
romseygeek May 8, 2018
aca2b7e
docs
romseygeek May 8, 2018
910f0c1
Merge branch 'master' into index-phrases
romseygeek May 8, 2018
e0fe29d
Placate checkstyle
romseygeek May 8, 2018
c156cbd
changelog
romseygeek May 8, 2018
007ee3d
Merge branch 'master' into index-phrases
romseygeek May 18, 2018
66b1e48
MappedFieldType.matchQuery -> .analyzePhrase
romseygeek May 22, 2018
4d6cb66
Merge remote-tracking branch 'origin/master' into index-phrases
romseygeek May 22, 2018
9eb8a6d
iter
romseygeek May 24, 2018
69cf210
Merge remote-tracking branch 'origin/master' into index-phrases
romseygeek May 24, 2018
0deebb6
iter
romseygeek May 24, 2018
6cfa4b1
iter
romseygeek May 29, 2018
6208c31
Merge branch 'master' into index-phrases
romseygeek May 29, 2018
1d7852e
Check for positions
romseygeek May 29, 2018
289186e
Merge branch 'master' into index-phrases
romseygeek May 30, 2018
518e280
Merge branch 'master' into index-phrases
romseygeek May 30, 2018
8dd5cd5
index_phrases isn't an updateable setting
romseygeek May 30, 2018
dab97ad
Merge remote-tracking branch 'origin/master' into index-phrases
romseygeek Jun 1, 2018
b2e732a
Merge remote-tracking branch 'origin/master' into index-phrases
romseygeek Jun 1, 2018
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions docs/CHANGELOG.asciidoc
Original file line number Diff line number Diff line change
Expand Up @@ -150,6 +150,9 @@ option. ({pull}30140[#29658])
A new analysis plugin called `analysis_nori` that exposes the Lucene Korean
analysis module. ({pull}30397[#30397])

Text fields now have an `index_phrases` parameter that allows faster phrase queries
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as a non native speaker it feels weird that index_prefix is singular while index_phrases is plural?

at the expense of a larger index ({pull}30450[#30450])

[float]
=== Enhancements

Expand Down
6 changes: 6 additions & 0 deletions docs/reference/mapping/types/text.asciidoc
Original file line number Diff line number Diff line change
Expand Up @@ -96,6 +96,12 @@ The following parameters are accepted by `text` fields:
the expense of a larger index. Accepts an
<<index-prefix-config,`index-prefix configuration block`>>

<<index-phrases,`index_phrases`>>::

If enabled, two-term word combinations ('shingles') are indexed into a separate
field. This allows phrase queries to run more efficiently, at the expense
of a larger index. Accepts `true` or `false` (default).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should document that this feature works better when stop words are not removed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1


<<norms,`norms`>>::

Whether field-length should be taken into account when scoring queries.
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -80,7 +80,7 @@ private ICUTokenizerConfig getIcuConfig(Environment env, Settings settings) {
if (tailored.isEmpty()) {
return null;
} else {
final BreakIterator breakers[] = new BreakIterator[UScript.CODE_LIMIT];
final RuleBasedBreakIterator breakers[] = new RuleBasedBreakIterator[UScript.CODE_LIMIT];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is already in master

for (Map.Entry<Integer, String> entry : tailored.entrySet()) {
int code = entry.getKey();
String resourcePath = entry.getValue();
Expand All @@ -105,7 +105,7 @@ public RuleBasedBreakIterator getBreakIterator(int script) {
}

//parse a single RBBi rule file
private BreakIterator parseRules(String filename, Environment env) throws IOException {
private RuleBasedBreakIterator parseRules(String filename, Environment env) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here


final Path path = env.configFile().resolve(filename);
String rules = Files.readAllLines(path)
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,67 @@
---
"search with indexed phrases":
- skip:
version: " - 6.99.99"
reason: index_phrase is only available as of 7.0.0
- do:
indices.create:
index: test
body:
mappings:
test:
properties:
text:
type: text
index_phrases: true

- do:
index:
index: test
type: test
id: 1
body: { text: "peter piper picked a peck of pickled peppers" }

- do:
indices.refresh:
index: [test]

- do:
search:
index: test
body:
query:
match_phrase:
text:
query: "peter piper"

- match: {hits.total: 1}

- do:
search:
index: test
q: '"peter piper"~1'
df: text

- match: {hits.total: 1}

- do:
search:
index: test
body:
query:
match_phrase:
text: "peter piper picked"

- match: {hits.total: 1}

- do:
search:
index: test
body:
query:
match_phrase:
text: "piper"

- match: {hits.total: 1}


Original file line number Diff line number Diff line change
Expand Up @@ -43,6 +43,7 @@
import org.elasticsearch.index.query.QueryRewriteContext;
import org.elasticsearch.index.query.QueryShardContext;
import org.elasticsearch.index.query.QueryShardException;
import org.elasticsearch.index.search.MatchQuery;
import org.elasticsearch.index.similarity.SimilarityProvider;
import org.elasticsearch.search.DocValueFormat;
import org.joda.time.DateTimeZone;
Expand Down Expand Up @@ -360,6 +361,10 @@ public Query nullValueQuery() {

public abstract Query existsQuery(QueryShardContext context);

public MatchQuery matchQuery(QueryShardContext context, String analyzer, int slop) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be changed to analyzePhrase(QueryShardContext context, String text, Analyzer analyzer, int slop) ?
It seems to much to allow this override so I'd prefer something more focus on the real purpose.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or even further: phraseQuery(BytesRef[] terms, int[] positions), ie. only a factory for phrase queries and still leave analysis up to the query parser

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, I think the FixedShingleFilter should be able to produce the shingles from the terms and positions even if there are gaps in the phrase query (produced by stop filter).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jimczi I would be cautious about that. This optimization should return the same hits as a phrase query on the original field. On the regular field, a query for "the fox" would be parsed as a term query on fox, so it feels wrong to me that we run it as a term query on _ fox on the shingle field, as it would have different matches. I'd rather like to document that this option can't optimize queries with stop words.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clarifying. I agree this is trappy so +1 to disable this optim when gaps are detected and to add documentation about this limitation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed the signature to analyzePhrase(String, TokenStream, int), and copied the logic from QueryBuilder#analyzePhrase, which simplifies things a lot. Stop words should be handled automatically, because QueryBuilder itself only calls analyzePhrase if the tokenstream has length > 1. So "the fox" would be analyzed to just fox and parsed as a term query, while "Reynard the fox" would get converted to a phrase query for "reynard _" "_ fox"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might not want the latter, since the phrase query on the regular field would match everything that has fox 2 positions after Reynard, while a phrase query for "reynard _" "_ fox" adds the additional constraint that there is no token between them. For instance "Reynard cunning fox" would match on the regular field but not on the shingle field.

Even though this might better fit the original intention of the user, I'd like this feature to be as transparent as possible and not change the set of matches for a given query?

throw new QueryShardException(context, "Can only use match queries on keyword and text fields - not on [" + name + "] which is of type [" + typeName() + "]");
}

/**
* An enum used to describe the relation between the range of terms in a
* shard when compared with a query range
Expand Down
Loading