Update the how-to section of the docs for 7.0: (#37717)
 - new `rank_feature`/`script_score` queries
 - new `index_phrases`/`index_prefixes` options
 - disabling `_field_names` doesn't help anymore
 - adaptive replica selection is on by default
jpountz committed Mar 12, 2019
1 parent da1e5cd commit 9305056
Showing 4 changed files with 141 additions and 21 deletions.
7 changes: 0 additions & 7 deletions docs/reference/how-to/indexing-speed.asciidoc
@@ -114,13 +114,6 @@ The default is `10%` which is often plenty: for example, if you give the JVM
10GB of memory, it will give 1GB to the index buffer, which is enough to host
two shards that are heavily indexing.

[float]
=== Disable `_field_names`

The <<mapping-field-names-field,`_field_names` field>> introduces some
index-time overhead, so you might want to disable it if you never need to
run `exists` queries.

[float]
=== Additional optimizations

6 changes: 3 additions & 3 deletions docs/reference/how-to/recipes.asciidoc
@@ -3,9 +3,9 @@

This section includes a few recipes to help with common problems:

* <<mixing-exact-search-with-stemming>>
* <<consistent-scoring>>
* <<mixing-exact-search-with-stemming,Mixing exact search with stemming>>
* <<consistent-scoring,Getting consistent scores>>
* <<static-scoring-signals,Incorporating static relevance signals into the score>>

include::recipes/stemming.asciidoc[]
include::recipes/scoring.asciidoc[]

126 changes: 124 additions & 2 deletions docs/reference/how-to/recipes/scoring.asciidoc
@@ -60,8 +60,8 @@ request do not have similar index statistics and relevancy could be bad.

If you have a small dataset, the easiest way to work around this issue is to
index everything into an index that has a single shard
(`index.number_of_shards: 1`). Then index statistics will be the same for all
documents and scores will be consistent.
(`index.number_of_shards: 1`), which is the default. Then index statistics
will be the same for all documents and scores will be consistent.
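
The effect of per-shard statistics can be illustrated with a small Python
sketch. It assumes the inverse document frequency formula used by Lucene's BM25
similarity, and the document counts are invented for the example. It only
mirrors the arithmetic and does not talk to Elasticsearch.

```python
import math


def bm25_idf(doc_count: int, doc_freq: int) -> float:
    """IDF term of BM25: log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical statistics for the same term on two shards of 1000 documents.
idf_shard_a = bm25_idf(doc_count=1000, doc_freq=10)   # term is rare here
idf_shard_b = bm25_idf(doc_count=1000, doc_freq=200)  # term is common here

# The same documents indexed into a single shard share one set of statistics.
idf_single_shard = bm25_idf(doc_count=2000, doc_freq=210)

print(idf_shard_a, idf_shard_b, idf_single_shard)
```

With two shards, an otherwise identical document gets a higher score on the
shard where the term is rare. With a single shard, every document is scored
against the same statistics.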

Otherwise the recommended way to work around this issue is to use the
<<dfs-query-then-fetch,`dfs_query_then_fetch`>> search type. This will make
@@ -78,3 +78,125 @@ queries, beware that gathering statistics alone might not be cheap since all
terms have to be looked up in the terms dictionaries in order to look up
statistics.

[[static-scoring-signals]]
=== Incorporating static relevance signals into the score

Many domains have static signals that are known to be correlated with relevance.
For instance, https://en.wikipedia.org/wiki/PageRank[PageRank] and URL length are
two commonly used features for web search in order to tune the score of web
pages independently of the query.

There are two main queries that allow combining static score contributions with
textual relevance, e.g. as computed with BM25:
- <<query-dsl-script-score-query,`script_score` query>>
- <<query-dsl-rank-feature-query,`rank_feature` query>>

For instance, imagine that you have a `pagerank` field that you wish to
combine with the BM25 score so that the final score is equal to
`score = bm25_score + pagerank / (10 + pagerank)`.
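
As a rough illustration of the formula above, the following Python sketch
computes the combined score for a few `pagerank` values. The function names and
the BM25 score of `2.0` are made up for the example; it only reproduces the
arithmetic, not how Elasticsearch evaluates queries.

```python
def saturation(value: float, pivot: float) -> float:
    """Return value / (pivot + value), which grows from 0 towards 1."""
    return value / (pivot + value)


def combined_score(bm25_score: float, pagerank: float, pivot: float = 10.0) -> float:
    """Add the saturated static signal to the textual relevance score."""
    return bm25_score + saturation(pagerank, pivot)


# Hypothetical BM25 score of 2.0 for every document.
for pagerank in (0, 10, 90):
    print(pagerank, combined_score(2.0, pagerank))
```

The static contribution is exactly `0.5` when `pagerank` equals the pivot and
never reaches `1`, so very large `pagerank` values cannot drown out the textual
score.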

With the <<query-dsl-script-score-query,`script_score` query>> the query would
look like this:

//////////////////////////
[source,js]
--------------------------------------------------
PUT index
{
  "mappings": {
    "properties": {
      "body": {
        "type": "text"
      },
      "pagerank": {
        "type": "long"
      }
    }
  }
}
--------------------------------------------------
// CONSOLE
// TEST
//////////////////////////

[source,js]
--------------------------------------------------
GET index/_search
{
  "query" : {
    "script_score" : {
      "query" : {
        "match": { "body": "elasticsearch" }
      },
      "script" : {
        "source" : "_score + saturation(doc['pagerank'].value, 10)" <1>
      }
    }
  }
}
--------------------------------------------------
// CONSOLE
//TEST[continued]
<1> `pagerank` must be mapped as a <<number>>

With the <<query-dsl-rank-feature-query,`rank_feature` query>>, it would
look like this:

//////////////////////////
[source,js]
--------------------------------------------------
PUT index
{
  "mappings": {
    "properties": {
      "body": {
        "type": "text"
      },
      "pagerank": {
        "type": "rank_feature"
      }
    }
  }
}
--------------------------------------------------
// CONSOLE
// TEST
//////////////////////////

[source,js]
--------------------------------------------------
GET _search
{
  "query" : {
    "bool" : {
      "must": {
        "match": { "body": "elasticsearch" }
      },
      "should": {
        "rank_feature": {
          "field": "pagerank", <1>
          "saturation": {
            "pivot": 10
          }
        }
      }
    }
  }
}
--------------------------------------------------
// CONSOLE
<1> `pagerank` must be mapped as a <<rank-feature,`rank_feature`>> field

While both options would return similar scores, there are trade-offs:
<<query-dsl-script-score-query,`script_score`>> provides a lot of flexibility,
enabling you to combine the text relevance score with static signals as you
prefer. On the other hand, the <<query-dsl-rank-feature-query,`rank_feature` query>> only
exposes a couple of ways to incorporate static signals into the score. However,
it relies on the <<rank-feature,`rank_feature`>> and
<<rank-features,`rank_features`>> fields, which index values in a special way
that allows the <<query-dsl-rank-feature-query,`rank_feature` query>> to skip
over non-competitive documents and get the top matches of a query faster.
23 changes: 14 additions & 9 deletions docs/reference/how-to/search-speed.asciidoc
@@ -395,15 +395,6 @@ be able to cope with `max_failures` node failures at once at most, then the
right number of replicas for you is
`max(max_failures, ceil(num_nodes / num_primaries) - 1)`.
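
The formula above can be evaluated with a short Python sketch. The function
name and the cluster sizes are hypothetical and only serve as a worked example.

```python
import math


def recommended_replicas(num_nodes: int, num_primaries: int, max_failures: int) -> int:
    """Evaluate max(max_failures, ceil(num_nodes / num_primaries) - 1)."""
    return max(max_failures, math.ceil(num_nodes / num_primaries) - 1)


# Hypothetical cluster: 5 nodes, 2 primary shards, tolerate 1 node failure.
print(recommended_replicas(num_nodes=5, num_primaries=2, max_failures=1))
```

In this example the throughput term `ceil(5 / 2) - 1` is larger than
`max_failures`, so it determines the result.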

[float]
=== Turn on adaptive replica selection

When multiple copies of data are present, elasticsearch can use a set of
criteria called <<search-adaptive-replica,adaptive replica selection>> to select
the best copy of the data based on response time, service time, and queue size
of the node containing each copy of the shard. This can improve query throughput
and reduce latency for search-heavy applications.

=== Tune your queries with the Profile API

You can also analyse how expensive each component of your queries and
@@ -419,3 +410,17 @@ Some caveats to the Profile API are that:
- the Profile API as a debugging tool adds significant overhead to search execution and can also have a very verbose output
- given the added overhead, the resulting took times are not reliable indicators of actual took time, but can be used comparatively between clauses for relative timing differences
- the Profile API is best for exploring possible reasons behind the most costly clauses of a query but isn't intended for accurately measuring absolute timings of each clause

=== Faster phrase queries with `index_phrases`

The <<text,`text`>> field has an <<index-phrases,`index_phrases`>> option that
indexes 2-shingles and is automatically leveraged by query parsers to run phrase
queries that don't have a slop. If your use-case involves running lots of phrase
queries, this can speed up queries significantly.
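
To make the notion of 2-shingles concrete, here is a minimal Python sketch that
builds them from a list of tokens. The helper name and the whitespace
tokenization are assumptions for illustration; the actual indexing is done by
Elasticsearch at analysis time.

```python
from typing import List


def two_shingles(tokens: List[str]) -> List[str]:
    """Join each pair of adjacent tokens into a single two-word term."""
    return [f"{first} {second}" for first, second in zip(tokens, tokens[1:])]


# Naive whitespace tokenization, used here only for the example.
print(two_shingles("quick brown fox".split()))
```

A phrase query without slop can then be answered by looking up these two-word
terms instead of checking the positions of each individual word.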

=== Faster prefix queries with `index_prefixes`

The <<text,`text`>> field has an <<index-prefixes,`index_prefixes`>> option that
indexes prefixes of all terms and is automatically leveraged by query parsers to
run prefix queries. If your use-case involves running lots of prefix queries,
this can speed up queries significantly.
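
The idea of indexing term prefixes can be sketched in Python as follows. The
helper is hypothetical, and the bounds of 2 and 5 characters are assumptions
chosen for the example, so check the `index_prefixes` documentation for the
limits that apply to your mapping.

```python
from typing import List


def term_prefixes(term: str, min_chars: int = 2, max_chars: int = 5) -> List[str]:
    """Return the prefixes of term whose length is within the given bounds."""
    longest = min(max_chars, len(term))
    return [term[:length] for length in range(min_chars, longest + 1)]


print(term_prefixes("search"))
```

A prefix query whose length falls within the indexed bounds can then be run as
a lookup of a single indexed prefix rather than an expansion over all matching
terms.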
