The biggest disadvantage of keeping stopwords is that of performance. When
Elasticsearch performs a full-text search, it has to calculate the relevance
_score
on all matching documents in order to return the top 10 matches.
While most words typically occur in much fewer than 0.1% of all documents, a
few words such as the
may occur in almost all of them. Imagine you have an
index of one million documents. A query for quick brown fox
may match fewer
than 1,000 documents. But a query for the quick brown fox
has to score and
sort almost all of the one million documents in your index, just in order to
return the top 10!
The problem is that the quick brown fox
is really a query for the OR quick
OR brown OR fox
—any document that contains nothing more than the almost
meaningless term the
is included in the result set. What we need is a way of
reducing the number of documents that need to be scored.
The easiest way to reduce the number of documents is simply to use the
and
operator with the match
query, in order
to make all words required.
A match
query like this:
{
"match": {
"text": {
"query": "the quick brown fox",
"operator": "and"
}
}
}
is rewritten as a bool
query like this:
{
"bool": {
"must": [
{ "term": { "text": "the" }},
{ "term": { "text": "quick" }},
{ "term": { "text": "brown" }},
{ "term": { "text": "fox" }}
]
}
}
The bool
query is intelligent enough to execute each term
query in the
optimal order—it starts with the least frequent term. Because all terms
are required, only documents that contain the least frequent term can possibly
match. Using the and
operator greatly speeds up multiterm queries.
In [match-precision], we discussed using the minimum_should_match
operator
to trim the long tail of less-relevant results. It is useful for this purpose
alone but, as a nice side effect, it offers a similar performance benefit to
the and
operator:
{
"match": {
"text": {
"query": "the quick brown fox",
"minimum_should_match": "75%"
}
}
}
In this example, at least three out of the four terms must match. This means that the only docs that need to be considered are those that contain either the least or second least frequent terms.
This offers a huge performance gain over a simple query with the default or
operator! But we can do better yet…