This repository has been archived by the owner on Apr 4, 2023. It is now read-only.
Avoid a prefix-related worst-case scenario in the proximity criterion #733
Pull Request
Related issue
Somewhat fixes (until merged into meilisearch) meilisearch/meilisearch#3118
What does this PR do?
When a query ends with a word followed by a prefix (referred to below as the pair "word pre"), we first determine whether "pre" could possibly be in the proximity prefix database before querying it. There are then three possibilities:

1. "pre" is not in any prefix cache because it is not the prefix of many words. We don't query the proximity prefix database. Instead, we list all the word derivations of "pre" through the FST and query the regular proximity databases.
2. "pre" is in the prefix cache but cannot be found in the proximity prefix databases. In this case, we partially disable the proximity ranking rule for the pair "word pre". This is done by only considering the case where "word" is in proximity to "pre" exactly (no derivations).
3. "pre" is in the prefix cache and can be found in the proximity prefix databases. In this case, we simply query the proximity prefix databases.

Note that if a prefix is longer than 2 bytes, then it cannot be in the proximity prefix databases. Also, proximities larger than 4 are not present in these databases either. Therefore, the impact on relevancy is:
Regarding (1), it means that these two documents would be considered equally relevant according to the proximity rule for the query "heard pr" (if "pr" is the prefix of more than 200 words in the dataset):

Regarding (2), it means that two documents would be considered equally relevant according to the proximity rule for the query "faster pro":

But the following document would be considered more relevant than the two documents above:
Note, however, that this change of behaviour only occurs when using the set-based version of the proximity criterion. In cases where there are fewer than 1000 candidate documents when the proximity criterion is called, this PR does not change anything.
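For reviewers who prefer code to prose, the three-way decision described above can be sketched roughly as follows. This is only an illustration and not the code in this PR: `PrefixStrategy`, `choose_strategy`, and the `HashSet` standing in for the prefix cache are hypothetical names. The sketch also assumes that "cannot be found in the proximity prefix databases" reduces to the two limits stated above (prefix longer than 2 bytes, or proximity larger than 4).

```rust
use std::collections::HashSet;

// Hypothetical types and names for illustration; this is not milli's actual API.
#[derive(Debug, PartialEq)]
enum PrefixStrategy {
    /// Case 1: list the word derivations through the FST and query
    /// the regular proximity databases.
    ExpandDerivations,
    /// Case 2: partially disable the proximity rule; only the exact
    /// word (no derivations) is matched against the prefix.
    ExactWordOnly,
    /// Case 3: query the proximity prefix databases directly.
    ProximityPrefixDb,
}

// Limits taken from the PR description.
const MAX_PREFIX_BYTES: usize = 2;
const MAX_PROXIMITY: u8 = 4;

/// Picks a strategy for the pair "word pre".
/// Assumption: a cached prefix is missing from the proximity prefix
/// databases exactly when it exceeds one of the two limits above.
fn choose_strategy(prefix: &str, proximity: u8, prefix_cache: &HashSet<String>) -> PrefixStrategy {
    if !prefix_cache.contains(prefix) {
        PrefixStrategy::ExpandDerivations
    } else if prefix.len() > MAX_PREFIX_BYTES || proximity > MAX_PROXIMITY {
        PrefixStrategy::ExactWordOnly
    } else {
        PrefixStrategy::ProximityPrefixDb
    }
}

fn main() {
    // Pretend that "pr" and "pro" are common enough to be in the prefix cache.
    let prefix_cache: HashSet<String> = ["pr", "pro"].iter().map(|s| s.to_string()).collect();
    for (prefix, proximity) in [("pre", 1u8), ("pro", 1), ("pr", 6), ("pr", 2)] {
        let strategy = choose_strategy(prefix, proximity, &prefix_cache);
        println!("{:?} at proximity {}: {:?}", prefix, proximity, strategy);
    }
}
```

In this sketch, an uncached prefix always falls back to the derivation path, which matches the statement that nothing changes for uncommon prefixes.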
Performance
I couldn't use the existing search benchmarks to measure the impact of the PR, but I did some manual tests with the "songs" benchmark dataset.

Performance is often significantly better, but there is also one regression in the set-based implementation with the query "b b b b b b b b b b".