This repository has been archived by the owner on Apr 4, 2023. It is now read-only.

Avoid a prefix-related worst-case scenario in the proximity criterion #733

Merged — 3 commits merged into main from proximity-prefix-perf on Jan 4, 2023

Conversation

@loiclec (Contributor) commented Dec 8, 2022

Pull Request

Related issue

Partially fixes meilisearch/meilisearch#3118 (it will be fully fixed once this change is merged into meilisearch)

What does this PR do?

When a query ends with a word and a prefix, such as:

word pr

Then we first determine whether pr could possibly be in the proximity prefix database before querying it. There are three possibilities:

  1. pr is not in any prefix cache because it is not the prefix of many words. We don't query the proximity prefix database. Instead, we list all the word derivations of pr through the FST and query the regular proximity databases.

  2. pr is in the prefix cache but cannot be found in the proximity prefix databases. In this case, we partially disable the proximity ranking rule for the pair word pr. This is done as follows:

    1. Only find the documents where word is in proximity to pr exactly (no derivations)
    2. Otherwise, assume that the proximity between the two words in all the documents in which they coexist is >= 8
  3. pr is in the prefix cache and can be found in the proximity prefix databases. In this case, we simply query the proximity prefix databases.

Note that if a prefix is longer than 2 bytes, then it cannot be in the proximity prefix databases. Also, proximities larger than 4 are not present in these databases either. Therefore, the impact on relevancy is:

  1. For common prefixes of one or two letters: we no longer distinguish between proximities from 4 to 8
  2. For common prefixes of more than two letters: we no longer distinguish between any proximities
  3. For uncommon prefixes: nothing changes
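The three-way dispatch described above can be sketched as a small Rust function. This is not the actual milli code: all names are hypothetical, and only the 2-byte prefix limit and the assumed proximity of 8 come from the PR description.

```rust
/// Prefixes longer than this can never be stored in the proximity prefix DBs.
const MAX_PREFIX_DB_PREFIX_LEN: usize = 2;

#[derive(Debug, PartialEq)]
enum PrefixStrategy {
    /// Case 1: uncommon prefix. List its word derivations through the FST
    /// and query the regular proximity databases.
    ExpandThroughFst,
    /// Case 2: common prefix, but the pair is absent from the proximity
    /// prefix DBs. Match the pair exactly (no derivations); otherwise
    /// assume a proximity of at least 8.
    PartiallyDisabled,
    /// Case 3: common prefix present in the proximity prefix DBs.
    QueryPrefixDatabases,
}

fn choose_strategy(
    prefix: &str,
    in_prefix_cache: bool,
    pair_in_prefix_proximity_db: bool,
) -> PrefixStrategy {
    if !in_prefix_cache {
        // Case 1: the prefix is not the prefix of many words.
        return PrefixStrategy::ExpandThroughFst;
    }
    // A prefix longer than 2 bytes can never be in the proximity prefix DBs,
    // so there is no point in querying them.
    if prefix.len() > MAX_PREFIX_DB_PREFIX_LEN || !pair_in_prefix_proximity_db {
        return PrefixStrategy::PartiallyDisabled;
    }
    PrefixStrategy::QueryPrefixDatabases
}
```

For example, a three-letter prefix such as "pro" always falls into case 2 when it is in the prefix cache, which is why all proximities collapse for common prefixes of more than two letters.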

Regarding (1), it means that these two documents would be considered equally relevant according to the proximity rule for the query heard pr (if pr is the prefix of more than 200 words in the dataset):

[
    { "text": "I heard there is a faster proximity criterion" },
    { "text": "I heard there is a faster but less relevant proximity criterion" }
]

Regarding (2), it means that two documents would be considered equally relevant according to the proximity rule for the query "faster pro":

[
    { "text": "I heard there is a faster but less relevant proximity criterion" },
    { "text": "I heard there is a faster proximity criterion" }
]

But the following document would be considered more relevant than the two documents above:

{ "text": "I heard there is a faster swimmer who is competing in the pro section of the competition" }

Note, however, that this change of behaviour only occurs when using the set-based version of the proximity criterion. In cases where there are fewer than 1000 candidate documents when the proximity criterion is called, this PR does not change anything.
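That last point can be summarized as a trivial predicate, assuming the 1000-candidate threshold stated above (the function name is made up for illustration):

```rust
// Illustrative only: the name is hypothetical; the 1000-candidate threshold
// comes from the PR description. Below it, the dynamic implementation is
// used and this PR changes nothing.
fn uses_set_based_proximity(candidate_count: usize) -> bool {
    candidate_count >= 1000
}
```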


Performance

I couldn't use the existing search benchmarks to measure the impact of the PR, but I did some manual tests with the songs benchmark dataset.

1. 10x 'a':
	- 640ms ⟹ 630ms                  = no significant difference
2. 10x 'b':
	- set-based: 4.47s ⟹ 7.42s       = bad, ~2x regression
	- dynamic: 1s ⟹ 870ms            = no significant difference
3. 'Someone I l':
	- set-based: 250ms ⟹ 12ms        = very good, ~20x speedup
	- dynamic: 21ms ⟹ 11ms           = good, ~2x speedup
4. 'billie e':
	- set-based: 623ms ⟹ 2ms         = very good, ~300x speedup
	- dynamic: ~4ms ⟹ 4ms            = no difference
5. 'billie ei':
	- set-based: 57ms ⟹ 20ms         = good, ~3x speedup
	- dynamic: ~4ms ⟹ ~2ms           = no significant difference
6. 'i am getting o':
	- set-based: 300ms ⟹ 60ms        = very good, 5x speedup
	- dynamic: 30ms ⟹ 6ms            = very good, 5x speedup
7. 'prologue 1 a 1':
	- set-based: 3.36s ⟹ 120ms       = very good, ~30x speedup
	- dynamic: 200ms ⟹ 30ms          = very good, ~6x speedup
8. 'prologue 1 a 10':
	- set-based: 590ms ⟹ 18ms        = very good, ~30x speedup
	- dynamic: 82ms ⟹ 35ms           = good, ~2x speedup

Performance is often significantly better, but there is also one regression in the set-based implementation with the query b b b b b b b b b b.

@loiclec loiclec marked this pull request as draft December 8, 2022 11:57
@loiclec loiclec added the labels no breaking (the related changes break neither the DB nor the API) and performance (related to performance in terms of search/indexing speed or RAM/CPU/disk consumption) Dec 8, 2022
@loiclec loiclec force-pushed the proximity-prefix-perf branch from 6fafc3e to f097aaf on December 22, 2022 11:10
@loiclec loiclec requested a review from Kerollmops January 2, 2023 13:02
@loiclec loiclec marked this pull request as ready for review January 2, 2023 13:02
@Kerollmops (Member) left a comment


The PR looks good to me. It is mergeable, and the performance gains are very good 🎉
Nice job on that!

However, the regression from 4s to 7s intrigues me. Why does 10x 'b' kill the engine? Were you able to identify the issue?

@loiclec (Contributor, Author) commented Jan 4, 2023

I am going to merge it now because we want a milli release today, but I'll start investigating the performance regression today :)
bors merge

@bors bot commented Jan 4, 2023

@bors bors bot merged commit c3f4835 into main Jan 4, 2023
@bors bors bot deleted the proximity-prefix-perf branch January 4, 2023 09:15
@loiclec (Contributor, Author) commented Jan 5, 2023

The performance regression was due to a mistake in the previous implementation, which only considered the rightward proximity between the left and right word in some cases.
