This repository has been archived by the owner on Apr 4, 2023. It is now read-only.

Avoid a prefix-related worst-case scenario in the proximity criterion #733

Merged — 3 commits merged into main from proximity-prefix-perf on Jan 4, 2023

Conversation

@loiclec (Contributor) commented Dec 8, 2022

Pull Request

Related issue

Partially fixes meilisearch/meilisearch#3118 (it will be fully fixed once this change is merged into meilisearch)

What does this PR do?

When a query ends with a word and a prefix, such as:

word pr

Then we first determine whether pr could possibly be in the proximity prefix database before querying it. There are three possibilities:

  1. pr is not in any prefix cache because it is not the prefix of many words. We don't query the proximity prefix database. Instead, we list all the word derivations of pr through the FST and query the regular proximity databases.

  2. pr is in the prefix cache but cannot be found in the proximity prefix databases. In this case, we partially disable the proximity ranking rule for the pair word pr. This is done as follows:

    1. Only find the documents where word is in proximity to pr exactly (no derivations)
    2. Otherwise, assume that the proximity between the two words in all the documents in which they coexist is >= 8
  3. pr is in the prefix cache and can be found in the proximity prefix databases. In this case, we simply query the proximity prefix databases.

Note that if a prefix is longer than 2 bytes, then it cannot be in the proximity prefix databases. Also, proximities larger than 4 are not present in these databases either. Therefore, the impact on relevancy is:

  1. For common prefixes of one or two letters: we no longer distinguish between proximities from 4 to 8
  2. For common prefixes of more than two letters: we no longer distinguish between any proximities
  3. For uncommon prefixes: nothing changes
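The three-way dispatch described above can be sketched as a small Rust function. This is not the actual milli code: all names are hypothetical, and only the 2-byte prefix limit and the assumed proximity of 8 come from the PR description.

```rust
/// Prefixes longer than this can never be stored in the proximity prefix DBs.
const MAX_PREFIX_DB_PREFIX_LEN: usize = 2;

#[derive(Debug, PartialEq)]
enum PrefixStrategy {
    /// Case 1: uncommon prefix. List its word derivations through the FST
    /// and query the regular proximity databases.
    ExpandThroughFst,
    /// Case 2: common prefix, but the pair is absent from the proximity
    /// prefix DBs. Match the pair exactly (no derivations); otherwise
    /// assume a proximity of at least 8.
    PartiallyDisabled,
    /// Case 3: common prefix present in the proximity prefix DBs.
    QueryPrefixDatabases,
}

fn choose_strategy(
    prefix: &str,
    in_prefix_cache: bool,
    pair_in_prefix_proximity_db: bool,
) -> PrefixStrategy {
    if !in_prefix_cache {
        // Case 1: the prefix is not the prefix of many words.
        return PrefixStrategy::ExpandThroughFst;
    }
    // A prefix longer than 2 bytes can never be in the proximity prefix DBs,
    // so there is no point in querying them.
    if prefix.len() > MAX_PREFIX_DB_PREFIX_LEN || !pair_in_prefix_proximity_db {
        return PrefixStrategy::PartiallyDisabled;
    }
    PrefixStrategy::QueryPrefixDatabases
}
```

For example, a three-letter prefix such as "pro" always falls into case 2 when it is in the prefix cache, which is why all proximities collapse for common prefixes of more than two letters.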

Regarding (1), it means that these two documents would be considered equally relevant according to the proximity rule for the query heard pr (if pr is the prefix of more than 200 words in the dataset):

[
    { "text": "I heard there is a faster proximity criterion" },
    { "text": "I heard there is a faster but less relevant proximity criterion" }
]

Regarding (2), it means that two documents would be considered equally relevant according to the proximity rule for the query "faster pro":

[
    { "text": "I heard there is a faster but less relevant proximity criterion" },
    { "text": "I heard there is a faster proximity criterion" }
]

But the following document would be considered more relevant than the two documents above:

{ "text": "I heard there is a faster swimmer who is competing in the pro section of the competition" }

Note, however, that this change of behaviour only occurs when using the set-based version of the proximity criterion. In cases where there are fewer than 1000 candidate documents when the proximity criterion is called, this PR does not change anything.
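That last point can be summarized as a trivial predicate, assuming the 1000-candidate threshold stated above (the function name is made up for illustration):

```rust
// Illustrative only: the name is hypothetical; the 1000-candidate threshold
// comes from the PR description. Below it, the dynamic implementation is
// used and this PR changes nothing.
fn uses_set_based_proximity(candidate_count: usize) -> bool {
    candidate_count >= 1000
}
```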


Performance

I couldn't use the existing search benchmarks to measure the impact of the PR, but I did some manual tests with the songs benchmark dataset.

1. 10x 'a':
	- 640ms ⟹ 630ms                  = no significant difference
2. 10x 'b':
	- set-based: 4.47s ⟹ 7.42s       = bad, ~2x regression
	- dynamic: 1s ⟹ 870ms            = no significant difference
3. 'Someone I l':
	- set-based: 250ms ⟹ 12ms        = very good, ~20x speedup
	- dynamic: 21ms ⟹ 11ms           = good, ~2x speedup
4. 'billie e':
	- set-based: 623ms ⟹ 2ms         = very good, ~300x speedup
	- dynamic: ~4ms ⟹ 4ms            = no difference
5. 'billie ei':
	- set-based: 57ms ⟹ 20ms         = good, ~3x speedup
	- dynamic: ~4ms ⟹ ~2ms           = no significant difference
6. 'i am getting o':
	- set-based: 300ms ⟹ 60ms        = very good, 5x speedup
	- dynamic: 30ms ⟹ 6ms            = very good, 5x speedup
7. 'prologue 1 a 1':
	- set-based: 3.36s ⟹ 120ms       = very good, ~30x speedup
	- dynamic: 200ms ⟹ 30ms          = very good, ~6x speedup
8. 'prologue 1 a 10':
	- set-based: 590ms ⟹ 18ms        = very good, ~30x speedup
	- dynamic: 82ms ⟹ 35ms           = good, ~2x speedup

Performance is often significantly better, but there is also one regression in the set-based implementation with the query b b b b b b b b b b.

@loiclec loiclec marked this pull request as draft December 8, 2022 11:57
@loiclec loiclec added the labels no breaking (the related changes break neither the DB nor the API) and performance (related to performance in terms of search/indexing speed or RAM/CPU/disk consumption) Dec 8, 2022
@loiclec loiclec force-pushed the proximity-prefix-perf branch from 6fafc3e to f097aaf on December 22, 2022 11:10
@loiclec loiclec requested a review from Kerollmops January 2, 2023 13:02
@loiclec loiclec marked this pull request as ready for review January 2, 2023 13:02
@Kerollmops (Member) left a comment


The PR looks good to me. It is mergeable, and the performance gains are very good 🎉
Nice job on that!

However, the regression from 4s to 7s intrigues me. Why does 10x 'b' kill the engine? Were you able to identify the issue?

@loiclec (Contributor, Author) commented Jan 4, 2023

I am going to merge it now because we want a milli release today, but I'll start investigating the performance regression today :)
bors merge

@bors bot commented Jan 4, 2023

@bors bors bot merged commit c3f4835 into main Jan 4, 2023
@bors bors bot deleted the proximity-prefix-perf branch January 4, 2023 09:15
@loiclec (Contributor, Author) commented Jan 5, 2023

The performance regression was due to a mistake in the previous implementation, which only considered the rightward proximity between the left and right word in some cases.
