Optimize Prefixes and Merges #5124

Merged 6 commits on Dec 5, 2024

Member

Kerollmops commented Dec 4, 2024

In this PR, we optimize LMDB reads by reading the entries in lexicographic order, which makes better use of the OS cache for memory-mapped pages:

  • Optimize the prefix generation for word position docids (@ManyTheFish)
  • Optimize the parallel merging of the caches by sorting the entries before merging them (@Kerollmops)

Benchmarks on 1cpu 2gb gpo3 (5k IOps)

Before, on the tag meilisearch-v1.12.0-rc.3:

word_position_docids:merge_and_send_docids: 988s
compute_word_fst: 23.3s
word_pair_proximity_docids:merge_and_send_docids: 428s
compute_word_prefix_fid_docids:recompute_modified_prefixes: 76.3s
compute_word_prefix_position_docids:recompute_modified_prefixes:from_prefixes: 429s

After, on this branch, which sorts the entries of the whole HashMaps in a Vec:

word_position_docids:merge_and_send_docids: 202s
compute_word_fst: 20.4s
word_pair_proximity_docids:merge_and_send_docids: 427s
compute_word_prefix_fid_docids:recompute_modified_prefixes: 65.5s
compute_word_prefix_position_docids:recompute_modified_prefixes:from_prefixes: 62.5s
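
The idea of sorting the HashMap entries in a Vec before merging can be sketched as follows. This is a minimal illustration, not the code from this PR: the function name `sorted_merge` is hypothetical, and plain `Vec<u32>` lists stand in for the real docids structures. It only shows how entries from several caches can be flattened, sorted by key bytes, and merged so that the resulting writes and reads follow lexicographic key order.

```rust
use std::collections::HashMap;

/// Hypothetical helper: drains several caches into one Vec sorted by key,
/// merging the values of entries that share the same key.
/// `Vec<u32>` is a stand-in for the real docids representation.
pub fn sorted_merge(caches: Vec<HashMap<Vec<u8>, Vec<u32>>>) -> Vec<(Vec<u8>, Vec<u32>)> {
    // Flatten every cache into a single list of (key, ids) entries.
    let mut entries: Vec<(Vec<u8>, Vec<u32>)> = caches.into_iter().flatten().collect();

    // Sort by key bytes; comparing Vec<u8> values is lexicographic.
    entries.sort_unstable_by(|a, b| a.0.cmp(&b.0));

    // Equal keys are now adjacent, so merging is a single linear pass.
    let mut merged: Vec<(Vec<u8>, Vec<u32>)> = Vec::with_capacity(entries.len());
    for (key, mut ids) in entries {
        if let Some((last_key, last_ids)) = merged.last_mut() {
            if *last_key == key {
                last_ids.append(&mut ids);
                continue;
            }
        }
        merged.push((key, ids));
    }

    // Normalize each merged list: sorted and without duplicates.
    for (_, ids) in merged.iter_mut() {
        ids.sort_unstable();
        ids.dedup();
    }
    merged
}
```

A caller would then iterate over the returned Vec in order, so that every database lookup for an existing key touches neighbouring keys one after the other.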

@Kerollmops
Member Author

/bench workloads/hackernews-add-new-documents.json workloads/hackernews-modify-*

ManyTheFish previously approved these changes Dec 4, 2024
@Kerollmops
Member Author

/bench workloads/movies.json workloads/hackernews.json

Kerollmops added this to the v1.12.0 milestone Dec 4, 2024
@Kerollmops
Member Author

bors merge

Kerollmops marked this pull request as ready for review December 4, 2024 17:17
@Kerollmops
Member Author

bors merge

Contributor

meili-bors bot commented Dec 4, 2024

Already running a review

@Kerollmops
Member Author

Kerollmops commented Dec 4, 2024

I want to rename the merge_alt function and run more benchmarks on larger machines.

@Kerollmops
Member Author

bors cancel

Contributor

meili-bors bot commented Dec 4, 2024

Canceled.

Kerollmops marked this pull request as draft December 4, 2024 17:19
Member

ManyTheFish left a comment

bors merge

@Kerollmops
Member Author

bors merge

Contributor

meili-bors bot commented Dec 5, 2024

Already running a review

Contributor

meili-bors bot commented Dec 5, 2024

meili-bors bot merged commit cac355b into release-v1.12.0 Dec 5, 2024
10 checks passed
meili-bors bot deleted the optimize-prefixes-and-merges branch December 5, 2024 10:11
meili-bors bot added a commit that referenced this pull request Dec 5, 2024
5125: Change the default max memory usage to 5% of the total memory r=ManyTheFish a=Kerollmops

After thorough testing, we found that the best approach is to dedicate 5% of the total available memory to resident allocations (caches and channels).
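
As a rough illustration of the 5% rule, the default budget could be derived as in the sketch below. The function name `default_max_memory` and its `Option` override parameter are hypothetical and are not taken from the actual change; only the 5% ratio comes from the description above.

```rust
/// Hypothetical helper: picks the memory budget for resident allocations
/// (caches and channels). An explicit user setting wins; otherwise the
/// default is 5% of the total memory, leaving the rest to LMDB and the OS.
pub fn default_max_memory(user_setting: Option<u64>, total_memory_bytes: u64) -> u64 {
    // 5% of the total is the same as dividing by 20 (integer division).
    user_setting.unwrap_or(total_memory_bytes / 20)
}
```

On a machine with 2 GiB of memory, this default amounts to roughly 102 MiB.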

The main reason is that the new indexer relies heavily on memory mapping through LMDB and reads the database while indexing. Leaving as much memory as possible to LMDB and the OS keeps more pages hot in the cache, which speeds up key-value store reads and all other indexing operations. In #5124, we also sorted the entries to merge to improve LMDB read speed.

This is common in database management systems: reading from disk is much faster when done in lexicographic order (the default sort order of the keys). The entries are then very likely to already be in the OS memory cache, because they were loaded by a previous read, and reading from disk is very slow compared to reading from memory.
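
The effect of access order on cache hits can be shown with a toy model. This is not how LMDB or the OS page cache actually works: it assumes a hypothetical sorted store where entries are packed a fixed number per page, and a cache that keeps only the most recently loaded page. The function name `count_page_loads` is made up for this sketch.

```rust
/// Toy model: counts the page loads needed to read entries in the given
/// order. `access_order` holds the rank of each entry in sorted key order,
/// `entries_per_page` (must be non-zero) is how many consecutive entries
/// share a page, and the "cache" holds only the last loaded page.
pub fn count_page_loads(access_order: &[usize], entries_per_page: usize) -> usize {
    let mut cached_page: Option<usize> = None;
    let mut loads = 0;
    for &rank in access_order {
        // Consecutive ranks live on the same page.
        let page = rank / entries_per_page;
        if cached_page != Some(page) {
            // Cache miss: the page must be fetched again.
            loads += 1;
            cached_page = Some(page);
        }
    }
    loads
}
```

With eight entries and four entries per page, reading the ranks in sorted order (0 to 7) loads each page once, for two loads in total, whereas the interleaved order 0, 4, 1, 5, 2, 6, 3, 7 misses on every access, for eight loads.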

Co-authored-by: Kerollmops <clement@meilisearch.com>
meili-bot added the v1.12.0 label (PRs/issues solved in v1.12.0, released on 2024-12-23) Dec 23, 2024