feat: Adds German compound words decomposition with new segmenter #303

luflow · 2024-08-09T22:49:19Z

Pull Request

What does this PR do?

Adds a first version of dictionary-based decomposition for German compound words (based on https://github.com/uschindler/german-decompounder/)
Adds a benchmark with German sentences

PR checklist

Please check if your PR fulfills the following requirements:

Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
Have you read the contributing guidelines?
Have you made sure that the title is accurate and descriptive of the changes?

luflow · 2024-08-09T22:52:32Z

I assume this could be a very expensive algorithm because all word lengths are checked against the dict?

Not sure if there is a better solution, but at least a first version for compound words :)
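To make the cost concern concrete, here is a minimal sketch of the kind of lookup described above: every prefix length of the word is tested against a `HashSet`. This is an illustration only, not the code from this PR; the `decompose` function and the tiny dictionary used with it are hypothetical.

```rust
use std::collections::HashSet;

/// Try to split `word` into dictionary entries by testing every prefix length,
/// longest first. Returns `None` when no full decomposition exists.
/// Hypothetical helper for illustration; not the PR's implementation.
fn decompose(word: &str, dict: &HashSet<&str>) -> Option<Vec<String>> {
    if word.is_empty() {
        return Some(Vec::new());
    }
    // Candidate split points are the char boundaries after the first char.
    let mut ends: Vec<usize> = word.char_indices().map(|(i, _)| i).skip(1).collect();
    ends.push(word.len());
    for &end in ends.iter().rev() {
        let (head, tail) = word.split_at(end);
        if dict.contains(head) {
            // Recurse on the remainder; backtrack if it cannot be decomposed.
            if let Some(mut rest) = decompose(tail, dict) {
                rest.insert(0, head.to_string());
                return Some(rest);
            }
        }
    }
    None
}
```

Each call hashes one substring per prefix length, and without memoization the backtracking can revisit the same suffix many times, which is where the expense comes from.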

luflow · 2024-08-10T12:52:22Z

Also another open question: can we even use the dictionary?

The original author has it under the GNU GPL:
https://github.com/uschindler/german-decompounder/blob/master/NOTICE.txt

luflow · 2024-08-12T12:10:20Z

@curquiza @ManyTheFish I fixed the fmt and clippy issues. Please rerun.

ManyTheFish

Hello @luflow,

Could you add a feature flag on your implementation as I suggested, please? Then add it as a default feature in the Cargo.toml file.

In terms of implementation, you chose to rely on a HashSet to split your words, but I don't think it's the best approach.
I highly suggest using an FstSegmenter, like in the Thai tokenizer: it's a bit complex to build but far more efficient in time and space. Alternatively, you could use an AhoCorasick automaton with the LeftmostLongest match kind.
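As a rough illustration of the leftmost-longest idea, the sketch below uses a small hand-rolled trie from the standard library as a stand-in for an FST or an AhoCorasick automaton. It only matches at the start of the remaining text, and the names `Trie`, `longest_prefix`, and `segment` are hypothetical, not charabia's API.

```rust
use std::collections::HashMap;

/// Minimal char trie standing in for an FST (illustrative only).
#[derive(Default)]
struct Trie {
    children: HashMap<char, Trie>,
    is_word: bool,
}

impl Trie {
    /// Build the trie from any iterator over `&str`.
    fn from_words<'a>(words: impl IntoIterator<Item = &'a str>) -> Self {
        let mut root = Trie::default();
        for w in words {
            let mut node = &mut root;
            for c in w.chars() {
                node = node.children.entry(c).or_default();
            }
            node.is_word = true;
        }
        root
    }

    /// Byte length of the longest dictionary word that is a prefix of `text`.
    fn longest_prefix(&self, text: &str) -> Option<usize> {
        let mut node = self;
        let mut best = None;
        for (i, c) in text.char_indices() {
            match node.children.get(&c) {
                Some(next) => node = next,
                None => break,
            }
            if node.is_word {
                best = Some(i + c.len_utf8());
            }
        }
        best
    }
}

/// Greedy leftmost-longest segmentation; text with no dictionary match
/// is emitted one char at a time.
fn segment<'a>(trie: &Trie, text: &'a str) -> Vec<&'a str> {
    let mut out = Vec::new();
    let mut rest = text;
    while !rest.is_empty() {
        let len = trie
            .longest_prefix(rest)
            .unwrap_or_else(|| rest.chars().next().map_or(0, char::len_utf8));
        let (head, tail) = rest.split_at(len);
        out.push(head);
        rest = tail;
    }
    out
}
```

Unlike the per-length `HashSet` lookup, each position is resolved in a single walk over the text. The trade-off is that the greedy choice never backtracks, so the longest entry wins even when a shorter one would allow a cleaner split later on.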

Sorry for the delays!
Let me know if you have a question

charabia/src/segmenter/mod.rs

luflow · 2024-08-27T08:28:09Z

Hi @ManyTheFish!

Do you have any instructions to build the fst file? I could not find any material online - especially because FST is also used in other contexts like R but does something totally different 🤣

Otherwise, the LeftmostLongest matching also works with a plain word dictionary, if I understand it correctly?

ManyTheFish · 2024-08-27T12:07:36Z

> Do you have any instructions to build the fst file? I could not find any material online - especially because FST is also used in other contexts like R but does something totally different 🤣

You can use the CLI fst-bin to build your dictionary from a source file. 😄

> Otherwise the leftmostmatch functionality also works with a word dictionary if i understand it correctly?

Yes, you can build it from an iterator over str, so it's convenient.

…and min lemma length definition

luflow · 2024-08-28T11:18:36Z

@ManyTheFish I extended the FstSegmenter with two options: a minimum lemma length, and a switch that prevents the segmenter from emitting single letters. That keeps my dictionary even smaller and may also be useful for other languages later.
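A minimal sketch of how two such options could behave, using a `HashSet` in place of the FST so that it stays self-contained. The `Options` struct, its field names, and the exact fallback semantics are assumptions made for illustration and may differ from the actual FstSegmenter changes.

```rust
use std::collections::HashSet;

/// Hypothetical options mirroring the two settings described above.
struct Options {
    /// Dictionary matches shorter than this many chars are ignored.
    min_lemma_len: usize,
    /// When false, unmatched text is kept whole instead of split into chars.
    allow_char_split: bool,
}

/// Greedy longest-prefix segmentation honouring both options (illustrative).
fn segment<'a>(text: &'a str, dict: &HashSet<&str>, opts: &Options) -> Vec<&'a str> {
    let mut out = Vec::new();
    let mut rest = text;
    while !rest.is_empty() {
        // (char count, byte offset) for every prefix of `rest`, shortest first.
        let ends = rest
            .char_indices()
            .map(|(i, c)| i + c.len_utf8())
            .enumerate()
            .map(|(n, end)| (n + 1, end));
        // Keep the longest prefix that is in the dictionary and long enough.
        let matched = ends
            .filter(|&(chars, end)| chars >= opts.min_lemma_len && dict.contains(&rest[..end]))
            .map(|(_, end)| end)
            .last();
        let len = match matched {
            Some(end) => end,
            // No match: emit a single char only if that is allowed...
            None if opts.allow_char_split => rest.chars().next().map_or(0, char::len_utf8),
            // ...otherwise emit the whole remainder unsplit.
            None => rest.len(),
        };
        let (head, tail) = rest.split_at(len);
        out.push(head);
        rest = tail;
    }
    out
}
```

With a minimum length in place, very short entries never match, so they can be left out of the dictionary altogether, which fits the remark about keeping the dictionary small.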

The dictionary is now also transformed into an FST file.

Let me know what you think :)

luflow · 2024-09-07T16:49:51Z

@ManyTheFish did you find time yet to look over the changes? Do you need anything else from my side? :)

ManyTheFish

Hello @luflow,
sorry for the delay, LGTM!

bors merge

303: feat: Adds German compound words decomposition with new segmenter r=ManyTheFish a=luflow

# Pull Request

## What does this PR do?
- Adds first version of decomposition for german compound words based on a dictionary (based on https://github.com/uschindler/german-decompounder/)
- Adds benchmark with german sentences

## PR checklist
Please check if your PR fulfills the following requirements:
- [X] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [X] Have you read the contributing guidelines?
- [X] Have you made sure that the title is accurate and descriptive of the changes?

Co-authored-by: Florian Ludwig <florian.ludwig@uninow.de>
Co-authored-by: Florian Ludwig <florian@krautnerds.de>

meili-bors · 2024-09-09T08:17:31Z

Build failed:

tests

charabia/src/segmenter/utils.rs

Co-authored-by: Many the fish <many@meilisearch.com>

luflow · 2024-09-09T09:54:17Z

@ManyTheFish ok applied suggestion :)

ManyTheFish · 2024-09-09T11:32:46Z

Hello @luflow,

the test and clippy are not happy,

could you ensure that:

cargo clippy
cargo test

work on your machine please?

I'll merge as soon as the tests pass 😃

luflow · 2024-09-09T14:36:01Z

@ManyTheFish done 👍🏻

ManyTheFish

Nice!

Thank you for the contribution!

bors merge

meili-bors · 2024-09-10T09:14:04Z

Build succeeded:

This PR contains the following updates: | Package | Update | Change | |---|---|---| | [getmeili/meilisearch](https://redirect.github.com/meilisearch/meilisearch) | minor | `v1.10.3` -> `v1.12.1` | --- > [!WARNING] > Some dependencies could not be looked up. Check the Dependency Dashboard for more information. --- ### Release Notes <details> <summary>meilisearch/meilisearch (getmeili/meilisearch)</summary> ### [`v1.12.1`](https://redirect.github.com/meilisearch/meilisearch/releases/tag/v1.12.1) [Compare Source](https://redirect.github.com/meilisearch/meilisearch/compare/v1.12.0...v1.12.1) #### Fixes There was a bug in the engine when adding an empty payload, it was making the batch fails. Fixed by [@irevoire](https://redirect.github.com/irevoire) in [https://github.com/meilisearch/meilisearch/pull/5192](https://redirect.github.com/meilisearch/meilisearch/pull/5192) **Full Changelog**: meilisearch/meilisearch@v1.12.0...v1.12.1 ### [`v1.12.0`](https://redirect.github.com/meilisearch/meilisearch/releases/tag/v1.12.0): 🦗 [Compare Source](https://redirect.github.com/meilisearch/meilisearch/compare/v1.11.3...v1.12.0) Meilisearch v1.12 introduces significant indexing speed improvements, almost halving the time required to index large datasets. This release also introduces new settings to customize and potentially further increase indexing speed. 🧰 All official Meilisearch integrations (including SDKs, clients, and other tools) are compatible with this Meilisearch release. Integration deployment happens between 4 to 48 hours after a new version becomes available. Some SDKs might not include all new features. Consult the project repository for detailed information. Is a feature you need missing from your chosen SDK? Create an issue letting us know you need it, or, for open-source karma points, open a PR implementing it (we'll love you for that ❤️). ### New features and updates 🔥 #### Improve indexing speed Indexing time is improved across the board! 
- Performance is maintained or better on smaller machines - On bigger machines with multiple cores and good IO, Meilisearch v1.12 is much faster than Meilisearch v1.11 - More than twice as fast for raw document insertion tasks. - More than x4 as fast for incrementally updating documents in a large database. - Embeddings generation was also improved up to x1.5 for some workloads. The new indexer also makes task cancellation faster. Done by [@dureuill](https://redirect.github.com/dureuill), [@ManyTheFish](https://redirect.github.com/ManyTheFish), and [@Kerollmops](https://redirect.github.com/Kerollmops) in [#4900](https://redirect.github.com/meilisearch/meilisearch/issues/4900). #### New index settings: use `facetSearch` and `prefixSearch` to improve indexing speed v1.12 introduces two new index settings: `facetSearch` and `prefixSearch`. Both settings allow you to skip parts of the indexing process. This leads to significant improvements to indexing speed, but may negatively impact search experience in some use cases. Done by [@ManyTheFish](https://redirect.github.com/ManyTheFish) in [#5091](https://redirect.github.com/meilisearch/meilisearch/issues/5091) ##### `facetSearch` Use this setting to toggle [facet search](https://www.meilisearch.com/docs/learn/filtering_and_sorting/search_with_facet_filters#searching-facet-values): ```bash curl \ -X PUT 'http://localhost:7700/indexes/books/settings/facet-search' \ -H 'Content-Type: application/json' \ --data-binary 'true' ``` The default value for `facetSearch` is `true`. When set to `false`, this setting disables facet search for all filterable attributes in an index. 
##### `prefixSearch` Use this setting to configure the ability to [search a word by prefix](https://www.meilisearch.com/docs/learn/engine/prefix) on an index: ```bash curl \ -X PUT 'http://localhost:7700/indexes/books/settings/prefix-search' \ -H 'Content-Type: application/json' \ --data-binary 'disabled' ``` `prefixSearch` accepts one of the following values: - `"indexingTime"`: enables prefix processing during indexing. This is the default Meilisearch behavior - `"disabled"`: deactivates prefix search completely Disabling prefix search means the query `he` will no longer match the word `hello`. This may significantly impact search result relevancy, but speeds up the indexing process. #### New API route: `/batches` The new `/batches` endpoint allow you to query information about task batches. `GET` `/batches` returns a list of batch objects: ```sh curl -X GET 'http://localhost:7700/batches' ``` This endpoint accepts the same parameters as `GET` `/tasks` route, allowing you to narrow down which batches you want to see. Parameters used with `GET` `/batches` apply to the tasks, not the batches themselves. For example, `GET /batches?uid=0` returns batches containing tasks with a `taskUid` of `0` , not batches with a `batchUid` of `0`. You may also query `GET` `/batches/:uid` to retrieve information about a single batch object: ```sh curl -X GET 'http://localhost:7700/batches/BATCH_UID' ``` `/batches/:uid` does not accept any parameters. 
Batch objects contain the following fields: ```json5 { "uid": 160, "progress": { "steps": [ { "currentStep": "processing tasks", "finished": 0, "total": 2 }, { "currentStep": "indexing", "finished": 2, "total": 3 }, { "currentStep": "extracting words", "finished": 3, "total": 13 }, { "currentStep": "document", "finished": 12300, "total": 19546 } ], "percentage": 37.986263 }, "details": { "receivedDocuments": 19547, "indexedDocuments": null }, "stats": { "totalNbTasks": 1, "status": { "processing": 1 }, "types": { "documentAdditionOrUpdate": 1 }, "indexUids": { "mieli": 1 } }, "duration": null, "startedAt": "2024-12-12T09:44:34.124726733Z", "finishedAt": null } ``` Additionally, task objects now include a new field, `batchUid`. Use this field together with `/batches/:uid` to retrieve data on a specific batch. ```json5 { "uid": 154, "batchUid": 142, "indexUid": "movies_test2", "status": "succeeded", "type": "documentAdditionOrUpdate", "canceledBy": null, "details": { "receivedDocuments": 1, "indexedDocuments": 1 }, "error": null, "duration": "PT0.027766819S", "enqueuedAt": "2024-12-02T14:07:34.974430765Z", "startedAt": "2024-12-02T14:07:34.99021667Z", "finishedAt": "2024-12-02T14:07:35.017983489Z" } ``` Done by [@irevoire](https://redirect.github.com/irevoire) in [#5060](https://redirect.github.com/meilisearch/meilisearch/issues/5060), [#5070](https://redirect.github.com/meilisearch/meilisearch/issues/5070), [#5080](https://redirect.github.com/meilisearch/meilisearch/issues/5080) #### Other improvements - New query parameter for `GET` `/tasks`: `reverse`. If `reverse` is set to `true`, tasks will be returned in reversed order, from oldest to newest tasks. 
Done by [@irevoire](https://redirect.github.com/irevoire) in [#5048](https://redirect.github.com/meilisearch/meilisearch/issues/5048) - Phrase searches with`showMatchesPosition` set to `true` give a single location for the whole phrase [@flevi29](https://redirect.github.com/flevi29) in [#4928](https://redirect.github.com/meilisearch/meilisearch/issues/4928) - New Prometheus metrics by [@PedroTurik](https://redirect.github.com/PedroTurik) in [#5044](https://redirect.github.com/meilisearch/meilisearch/issues/5044) - When a query finds matching terms in document fields with array values, Meilisearch now includes an `indices` field to `_matchesPosition` specifying which array elements contain the matches by [@LukasKalbertodt](https://redirect.github.com/LukasKalbertodt) in [#5005](https://redirect.github.com/meilisearch/meilisearch/issues/5005) - ⚠️ Breaking `vectorStore` change: field distribution no longer contains `_vectors`. Its value used to be incorrect, and there is no current use case for the fixed, most likely empty, value. 
Done as part of [#4900](https://redirect.github.com/meilisearch/meilisearch/issues/4900) - Improve error message by adding index name in [#5056](https://redirect.github.com/meilisearch/meilisearch/issues/5056) by [@airycanon](https://redirect.github.com/airycanon) ### Fixes 🐞 - Return appropriate error when primary key is greater than 512 bytes, by [@flevi29](https://redirect.github.com/flevi29) in [#4930](https://redirect.github.com/meilisearch/meilisearch/issues/4930) - Fix issue where numbers were segmented in different ways depending on tokenizer, by [@dqkqd](https://redirect.github.com/dqkqd) in [https://github.com/meilisearch/charabia/pull/311](https://redirect.github.com/meilisearch/charabia/pull/311) - Fix pagination when embedding fails by [@dureuill](https://redirect.github.com/dureuill) in [https://github.com/meilisearch/meilisearch/pull/5063](https://redirect.github.com/meilisearch/meilisearch/pull/5063) - Fix issue causing Meilisearch to ignore stop words in some cases by [@ManyTheFish](https://redirect.github.com/ManyTheFish) in [#5062](https://redirect.github.com/meilisearch/meilisearch/issues/5062) - Fix phrase search with `attributesToSearchOn` in [#5062](https://redirect.github.com/meilisearch/meilisearch/issues/5062) by [@ManyTheFish](https://redirect.github.com/ManyTheFish) ### Misc - Dependencies updates - Update benchmarks to match the new crates subfolder by [@Kerollmops](https://redirect.github.com/Kerollmops) in [#5021](https://redirect.github.com/meilisearch/meilisearch/issues/5021) - Fix the benchmarks by [@irevoire](https://redirect.github.com/irevoire) in [#5037](https://redirect.github.com/meilisearch/meilisearch/issues/5037) - Bump Swatinem/rust-cache from 2.7.1 to 2.7.5 in [#5030](https://redirect.github.com/meilisearch/meilisearch/issues/5030) - Update charabia v0.9.2 by [@ManyTheFish](https://redirect.github.com/ManyTheFish) in [#5098](https://redirect.github.com/meilisearch/meilisearch/issues/5098) - Update mini-dashboard to 
v0.2.16 version by [@curquiza](https://redirect.github.com/curquiza) in [#5102](https://redirect.github.com/meilisearch/meilisearch/issues/5102) - CIs and tests - Improve performance of `delete_index.rs` by [@DerTimonius](https://redirect.github.com/DerTimonius) in [#4963](https://redirect.github.com/meilisearch/meilisearch/issues/4963) - Improve performance of `create_index.rs` by [@DerTimonius](https://redirect.github.com/DerTimonius) in [#4962](https://redirect.github.com/meilisearch/meilisearch/issues/4962) - Improve performance of `get_documents.rs` by [@PedroTurik](https://redirect.github.com/PedroTurik) in [#5025](https://redirect.github.com/meilisearch/meilisearch/issues/5025) - Improve performance of `formatted.rs` by [@PedroTurik](https://redirect.github.com/PedroTurik) in [#5043](https://redirect.github.com/meilisearch/meilisearch/issues/5043) - Fix the path used in the flaky tests CI by [@Kerollmops](https://redirect.github.com/Kerollmops) in [#5049](https://redirect.github.com/meilisearch/meilisearch/issues/5049) - Misc - Rollback the Meilisearch Kawaii logo by [@Kerollmops](https://redirect.github.com/Kerollmops) in [#5017](https://redirect.github.com/meilisearch/meilisearch/issues/5017) - Add image source label to Dockerfile by [@wuast94](https://redirect.github.com/wuast94) in [#4990](https://redirect.github.com/meilisearch/meilisearch/issues/4990) - Hide code complexity into a subfolder by [@Kerollmops](https://redirect.github.com/Kerollmops) in [#5016](https://redirect.github.com/meilisearch/meilisearch/issues/5016) - Internal tool: implement offline upgrade from v1.10 to v1.11 by [@irevoire](https://redirect.github.com/irevoire) in [#5034](https://redirect.github.com/meilisearch/meilisearch/issues/5034) - Internal tool: implement offline upgrade from v1.11 to v1.12 by [@ManyTheFish](https://redirect.github.com/ManyTheFish) in [#5146](https://redirect.github.com/meilisearch/meilisearch/issues/5146) - Meilisearch is now able to retrieve Katakana 
words from a Hiragana query by [@tats-u](https://redirect.github.com/tats-u) in [https://github.com/meilisearch/charabia/pull/312](https://redirect.github.com/meilisearch/charabia/pull/312) - Improve error handling when writing into LMDB by [@Kerollmops](https://redirect.github.com/Kerollmops) in [https://github.com/meilisearch/meilisearch/pull/5089](https://redirect.github.com/meilisearch/meilisearch/pull/5089) ❤️ Thanks again to our external contributors: - [Meilisearch](https://redirect.github.com/meilisearch/meilisearch): [@airycanon](https://redirect.github.com/airycanon), [@DerTimonius](https://redirect.github.com/DerTimonius), [@flevi29](https://redirect.github.com/flevi29), [@LukasKalbertodt](https://redirect.github.com/LukasKalbertodt), [@PedroTurik](https://redirect.github.com/PedroTurik), [@wuast94](https://redirect.github.com/wuast94) - [Charabia](https://redirect.github.com/meilisearch/charabia): [@dqkqd](https://redirect.github.com/dqkqd) [@tats-u](https://redirect.github.com/tats-u) ### [`v1.11.3`](https://redirect.github.com/meilisearch/meilisearch/releases/tag/v1.11.3): 🐿️ [Compare Source](https://redirect.github.com/meilisearch/meilisearch/compare/v1.11.2...v1.11.3) #### What's Changed - For REST/OpenAI/ollama autoembedders users: Retry if deserialization of remote response failed by [@dureuill](https://redirect.github.com/dureuill) in [https://github.com/meilisearch/meilisearch/pull/5058](https://redirect.github.com/meilisearch/meilisearch/pull/5058) **Full Changelog**: meilisearch/meilisearch@v1.11.2...v1.11.3 ### [`v1.11.2`](https://redirect.github.com/meilisearch/meilisearch/releases/tag/v1.11.2): 🐿️ [Compare Source](https://redirect.github.com/meilisearch/meilisearch/compare/v1.11.1...v1.11.2) #### What's Changed - Add timeout on read and write operations. 
by [@dureuill](https://redirect.github.com/dureuill) in [https://github.com/meilisearch/meilisearch/pull/5051](https://redirect.github.com/meilisearch/meilisearch/pull/5051) **Full Changelog**: meilisearch/meilisearch@v1.11.1...v1.11.2 ### [`v1.11.1`](https://redirect.github.com/meilisearch/meilisearch/releases/tag/v1.11.1): 🐿️ [Compare Source](https://redirect.github.com/meilisearch/meilisearch/compare/v1.11.0...v1.11.1) #### What's Changed - Add 3s timeout to embedding requests made during search by [@dureuill](https://redirect.github.com/dureuill) in [https://github.com/meilisearch/meilisearch/pull/5039](https://redirect.github.com/meilisearch/meilisearch/pull/5039) **Full Changelog**: meilisearch/meilisearch@v1.11.0...v1.11.1 ### [`v1.11.0`](https://redirect.github.com/meilisearch/meilisearch/releases/tag/v1.11.0): 🐿️ [Compare Source](https://redirect.github.com/meilisearch/meilisearch/compare/v1.10.3...v1.11.0) Meilisearch v1.11 introduces AI-powered search performance improvements thanks to binary quantization and various usage changes, all of which are steps towards a future stabilization of the feature. We have also improved federated search usage following user feedback. 🧰 All official Meilisearch integrations (including SDKs, clients, and other tools) are compatible with this Meilisearch release. Integration deployment happens between 4 to 48 hours after a new version becomes available. Some SDKs might not include all new features. Consult the project repository for detailed information. Is a feature you need missing from your chosen SDK? Create an issue letting us know you need it, or, for open-source karma points, open a PR implementing it (we'll love you for that ❤️). ### New features and updates 🔥 #### Experimental - AI-powered search improvements This release is Meilisearch's first step towards stabilizing AI-powered search and introduces a few breaking changes to its API. 
[Consult the PRD for full usage details.](https://www.notion.so/meilisearch/v1-11-AI-search-changes-0e37727193884a70999f254fa953ce6e) Done by [@dureuill](https://redirect.github.com/dureuill) in [#4906](https://redirect.github.com/meilisearch/meilisearch/issues/4906), [#4920](https://redirect.github.com/meilisearch/meilisearch/issues/4920), [#4892](https://redirect.github.com/meilisearch/meilisearch/issues/4892), and [#4938](https://redirect.github.com/meilisearch/meilisearch/issues/4938). ##### ⚠️ Breaking changes - When performing AI-powered searches, `hybrid.embedder` is now a **mandatory** parameter in `GET` and `POST` `/indexes/{:indexUid}/search` - As a consequence, it is now **mandatory** to pass `hybrid` even for pure semantic searches - `embedder` is now a **mandatory** parameter in `GET` and `POST` `/indexes/{:indexUid}/similar` - Meilisearch now ignores `semanticRatio` and performs a pure semantic search for queries that include `vector` but not `q` ##### Addition & improvements - The default model for OpenAI is now `text-embedding-3-small` instead of `text-embedding-ada-002` - This release introduces a new embedder option: `documentTemplateMaxBytes`. Meilisearch will truncate a document's template text when it goes over the specified limit - Fields in `documentTemplate` include a new `field.is_searchable` property. The default document template now filters out both empty fields and fields not in the searchable attributes list: v1.11: {% for field in fields %} {% if field.is_searchable and not field.value == nil %} {{ field.name }}: {{ field.value }}\n {% endif %} {% endfor %} v1.10: {% for field in fields %} {{ field.name }}: {{ field.value }}\n {% endfor %} Embedders using the v1.10 document template will continue working as before. The new default document template will only work with newly created embedders. 
#### Vector database indexing performance improvements v1.11 introduces a new embedder option, `binaryQuantized`: ```bash curl \ -X PATCH 'http://localhost:7700/indexes/movies/settings' \ -H 'Content-Type: application/json' \ --data-binary '{ "embedders": { "image2text": { "binaryQuantized": true } } }' ``` Enable binary quantization to convert embeddings of floating point numbers into embeddings of boolean values. This will negatively impact the relevancy of AI-powered searches but significantly improve performance in large collections with more than 100 dimensions. In our benchmarks, this reduced the size of the database by a factor of 10 and divided the indexing time by a factor of 6 with little impact on search times. > \[!WARNING] > Enabling this feature will update all of your vectors to contain only `1`s or `-1`s, significantly impacting relevancy. > > **You cannot revert this option once you enable it**. Before setting `binaryQuantized` to `true`, Meilisearch recommends testing it in a smaller or duplicate index in a development environment. Done by [@irevoire](https://redirect.github.com/irevoire) in [#4941](https://redirect.github.com/meilisearch/meilisearch/issues/4941). #### Federated search improvements ##### Facet distribution and stats for federated searches This release adds two new federated search options, `facetsByIndex` and `mergeFacets`. These allow you to request a federated search for facet distributions and stats data. 
##### Facet information by index

To obtain facet distribution and stats for each separate index, use `facetsByIndex` when querying the `POST` `/multi-search` endpoint:

```json5
POST /multi-search
{
  "federation": {
    "limit": 20,
    "offset": 0,
    "facetsByIndex": {
      "movies": ["title", "id"],
      "comics": ["title"],
    }
  },
  "queries": [
    { "q": "Batman", "indexUid": "movies" },
    { "q": "Batman", "indexUid": "comics" }
  ]
}
```

The multi-search response will include a new field, `facetsByIndex`, with facet data separated per index:

```json5
{
  "hits": […],
  …
  "facetsByIndex": {
    "movies": {
      "distribution": {
        "title": { "Batman returns": 1 },
        "id": { "42": 1 }
      },
      "stats": {
        "id": { "min": 42, "max": 42 }
      }
    },
    …
  }
}
```

##### Merged facet information

To obtain facet distribution and stats for all indexes merged into a single result, use both `facetsByIndex` and `mergeFacets` when querying the `POST` `/multi-search` endpoint:

```json5
POST /multi-search
{
  "federation": {
    "limit": 20,
    "offset": 0,
    "facetsByIndex": {
      "movies": ["title", "id"],
      "comics": ["title"],
    },
    "mergeFacets": {
      "maxValuesPerFacet": 10,
    }
  },
  "queries": [
    { "q": "Batman", "indexUid": "movies" },
    { "q": "Batman", "indexUid": "comics" }
  ]
}
```

The response includes two new fields, `facetDistribution` and `facetStats`:

```json5
{
  "hits": […],
  …
  "facetDistribution": {
    "title": {
      "Batman returns": 1,
      "Batman: the killing joke":
    },
    "id": { "42": 1 }
  },
  "facetStats": {
    "id": { "min": 42, "max": 42 }
  }
}
```

Done by [@dureuill](https://redirect.github.com/dureuill) in [#4929](https://redirect.github.com/meilisearch/meilisearch/issues/4929).
#### Experimental — New `STARTS WITH` filter operator Enable the experimental feature to use the `STARTS WITH` filter operator: ```bash curl \ -X PATCH 'http://localhost:7700/experimental-features/' \ -H 'Content-Type: application/json' \ --data-binary '{ "containsFilter": true }' ``` Use the `STARTS WITH` operator when filtering: ```json5 curl \ -X POST http://localhost:7700/indexes/movies/search \ -H 'Content-Type: application/json' \ --data-binary '{ "filter": "hero STARTS WITH spider" }' ``` 🗣️ This is an experimental feature, and we need your help to improve it! Share your thoughts and feedback on this [GitHub discussion](https://redirect.github.com/orgs/meilisearch/discussions/763). Done by [@Kerollmops](https://redirect.github.com/Kerollmops) in [#4939](https://redirect.github.com/meilisearch/meilisearch/issues/4939). #### Other improvements - Language support and [localizedAttributes settings](https://www.meilisearch.com/docs/reference/api/settings#localized-attributes) by [@ManyTheFish](https://redirect.github.com/ManyTheFish) in [#4937](https://redirect.github.com/meilisearch/meilisearch/issues/4937) - Add ISO-639-1 variants - Convert ISO-639-1 into ISO-639-3 - Add a German language tokenizer by [@luflow](https://redirect.github.com/luflow) in [meilisearch/charabia#303](https://redirect.github.com/meilisearch/charabia/issues/303) and in [#4945](https://redirect.github.com/meilisearch/meilisearch/issues/4945) - Improve Turkish language support by [@tkhshtsh0917](https://redirect.github.com/tkhshtsh0917) in [meilisearch/charabia#305](https://redirect.github.com/meilisearch/charabia/issues/305) and in [#4957](https://redirect.github.com/meilisearch/meilisearch/issues/4957) - Upgrade "batch failed" log to error level in [#4955](https://redirect.github.com/meilisearch/meilisearch/issues/4955) by [@dureuill](https://redirect.github.com/dureuill). 
- Update the search UI: remove the forced capitalized fields, by [@curquiza](https://redirect.github.com/curquiza) in [#4993](https://redirect.github.com/meilisearch/meilisearch/issues/4993) ### Fixes 🐞 - ⚠️ When using federated search, `query.facets` was silently ignored at the query level, but should not have been. It now returns the appropriate error. Use `federation.facetsByIndex` instead if you want facets to be applied during federated search. - Prometheus `/metrics` return the route pattern instead of the real route when returning the HTTP requests total by [@irevoire](https://redirect.github.com/irevoire) in [#4839](https://redirect.github.com/meilisearch/meilisearch/issues/4839) - Truncate values at the end of a list of facet values when the number of facet values is larger than `maxValuesPerFacet`. For example, setting `maxValuesPerFacet` to `2` could result in `["blue", "red", "yellow"]`, being truncated to `["blue", "yellow"]` instead of \["blue", "red"]\`. By [@dureuill](https://redirect.github.com/dureuill) in [#4929](https://redirect.github.com/meilisearch/meilisearch/issues/4929) - Improve the task cancellation when vectors are used, by [@irevoire](https://redirect.github.com/irevoire) in [#4971](https://redirect.github.com/meilisearch/meilisearch/issues/4971) - Swedish support: the characters `å`, `ä`, `ö` are no longer normalized to `a` and `o`. 
By [@ManyTheFish](https://redirect.github.com/ManyTheFish) in [#4945](https://redirect.github.com/meilisearch/meilisearch/issues/4945) - Update rhai to fix an internal error when [updating documents with a function](https://redirect.github.com/orgs/meilisearch/discussions/762) (experimental) by [@irevoire](https://redirect.github.com/irevoire) in [#4960](https://redirect.github.com/meilisearch/meilisearch/issues/4960) - Fix the bad experimental search queue size by [@irevoire](https://redirect.github.com/irevoire) in [#4992](https://redirect.github.com/meilisearch/meilisearch/issues/4992) - Do not send empty edit document by function by [@irevoire](https://redirect.github.com/irevoire) in [#5001](https://redirect.github.com/meilisearch/meilisearch/issues/5001) - Display vectors when no custom vectors were ever provided by [@dureuill](https://redirect.github.com/dureuill) in [#5008](https://redirect.github.com/meilisearch/meilisearch/issues/5008) ### Misc - Dependencies updates - Security dependency upgrade: bump quinn-proto from 0.11.3 to 0.11.8 by [@dependabot](https://redirect.github.com/dependabot) in [#4911](https://redirect.github.com/meilisearch/meilisearch/issues/4911) - CIs and tests - Make the tests run faster by [@irevoire](https://redirect.github.com/irevoire) in [#4808](https://redirect.github.com/meilisearch/meilisearch/issues/4808) - Documentation - Fix broken links in README by [@iornstein](https://redirect.github.com/iornstein) in [#4943](https://redirect.github.com/meilisearch/meilisearch/issues/4943) - Misc - Allow Meilitool to upgrade from v1.9 to v1.10 without a dump in some conditions, by [@dureuill](https://redirect.github.com/dureuill) in [#4912](https://redirect.github.com/meilisearch/meilisearch/issues/4912) - Fix bench by adding embedder by [@dureuill](https://redirect.github.com/dureuill) in [#4954](https://redirect.github.com/meilisearch/meilisearch/issues/4954) - Revamp analytics by [@irevoire](https://redirect.github.com/irevoire) in 
[#5011](https://redirect.github.com/meilisearch/meilisearch/issues/5011) ❤️ Thanks again to our external contributors: - [Meilisearch](https://redirect.github.com/meilisearch/meilisearchg): [@iornstein](https://redirect.github.com/iornstein). - [Charabia](https://redirect.github.com/meilisearch/charabia): [@luflow](https://redirect.github.com/luflow), [@tkhshtsh0917](https://redirect.github.com/tkhshtsh0917). </details> --- ### Configuration 📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined). 🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied. ♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox. 🔕 **Ignore**: Close this PR and you won't be reminded about this update again. --- - [ ] If you want to rebase/retry this PR, check this box --- This PR has been generated by [Renovate Bot](https://redirect.github.com/renovatebot/renovate).

This PR contains the following updates: | Package | Update | Change | |---|---|---| | [getmeili/meilisearch](https://togithub.com/meilisearch/meilisearch) | minor | `v1.1.0` -> `v1.12.1` | --- > [!WARNING] > Some dependencies could not be looked up. Check the Dependency Dashboard for more information. --- ### Release Notes <details> <summary>meilisearch/meilisearch (getmeili/meilisearch)</summary> ### [`v1.12.1`](https://togithub.com/meilisearch/meilisearch/releases/tag/v1.12.1) [Compare Source](https://togithub.com/meilisearch/meilisearch/compare/v1.12.0...v1.12.1) #### Fixes There was a bug in the engine when adding an empty payload, it was making the batch fails. Fixed by [@irevoire](https://togithub.com/irevoire) in [https://github.com/meilisearch/meilisearch/pull/5192](https://togithub.com/meilisearch/meilisearch/pull/5192) **Full Changelog**: https://github.com/meilisearch/meilisearch/compare/v1.12.0...v1.12.1 ### [`v1.12.0`](https://togithub.com/meilisearch/meilisearch/releases/tag/v1.12.0): 🦗 [Compare Source](https://togithub.com/meilisearch/meilisearch/compare/v1.11.3...v1.12.0) Meilisearch v1.12 introduces significant indexing speed improvements, almost halving the time required to index large datasets. This release also introduces new settings to customize and potentially further increase indexing speed. 🧰 All official Meilisearch integrations (including SDKs, clients, and other tools) are compatible with this Meilisearch release. Integration deployment happens between 4 to 48 hours after a new version becomes available. Some SDKs might not include all new features. Consult the project repository for detailed information. Is a feature you need missing from your chosen SDK? Create an issue letting us know you need it, or, for open-source karma points, open a PR implementing it (we'll love you for that ❤️). ### New features and updates 🔥 #### Improve indexing speed Indexing time is improved across the board! 
- Performance is maintained or better on smaller machines - On bigger machines with multiple cores and good IO, Meilisearch v1.12 is much faster than Meilisearch v1.11 - More than twice as fast for raw document insertion tasks. - More than x4 as fast for incrementally updating documents in a large database. - Embeddings generation was also improved up to x1.5 for some workloads. The new indexer also makes task cancellation faster. Done by [@dureuill](https://togithub.com/dureuill), [@ManyTheFish](https://togithub.com/ManyTheFish), and [@Kerollmops](https://togithub.com/Kerollmops) in [#4900](https://togithub.com/meilisearch/meilisearch/issues/4900). #### New index settings: use `facetSearch` and `prefixSearch` to improve indexing speed v1.12 introduces two new index settings: `facetSearch` and `prefixSearch`. Both settings allow you to skip parts of the indexing process. This leads to significant improvements to indexing speed, but may negatively impact search experience in some use cases. Done by [@ManyTheFish](https://togithub.com/ManyTheFish) in [#5091](https://togithub.com/meilisearch/meilisearch/issues/5091) ##### `facetSearch` Use this setting to toggle [facet search](https://www.meilisearch.com/docs/learn/filtering_and_sorting/search_with_facet_filters#searching-facet-values): ```bash curl \ -X PUT 'http://localhost:7700/indexes/books/settings/facet-search' \ -H 'Content-Type: application/json' \ --data-binary 'true' ``` The default value for `facetSearch` is `true`. When set to `false`, this setting disables facet search for all filterable attributes in an index. 
##### `prefixSearch`

Use this setting to configure the ability to [search a word by prefix](https://www.meilisearch.com/docs/learn/engine/prefix) on an index:

```bash
curl \
  -X PUT 'http://localhost:7700/indexes/books/settings/prefix-search' \
  -H 'Content-Type: application/json' \
  --data-binary '"disabled"'
```

`prefixSearch` accepts one of the following values:

- `"indexingTime"`: enables prefix processing during indexing. This is the default Meilisearch behavior
- `"disabled"`: deactivates prefix search completely

Disabling prefix search means the query `he` will no longer match the word `hello`. This may significantly impact search result relevancy, but speeds up the indexing process.

#### New API route: `/batches`

The new `/batches` endpoint allows you to query information about task batches.

`GET` `/batches` returns a list of batch objects:

```sh
curl -X GET 'http://localhost:7700/batches'
```

This endpoint accepts the same parameters as the `GET` `/tasks` route, allowing you to narrow down which batches you want to see. Parameters used with `GET` `/batches` apply to the tasks, not the batches themselves. For example, `GET /batches?uid=0` returns batches containing tasks with a `taskUid` of `0`, not batches with a `batchUid` of `0`.

You may also query `GET` `/batches/:uid` to retrieve information about a single batch object:

```sh
curl -X GET 'http://localhost:7700/batches/BATCH_UID'
```

`/batches/:uid` does not accept any parameters.
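The rule that `/batches` parameters apply to tasks rather than to batches is easy to misread, so here is a minimal sketch of it using an in-memory model. This is not Meilisearch's implementation: the `taskUids` field, the sample data, and the function name are hypothetical and exist only for illustration.

```python
# Hypothetical in-memory model: each batch records the uids of the tasks it contains.
# Real batch objects do not necessarily expose a "taskUids" field.
BATCHES = [
    {"uid": 0, "taskUids": [3, 4]},
    {"uid": 1, "taskUids": [0, 1, 2]},
    {"uid": 2, "taskUids": [5]},
]


def batches_matching_task_uids(batches, task_uids):
    """Return the batches that contain at least one of the given task uids."""
    wanted = set(task_uids)
    return [batch for batch in batches if wanted.intersection(batch["taskUids"])]


# Mirrors the idea behind `GET /batches?uid=0`: the filter selects by task uid,
# so the batch whose own uid is 0 is not the one returned here.
selected = batches_matching_task_uids(BATCHES, [0])
print([batch["uid"] for batch in selected])
```

In this model, filtering on task uid `0` selects the batch with uid `1`, because that is the batch containing task `0`.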
Batch objects contain the following fields: ```json5 { "uid": 160, "progress": { "steps": [ { "currentStep": "processing tasks", "finished": 0, "total": 2 }, { "currentStep": "indexing", "finished": 2, "total": 3 }, { "currentStep": "extracting words", "finished": 3, "total": 13 }, { "currentStep": "document", "finished": 12300, "total": 19546 } ], "percentage": 37.986263 }, "details": { "receivedDocuments": 19547, "indexedDocuments": null }, "stats": { "totalNbTasks": 1, "status": { "processing": 1 }, "types": { "documentAdditionOrUpdate": 1 }, "indexUids": { "mieli": 1 } }, "duration": null, "startedAt": "2024-12-12T09:44:34.124726733Z", "finishedAt": null } ``` Additionally, task objects now include a new field, `batchUid`. Use this field together with `/batches/:uid` to retrieve data on a specific batch. ```json5 { "uid": 154, "batchUid": 142, "indexUid": "movies_test2", "status": "succeeded", "type": "documentAdditionOrUpdate", "canceledBy": null, "details": { "receivedDocuments": 1, "indexedDocuments": 1 }, "error": null, "duration": "PT0.027766819S", "enqueuedAt": "2024-12-02T14:07:34.974430765Z", "startedAt": "2024-12-02T14:07:34.99021667Z", "finishedAt": "2024-12-02T14:07:35.017983489Z" } ``` Done by [@irevoire](https://togithub.com/irevoire) in [#5060](https://togithub.com/meilisearch/meilisearch/issues/5060), [#5070](https://togithub.com/meilisearch/meilisearch/issues/5070), [#5080](https://togithub.com/meilisearch/meilisearch/issues/5080) #### Other improvements - New query parameter for `GET` `/tasks`: `reverse`. If `reverse` is set to `true`, tasks will be returned in reversed order, from oldest to newest tasks. 
Done by [@irevoire](https://togithub.com/irevoire) in [#5048](https://togithub.com/meilisearch/meilisearch/issues/5048) - Phrase searches with`showMatchesPosition` set to `true` give a single location for the whole phrase [@flevi29](https://togithub.com/flevi29) in [#4928](https://togithub.com/meilisearch/meilisearch/issues/4928) - New Prometheus metrics by [@PedroTurik](https://togithub.com/PedroTurik) in [#5044](https://togithub.com/meilisearch/meilisearch/issues/5044) - When a query finds matching terms in document fields with array values, Meilisearch now includes an `indices` field to `_matchesPosition` specifying which array elements contain the matches by [@LukasKalbertodt](https://togithub.com/LukasKalbertodt) in [#5005](https://togithub.com/meilisearch/meilisearch/issues/5005) - ⚠️ Breaking `vectorStore` change: field distribution no longer contains `_vectors`. Its value used to be incorrect, and there is no current use case for the fixed, most likely empty, value. Done as part of [#4900](https://togithub.com/meilisearch/meilisearch/issues/4900) - Improve error message by adding index name in [#5056](https://togithub.com/meilisearch/meilisearch/issues/5056) by [@airycanon](https://togithub.com/airycanon) ### Fixes 🐞 - Return appropriate error when primary key is greater than 512 bytes, by [@flevi29](https://togithub.com/flevi29) in [#4930](https://togithub.com/meilisearch/meilisearch/issues/4930) - Fix issue where numbers were segmented in different ways depending on tokenizer, by [@dqkqd](https://togithub.com/dqkqd) in [https://github.com/meilisearch/charabia/pull/311](https://togithub.com/meilisearch/charabia/pull/311) - Fix pagination when embedding fails by [@dureuill](https://togithub.com/dureuill) in [https://github.com/meilisearch/meilisearch/pull/5063](https://togithub.com/meilisearch/meilisearch/pull/5063) - Fix issue causing Meilisearch to ignore stop words in some cases by [@ManyTheFish](https://togithub.com/ManyTheFish) in 
[#5062](https://togithub.com/meilisearch/meilisearch/issues/5062) - Fix phrase search with `attributesToSearchOn` in [#5062](https://togithub.com/meilisearch/meilisearch/issues/5062) by [@ManyTheFish](https://togithub.com/ManyTheFish) ### Misc - Dependencies updates - Update benchmarks to match the new crates subfolder by [@Kerollmops](https://togithub.com/Kerollmops) in [#5021](https://togithub.com/meilisearch/meilisearch/issues/5021) - Fix the benchmarks by [@irevoire](https://togithub.com/irevoire) in [#5037](https://togithub.com/meilisearch/meilisearch/issues/5037) - Bump Swatinem/rust-cache from 2.7.1 to 2.7.5 in [#5030](https://togithub.com/meilisearch/meilisearch/issues/5030) - Update charabia v0.9.2 by [@ManyTheFish](https://togithub.com/ManyTheFish) in [#5098](https://togithub.com/meilisearch/meilisearch/issues/5098) - Update mini-dashboard to v0.2.16 version by [@curquiza](https://togithub.com/curquiza) in [#5102](https://togithub.com/meilisearch/meilisearch/issues/5102) - CIs and tests - Improve performance of `delete_index.rs` by [@DerTimonius](https://togithub.com/DerTimonius) in [#4963](https://togithub.com/meilisearch/meilisearch/issues/4963) - Improve performance of `create_index.rs` by [@DerTimonius](https://togithub.com/DerTimonius) in [#4962](https://togithub.com/meilisearch/meilisearch/issues/4962) - Improve performance of `get_documents.rs` by [@PedroTurik](https://togithub.com/PedroTurik) in [#5025](https://togithub.com/meilisearch/meilisearch/issues/5025) - Improve performance of `formatted.rs` by [@PedroTurik](https://togithub.com/PedroTurik) in [#5043](https://togithub.com/meilisearch/meilisearch/issues/5043) - Fix the path used in the flaky tests CI by [@Kerollmops](https://togithub.com/Kerollmops) in [#5049](https://togithub.com/meilisearch/meilisearch/issues/5049) - Misc - Rollback the Meilisearch Kawaii logo by [@Kerollmops](https://togithub.com/Kerollmops) in [#5017](https://togithub.com/meilisearch/meilisearch/issues/5017) - Add image 
source label to Dockerfile by [@wuast94](https://togithub.com/wuast94) in [#4990](https://togithub.com/meilisearch/meilisearch/issues/4990) - Hide code complexity into a subfolder by [@Kerollmops](https://togithub.com/Kerollmops) in [#5016](https://togithub.com/meilisearch/meilisearch/issues/5016) - Internal tool: implement offline upgrade from v1.10 to v1.11 by [@irevoire](https://togithub.com/irevoire) in [#5034](https://togithub.com/meilisearch/meilisearch/issues/5034) - Internal tool: implement offline upgrade from v1.11 to v1.12 by [@ManyTheFish](https://togithub.com/ManyTheFish) in [#5146](https://togithub.com/meilisearch/meilisearch/issues/5146) - Meilisearch is now able to retrieve Katakana words from a Hiragana query by [@tats-u](https://togithub.com/tats-u) in [https://github.com/meilisearch/charabia/pull/312](https://togithub.com/meilisearch/charabia/pull/312) - Improve error handling when writing into LMDB by [@Kerollmops](https://togithub.com/Kerollmops) in [https://github.com/meilisearch/meilisearch/pull/5089](https://togithub.com/meilisearch/meilisearch/pull/5089) ❤️ Thanks again to our external contributors: - [Meilisearch](https://togithub.com/meilisearch/meilisearch): [@airycanon](https://togithub.com/airycanon), [@DerTimonius](https://togithub.com/DerTimonius), [@flevi29](https://togithub.com/flevi29), [@LukasKalbertodt](https://togithub.com/LukasKalbertodt), [@PedroTurik](https://togithub.com/PedroTurik), [@wuast94](https://togithub.com/wuast94) - [Charabia](https://togithub.com/meilisearch/charabia): [@dqkqd](https://togithub.com/dqkqd) [@tats-u](https://togithub.com/tats-u) ### [`v1.11.3`](https://togithub.com/meilisearch/meilisearch/releases/tag/v1.11.3): 🐿️ [Compare Source](https://togithub.com/meilisearch/meilisearch/compare/v1.11.2...v1.11.3) #### What's Changed - For REST/OpenAI/ollama autoembedders users: Retry if deserialization of remote response failed by [@dureuill](https://togithub.com/dureuill) in 
[https://github.com/meilisearch/meilisearch/pull/5058](https://togithub.com/meilisearch/meilisearch/pull/5058) **Full Changelog**: https://github.com/meilisearch/meilisearch/compare/v1.11.2...v1.11.3 ### [`v1.11.2`](https://togithub.com/meilisearch/meilisearch/releases/tag/v1.11.2): 🐿️ [Compare Source](https://togithub.com/meilisearch/meilisearch/compare/v1.11.1...v1.11.2) #### What's Changed - Add timeout on read and write operations. by [@dureuill](https://togithub.com/dureuill) in [https://github.com/meilisearch/meilisearch/pull/5051](https://togithub.com/meilisearch/meilisearch/pull/5051) **Full Changelog**: https://github.com/meilisearch/meilisearch/compare/v1.11.1...v1.11.2 ### [`v1.11.1`](https://togithub.com/meilisearch/meilisearch/releases/tag/v1.11.1): 🐿️ [Compare Source](https://togithub.com/meilisearch/meilisearch/compare/v1.11.0...v1.11.1) #### What's Changed - Add 3s timeout to embedding requests made during search by [@dureuill](https://togithub.com/dureuill) in [https://github.com/meilisearch/meilisearch/pull/5039](https://togithub.com/meilisearch/meilisearch/pull/5039) **Full Changelog**: https://github.com/meilisearch/meilisearch/compare/v1.11.0...v1.11.1 ### [`v1.11.0`](https://togithub.com/meilisearch/meilisearch/releases/tag/v1.11.0): 🐿️ [Compare Source](https://togithub.com/meilisearch/meilisearch/compare/v1.10.3...v1.11.0) Meilisearch v1.11 introduces AI-powered search performance improvements thanks to binary quantization and various usage changes, all of which are steps towards a future stabilization of the feature. We have also improved federated search usage following user feedback. 🧰 All official Meilisearch integrations (including SDKs, clients, and other tools) are compatible with this Meilisearch release. Integration deployment happens between 4 to 48 hours after a new version becomes available. Some SDKs might not include all new features. Consult the project repository for detailed information. 
Is a feature you need missing from your chosen SDK? Create an issue letting us know you need it, or, for open-source karma points, open a PR implementing it (we'll love you for that ❤️). ### New features and updates 🔥 #### Experimental - AI-powered search improvements This release is Meilisearch's first step towards stabilizing AI-powered search and introduces a few breaking changes to its API. [Consult the PRD for full usage details.](https://www.notion.so/meilisearch/v1-11-AI-search-changes-0e37727193884a70999f254fa953ce6e) Done by [@dureuill](https://togithub.com/dureuill) in [#4906](https://togithub.com/meilisearch/meilisearch/issues/4906), [#4920](https://togithub.com/meilisearch/meilisearch/issues/4920), [#4892](https://togithub.com/meilisearch/meilisearch/issues/4892), and [#4938](https://togithub.com/meilisearch/meilisearch/issues/4938). ##### ⚠️ Breaking changes - When performing AI-powered searches, `hybrid.embedder` is now a **mandatory** parameter in `GET` and `POST` `/indexes/{:indexUid}/search` - As a consequence, it is now **mandatory** to pass `hybrid` even for pure semantic searches - `embedder` is now a **mandatory** parameter in `GET` and `POST` `/indexes/{:indexUid}/similar` - Meilisearch now ignores `semanticRatio` and performs a pure semantic search for queries that include `vector` but not `q` ##### Addition & improvements - The default model for OpenAI is now `text-embedding-3-small` instead of `text-embedding-ada-002` - This release introduces a new embedder option: `documentTemplateMaxBytes`. Meilisearch will truncate a document's template text when it goes over the specified limit - Fields in `documentTemplate` include a new `field.is_searchable` property. 
The default document template now filters out both empty fields and fields not in the searchable attributes list: v1.11: {% for field in fields %} {% if field.is_searchable and not field.value == nil %} {{ field.name }}: {{ field.value }}\n {% endif %} {% endfor %} v1.10: {% for field in fields %} {{ field.name }}: {{ field.value }}\n {% endfor %} Embedders using the v1.10 document template will continue working as before. The new default document template will only work with newly created embedders. #### Vector database indexing performance improvements v1.11 introduces a new embedder option, `binaryQuantized`: ```bash curl \ -X PATCH 'http://localhost:7700/indexes/movies/settings' \ -H 'Content-Type: application/json' \ --data-binary '{ "embedders": { "image2text": { "binaryQuantized": true } } }' ``` Enable binary quantization to convert embeddings of floating point numbers into embeddings of boolean values. This will negatively impact the relevancy of AI-powered searches but significantly improve performance in large collections with more than 100 dimensions. In our benchmarks, this reduced the size of the database by a factor of 10 and divided the indexing time by a factor of 6 with little impact on search times. > \[!WARNING] > Enabling this feature will update all of your vectors to contain only `1`s or `-1`s, significantly impacting relevancy. > > **You cannot revert this option once you enable it**. Before setting `binaryQuantized` to `true`, Meilisearch recommends testing it in a smaller or duplicate index in a development environment. Done by [@irevoire](https://togithub.com/irevoire) in [#4941](https://togithub.com/meilisearch/meilisearch/issues/4941). #### Federated search improvements ##### Facet distribution and stats for federated searches This release adds two new federated search options, `facetsByIndex` and `mergeFacets`. These allow you to request a federated search for facet distributions and stats data. 
##### Facet information by index

To obtain facet distribution and stats for each separate index, use `facetsByIndex` when querying the `POST` `/multi-search` endpoint:

```json5
POST /multi-search
{
  "federation": {
    "limit": 20,
    "offset": 0,
    "facetsByIndex": {
      "movies": ["title", "id"],
      "comics": ["title"],
    }
  },
  "queries": [
    { "q": "Batman", "indexUid": "movies" },
    { "q": "Batman", "indexUid": "comics" }
  ]
}
```

The multi-search response will include a new field, `facetsByIndex`, with facet data separated per index:

```json5
{
  "hits": […],
  …
  "facetsByIndex": {
    "movies": {
      "distribution": {
        "title": { "Batman returns": 1 },
        "id": { "42": 1 }
      },
      "stats": {
        "id": { "min": 42, "max": 42 }
      }
    },
    …
  }
}
```

##### Merged facet information

To obtain facet distribution and stats for all indexes merged into a single result, use both `facetsByIndex` and `mergeFacets` when querying the `POST` `/multi-search` endpoint:

```json5
POST /multi-search
{
  "federation": {
    "limit": 20,
    "offset": 0,
    "facetsByIndex": {
      "movies": ["title", "id"],
      "comics": ["title"],
    },
    "mergeFacets": {
      "maxValuesPerFacet": 10,
    }
  },
  "queries": [
    { "q": "Batman", "indexUid": "movies" },
    { "q": "Batman", "indexUid": "comics" }
  ]
}
```

The response includes two new fields, `facetDistribution` and `facetStats`:

```json5
{
  "hits": […],
  …
  "facetDistribution": {
    "title": {
      "Batman returns": 1,
      "Batman: the killing joke": …
    },
    "id": { "42": 1 }
  },
  "facetStats": {
    "id": { "min": 42, "max": 42 }
  }
}
```

Done by [@dureuill](https://togithub.com/dureuill) in [#4929](https://togithub.com/meilisearch/meilisearch/issues/4929).
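To make the difference between per-index and merged facets concrete, the sketch below folds a `facetsByIndex`-shaped structure into a single distribution. It is an illustration under stated assumptions, not the engine's algorithm: summing counts across indexes, ordering values by name before truncating to `maxValuesPerFacet`, the function name, and the sample counts are all assumptions made for this example.

```python
from collections import Counter


def merge_facet_distributions(facets_by_index, max_values_per_facet):
    """Merge per-index facet distributions into one distribution per facet.

    Assumption: counts for the same facet value are summed across indexes,
    then values are ordered by name and truncated at the end of the list.
    """
    merged = {}
    for index_facets in facets_by_index.values():
        for facet, counts in index_facets["distribution"].items():
            merged.setdefault(facet, Counter()).update(counts)
    return {
        facet: dict(sorted(counts.items())[:max_values_per_facet])
        for facet, counts in merged.items()
    }


# Made-up sample data shaped like the `facetsByIndex` response field.
facets_by_index = {
    "movies": {"distribution": {"title": {"Batman returns": 1}, "id": {"42": 1}}},
    "comics": {"distribution": {"title": {"Batman: the killing joke": 1}}},
}

merged = merge_facet_distributions(facets_by_index, max_values_per_facet=10)
print(merged)
```

With this sample data the `title` facet ends up holding values from both indexes, while `id` only carries the value found in `movies`.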
#### Experimental — New `STARTS WITH` filter operator

Enable the `containsFilter` experimental feature to use the `STARTS WITH` filter operator:

```bash
curl \
  -X PATCH 'http://localhost:7700/experimental-features/' \
  -H 'Content-Type: application/json' \
  --data-binary '{
    "containsFilter": true
  }'
```

Use the `STARTS WITH` operator when filtering:

```bash
curl \
  -X POST http://localhost:7700/indexes/movies/search \
  -H 'Content-Type: application/json' \
  --data-binary '{
    "filter": "hero STARTS WITH spider"
  }'
```

🗣️ This is an experimental feature, and we need your help to improve it! Share your thoughts and feedback on this [GitHub discussion](https://togithub.com/orgs/meilisearch/discussions/763).

Done by [@Kerollmops](https://togithub.com/Kerollmops) in [#4939](https://togithub.com/meilisearch/meilisearch/issues/4939).

#### Other improvements

- Language support and [localizedAttributes settings](https://www.meilisearch.com/docs/reference/api/settings#localized-attributes) by [@ManyTheFish](https://togithub.com/ManyTheFish) in [#4937](https://togithub.com/meilisearch/meilisearch/issues/4937)
  - Add ISO-639-1 variants
  - Convert ISO-639-1 into ISO-639-3
- Add a German language tokenizer by [@luflow](https://togithub.com/luflow) in [meilisearch/charabia#303](https://togithub.com/meilisearch/charabia/issues/303) and in [#4945](https://togithub.com/meilisearch/meilisearch/issues/4945)
- Improve Turkish language support by [@tkhshtsh0917](https://togithub.com/tkhshtsh0917) in [meilisearch/charabia#305](https://togithub.com/meilisearch/charabia/issues/305) and in [#4957](https://togithub.com/meilisearch/meilisearch/issues/4957)
- Upgrade "batch failed" log to error level in [#4955](https://togithub.com/meilisearch/meilisearch/issues/4955) by [@dureuill](https://togithub.com/dureuill).
- Update the search UI: remove the forced capitalized fields, by [@curquiza](https://togithub.com/curquiza) in [#4993](https://togithub.com/meilisearch/meilisearch/issues/4993)

### Fixes 🐞

- ⚠️ When using federated search, `query.facets` was silently ignored at the query level, but should not have been. It now returns the appropriate error. Use `federation.facetsByIndex` instead if you want facets to be applied during federated search.
- Prometheus `/metrics` returns the route pattern instead of the real route when returning the HTTP requests total by [@irevoire](https://togithub.com/irevoire) in [#4839](https://togithub.com/meilisearch/meilisearch/issues/4839)
- Truncate values at the end of a list of facet values when the number of facet values is larger than `maxValuesPerFacet`. For example, setting `maxValuesPerFacet` to `2` could result in `["blue", "red", "yellow"]` being truncated to `["blue", "yellow"]` instead of `["blue", "red"]`. By [@dureuill](https://togithub.com/dureuill) in [#4929](https://togithub.com/meilisearch/meilisearch/issues/4929)
- Improve the task cancellation when vectors are used, by [@irevoire](https://togithub.com/irevoire) in [#4971](https://togithub.com/meilisearch/meilisearch/issues/4971)
- Swedish support: the characters `å`, `ä`, `ö` are no longer normalized to `a` and `o`.
By [@ManyTheFish](https://togithub.com/ManyTheFish) in [#4945](https://togithub.com/meilisearch/meilisearch/issues/4945) - Update rhai to fix an internal error when [updating documents with a function](https://togithub.com/orgs/meilisearch/discussions/762) (experimental) by [@irevoire](https://togithub.com/irevoire) in [#4960](https://togithub.com/meilisearch/meilisearch/issues/4960) - Fix the bad experimental search queue size by [@irevoire](https://togithub.com/irevoire) in [#4992](https://togithub.com/meilisearch/meilisearch/issues/4992) - Do not send empty edit document by function by [@irevoire](https://togithub.com/irevoire) in [#5001](https://togithub.com/meilisearch/meilisearch/issues/5001) - Display vectors when no custom vectors were ever provided by [@dureuill](https://togithub.com/dureuill) in [#5008](https://togithub.com/meilisearch/meilisearch/issues/5008) ### Misc - Dependencies updates - Security dependency upgrade: bump quinn-proto from 0.11.3 to 0.11.8 by [@dependabot](https://togithub.com/dependabot) in [#4911](https://togithub.com/meilisearch/meilisearch/issues/4911) - CIs and tests - Make the tests run faster by [@irevoire](https://togithub.com/irevoire) in [#4808](https://togithub.com/meilisearch/meilisearch/issues/4808) - Documentation - Fix broken links in README by [@iornstein](https://togithub.com/iornstein) in [#4943](https://togithub.com/meilisearch/meilisearch/issues/4943) - Misc - Allow Meilitool to upgrade from v1.9 to v1.10 without a dump in some conditions, by [@dureuill](https://togithub.com/dureuill) in [#4912](https://togithub.com/meilisearch/meilisearch/issues/4912) - Fix bench by adding embedder by [@dureuill](https://togithub.com/dureuill) in [#4954](https://togithub.com/meilisearch/meilisearch/issues/4954) - Revamp analytics by [@irevoire](https://togithub.com/irevoire) in [#5011](https://togithub.com/meilisearch/meilisearch/issues/5011) ❤️ Thanks again to our external contributors: - 
[Meilisearch](https://togithub.com/meilisearch/meilisearchg): [@iornstein](https://togithub.com/iornstein).
- [Charabia](https://togithub.com/meilisearch/charabia): [@luflow](https://togithub.com/luflow), [@tkhshtsh0917](https://togithub.com/tkhshtsh0917).

### [`v1.10.3`](https://togithub.com/meilisearch/meilisearch/releases/tag/v1.10.3): 🦩

[Compare Source](https://togithub.com/meilisearch/meilisearch/compare/v1.10.2...v1.10.3)

#### Search improvements

This release lets you configure two behaviors of the engine through experimental CLI flags:

- The number of searches Meilisearch can process concurrently per core with the [`--experimental-nb-searches-per-core`](https://togithub.com/orgs/meilisearch/discussions/784) CLI flag
- After how many seconds Meilisearch can consider a search as irrelevant and drop it straight away without processing it with the [`--experimental-drop-search-after`](https://togithub.com/orgs/meilisearch/discussions/783) CLI flag

Done by [@irevoire](https://togithub.com/irevoire) in [https://github.com/meilisearch/meilisearch/pull/5000](https://togithub.com/meilisearch/meilisearch/pull/5000)

**Full Changelog**: https://github.com/meilisearch/meilisearch/compare/v1.10.2...v1.10.3

### [`v1.10.2`](https://togithub.com/meilisearch/meilisearch/releases/tag/v1.10.2): 🦩

[Compare Source](https://togithub.com/meilisearch/meilisearch/compare/v1.10.1...v1.10.2)

#### Fixes 🦋

##### Activate the Swedish tokenization Pipeline

The Swedish tokenization pipeline was deactivated in previous versions. It is now activated when you specify the index language in the settings:

##### PATCH `/indexes/:index-name/settings`

```json
{
  "localizedAttributes": [
    { "locales": ["swe"], "attributePatterns": ["*"] }
  ]
}
```

Related PR: [#4949](https://togithub.com/meilisearch/meilisearch/issues/4949)

### [`v1.10.1`](https://togithub.com/meilisearch/meilisearch/releases/tag/v1.10.1): 🦩

[Compare Source](https://togithub.com/meilisearch/meilisearch/compare/v1.10.0...v1.10.1)

#### Fixes 🦋

##### Better search handling under heavy loads

All of the following PRs should make Meilisearch behave better under heavy loads:

- Only spawn one search queue in actix-web by [@irevoire](https://togithub.com/irevoire) in [https://github.com/meilisearch/meilisearch/pull/4893](https://togithub.com/meilisearch/meilisearch/pull/4893)
- Make sure the index scheduler never stops running by [@irevoire](https://togithub.com/irevoire) in [https://github.com/meilisearch/meilisearch/pull/4896](https://togithub.com/meilisearch/meilisearch/pull/4896)
- Explicitly drop the search permits by [@irevoire](https://togithub.com/irevoire) in [https://github.com/meilisearch/meilisearch/pull/4898](https://togithub.com/meilisearch/meilisearch/pull/4898)
- Stop trying to process searches after one minute by [@irevoire](https://togithub.com/irevoire) in [https://github.com/meilisearch/meilisearch/pull/4899](https://togithub.com/meilisearch/meilisearch/pull/4899)

#### Speed improvement 🐎

Document deletions and document deletions by filter can now be autobatched together, which should unclog the task queue for people who use these two operations heavily. Meilisearch still cannot autobatch document deletion by filter together with document addition, though.
- Autobatch document deletion by filter by [@irevoire](https://togithub.com/irevoire) in [https://github.com/meilisearch/meilisearch/pull/4901](https://togithub.com/meilisearch/meilisearch/pull/4901)
- Do not fail the whole batch when a single document deletion by filter fails by [@irevoire](https://togithub.com/irevoire) in [https://github.com/meilisearch/meilisearch/pull/4905](https://togithub.com/meilisearch/meilisearch/pull/4905)

**Full Changelog**: https://github.com/meilisearch/meilisearch/compare/v1.10.0...v1.10.1

### [`v1.10.0`](https://togithub.com/meilisearch/meilisearch/releases/tag/v1.10.0): 🦩

[Compare Source](https://togithub.com/meilisearch/meilisearch/compare/v1.9.1...v1.10.0)

Meilisearch v1.10 introduces federated search. This innovative feature allows you to receive a single list of results for multi-search requests. v1.10 also includes a setting to manually define which language or languages are present in your documents, and two new experimental features: the `CONTAINS` filter operator and the ability to update a subset of your dataset with a function.

🧰 All official Meilisearch integrations (including SDKs, clients, and other tools) are compatible with this Meilisearch release. Integration deployment happens between 4 to 48 hours after a new version becomes available.

Some SDKs might not include all new features. Consult the project repository for detailed information. Is a feature you need missing from your chosen SDK? Create an issue letting us know you need it, or, for open-source karma points, open a PR implementing it (we'll love you for that ❤️).
### New features and updates 🔥

#### Federated search

Use the new `federation` setting of the `/multi-search` route to return a single search result object:

```bash
curl \
  -X POST 'http://localhost:7700/multi-search' \
  -H 'Content-Type: application/json' \
  --data-binary '{
    "federation": {
      "offset": 5,
      "limit": 10
    },
    "queries": [
      { "q": "Batman", "indexUid": "movies" },
      { "q": "Batman", "indexUid": "comics" }
    ]
  }'
```

Response:

```json5
{
  "hits": [
    {
      "id": 42,
      "title": "Batman returns",
      "overview": "..",
      "_federation": { "indexUid": "movies", "queriesPosition": 0 }
    },
    {
      "comicsId": "batman-killing-joke",
      "description": "..",
      "title": "Batman: the killing joke",
      "_federation": { "indexUid": "comics", "queriesPosition": 1 }
    },
    …
  ],
  processingTimeMs: 0,
  limit: 20,
  offset: 0,
  estimatedTotalHits: 2,
  semanticHitCount: 0,
}
```

When performing a federated search, Meilisearch merges the results coming from different sources in descending ranking score order.

If `federation` is empty (`{}`), Meilisearch sets `offset` and `limit` to 0 and 20 respectively.

If `federation` is `null` or missing, multi-search returns one list of search result objects for each index.

##### Federated results relevancy

When performing federated searches, use `federationOptions` in the request's `queries` array to configure the relevancy and the weight of each index:

```bash
curl \
  -X POST 'http://localhost:7700/multi-search' \
  -H 'Content-Type: application/json' \
  --data-binary '{
    "federation": {},
    "queries": [
      {
        "q": "apple red",
        "indexUid": "fruits",
        "filter": "BOOSTED = true",
        "_showRankingScore": true,
        "federationOptions": { "weight": 3.0 }
      },
      {
        "q": "apple red",
        "indexUid": "fruits",
        "_showRankingScore": true
      }
    ]
  }'
```

`federationOptions` must be an object.
It supports a single field, `weight`, which must be a positive floating-point number: - if `weight` < `1.0`, results from this index are **less** likely to appear in the results - if `weight` > `1.0`, results from this index are **more** likely to appear in the results - if not specified, `weight` defaults to `1.0` 📖 Consult the [usage page](https://meilisearch.notion.site/v1-10-federated-search-698dfe36ab6b4668b044f735fb40f0b2) for more information about the merge algorithm. Done by [@dureuill](https://togithub.com/dureuill) in [#4769](https://togithub.com/meilisearch/meilisearch/issues/4769). #### Experimental: `CONTAINS` filter operator Enable the `containsFilter` experimental feature to use the `CONTAINS` filter operator: ```bash curl \ -X PATCH 'http://localhost:7700/experimental-features/' \ -H 'Content-Type: application/json' \ --data-binary '{ "containsFilter": true }' ``` `CONTAINS` filters results containing partial matches to the specified string, similar to a SQL `LIKE`: ```bash curl \ -X POST http://localhost:7700/indexes/movies/search \ -H 'Content-Type: application/json' \ --data-binary '{ "q": "super hero", "filter": "synopsis CONTAINS spider" }' ``` 🗣️ This is an experimental feature, and we need your help to improve it! Share your thoughts and feedback on this [GitHub discussion](https://togithub.com/orgs/meilisearch/discussions/763). Done by [@irevoire](https://togithub.com/irevoire) in [#4804](https://togithub.com/meilisearch/meilisearch/issues/4804). #### Language settings Use the new `localizedAttributes` index setting and the `locales` search parameter to explicitly set the languages used in document fields and the search query itself. This is particularly useful for <=v1.9 users who have to occasionally resort to alternative Meilisearch images due to language auto-detect issues in [Swedish](https://togithub.com/meilisearch/meilisearch/pull/4604) and [Japanese](https://togithub.com/meilisearch/meilisearch/pull/3882) datasets. 
Done by [@ManyTheFish](https://togithub.com/ManyTheFish) in [#4819](https://togithub.com/meilisearch/meilisearch/issues/4819).

##### Set language during indexing with `localizedAttributes`

Use the newly introduced `localizedAttributes` setting to explicitly declare which languages correspond to which document fields:

```bash
curl \
  -X PATCH 'http://localhost:7700/indexes/movies/settings' \
  -H 'Content-Type: application/json' \
  --data-binary '{
    "localizedAttributes": [
      {"locales": ["jpn"], "attributePatterns": ["*_ja"]},
      {"locales": ["eng"], "attributePatterns": ["*_en"]},
      {"locales": ["cmn"], "attributePatterns": ["*_zh"]},
      {"locales": ["fra", "ita"], "attributePatterns": ["latin.*"]},
      {"locales": [], "attributePatterns": ["*"]}
    ]
  }'
```

`locales` is a list of ISO-639-3 language codes to assign to a pattern. The currently supported languages are: `epo`, `eng`, `rus`, `cmn`, `spa`, `por`, `ita`, `ben`, `fra`, `deu`, `ukr`, `kat`, `ara`, `hin`, `jpn`, `heb`, `yid`, `pol`, `amh`, `jav`, `kor`, `nob`, `dan`, `swe`, `fin`, `tur`, `nld`, `hun`, `ces`, `ell`, `bul`, `bel`, `mar`, `kan`, `ron`, `slv`, `hrv`, `srp`, `mkd`, `lit`, `lav`, `est`, `tam`, `vie`, `urd`, `tha`, `guj`, `uzb`, `pan`, `aze`, `ind`, `tel`, `pes`, `mal`, `ori`, `mya`, `nep`, `sin`, `khm`, `tuk`, `aka`, `zul`, `sna`, `afr`, `lat`, `slk`, `cat`, `tgl`, `hye`.

`attributePatterns` is a list of patterns, each of which can start or end with a `*` to match one or several attributes.

If an attribute matches several rules, only the first rule in the list will be applied. If the `locales` list is empty, then Meilisearch is allowed to auto-detect any language in the matching attributes.

These rules are applied to the `searchableAttributes`, the `filterableAttributes`, and the `sortableAttributes`.

##### Set language at search time with `locales`

The `/search` route accepts a new parameter, `locales`.
Use it to define the language used in the current query: ```bash curl \ -X POST http://localhost:7700/indexes/movies/search \ -H 'Content-Type: application/json' \ --data-binary '{ "q": "進撃の巨人", "locales": ["jpn"] }' ``` The `locales` parameter overrides eventual `locales` in the index settings. #### Experimental: Edit documents with a Rhai function Use a [Rhai function](https://rhai.rs/) to edit documents in your database directly from Meilisearch: First, activate the experimental feature: ```bash curl \ -X PATCH 'http://localhost:7700/experimental-features/' \ -H 'Content-Type: application/json' \ --data-binary '{ "editDocumentsByFunction": true }' ``` Then query the `/documents/edit` route with the editing function: ```bash curl http://localhost:7700/indexes/movies/documents/edit \ -H 'content-type: application/json' \ -d '{ "function": "doc.title = `✨ ${doc.title.to_upper()} ✨`", "filter": "id > 3000" }' ``` `/documents/edit` accepts three parameters in its payload: `function`, `filter`, and `context`. `function` must be a string with a Rhai function. `filter` must be a [filter expression.](https://www.meilisearch.com/docs/learn/filtering_and_sorting/filter_expression_reference). `context` must be an object with data you want to make available for the editing function. 📖 More information [here](https://meilisearch.notion.site/Update-Documents-by-Function-0cff8fea7655436592e7c8a6de932062). 🗣️ This is an experimental feature and we need your help to improve it! Share your thoughts and feedback on this [GitHub discussion](https://togithub.com/orgs/meilisearch/discussions/762). Done by [@Kerollmops](https://togithub.com/Kerollmops) in [#4626](https://togithub.com/meilisearch/meilisearch/issues/4626). #### Experimental AI-powered search: quality of life improvements For the purpose of future stabilization of the feature, we are applying changes and quality-of-life improvements. 
Done by [@dureuill](https://togithub.com/dureuill) in [#4801](https://togithub.com/meilisearch/meilisearch/issues/4801), [#4815](https://togithub.com/meilisearch/meilisearch/issues/4815), [#4818](https://togithub.com/meilisearch/meilisearch/issues/4818), [#4822](https://togithub.com/meilisearch/meilisearch/issues/4822).

##### ⚠️ Breaking changes: Changing the parameters of the REST API

The old parameters of the REST API are too numerous and confusing.

Removed parameters: `query`, `inputField`, `inputType`, `pathToEmbeddings` and `embeddingObject`.

Replaced by:

- `request`: A JSON value that represents the request made by Meilisearch to the remote embedder. The text to embed must be replaced by the placeholder value `"{{text}}"`.
- `response`: A JSON value that represents a fragment of the response made by the remote embedder to Meilisearch. The embedding must be replaced by the placeholder value `"{{embedding}}"`.

Before and after:

```json5
// v1.10 version ✅
{
  "source": "rest",
  "url": "https://localhost:10006",
  "request": {
    "model": "minillm",
    "prompt": "{{text}}"
  },
  "response": {
    "embedding": "{{embedding}}"
  }
}
```

```json5
// v1.9 version ❌
{
  "source": "rest",
  "url": "https://localhost:10006",
  "query": {
    "model": "minillm",
  },
  "inputField": ["prompt"],
  "inputType": "text",
  "embeddingObject": ["embedding"]
}
```

> [!CAUTION]
> This is a breaking change to the configuration of REST embedders.
> Importing a dump containing a REST embedder configuration will fail in v1.10 with an error: "Error: unknown field `query`, expected one of `source`, `model`, `revision`, `apiKey`, `dimensions`, `documentTemplate`, `url`, `request`, `response`, `distribution` at line 1 column 752".

Upgrade procedure:

1. Remove embedders with source `"rest"`
2. Update your [Meilisearch Cloud project](https://www.meilisearch.com/docs/learn/update_and_migration/updating#updating-meilisearch-cloud) or [self-hosted Meilisearch instance](https://www.meilisearch.com/docs/learn/update_and_migration/updating#updating-a-self-hosted-meilisearch-instance) as usual

##### Add custom headers to REST embedders

When the `source` of an embedder is set to `rest`, you may include an optional `headers` parameter. Use this to configure custom headers you want Meilisearch to include in the requests it sends the embedder.

Embedding requests sent from Meilisearch to a remote REST embedder always contain two headers:

- `Authorization: Bearer <apiKey>` (only if `apiKey` was provided)
- `Content-Type: application/json`

When provided, `headers` should be a JSON object whose keys represent the names of additional headers to send in requests, and whose values represent the values of these additional headers.

If `headers` is missing or `null` for a `rest` embedder, only `Authorization` and `Content-Type` are sent, as described above.

If `headers` contains `Authorization` and `Content-Type`, the declared values will override the ones that are sent by default.

Using the `headers` parameter for any `source` other than `rest` results in an `invalid_settings_embedder` error.

##### Other quality-of-life improvements

📖 More details [here](https://meilisearch.notion.site/v1-10-AI-search-changes-737c9d7d010d4dd685582bf5dab579e2)

- Add `url` parameter to the OpenAI embedder. `url` should be a URL to the embedding endpoint (including the `v1/embeddings` part) from OpenAI. If `url` is missing or `null` for an `openAi` embedder, the default OpenAI embedding route will be used (https://api.openai.com/v1/embeddings).
- `dimensions` is now available as an optional parameter for `ollama` embedders. Previously it was only available for `rest`, `openAi` and `userProvided` embedders.
- Previously `_vectors.embedder` was omitted for documents without at least one embedding for `embedder`. This was inconsistent and prevented the user from checking the value of `regenerate`.
- When a request to a REST embedder fails, the duration of the exponential backoff is now randomized up to twice its base duration
- Truncate rather than embed by chunk when OpenAI embeddings are bigger than the max number of tokens
- Improve error message when indexing documents and embeddings are missing for a user-provided embedder
- Improve error message when a model configuration cannot be loaded and its "architectures" field does not contain "BertModel"

#### ⚠️ Important change regarding the minimal Ubuntu version compatible with Meilisearch

Because the GitHub Actions runner now enforces the usage of a Node version that is no longer compatible with Ubuntu 18.04, we had to upgrade the minimal Ubuntu version compatible with Meilisearch. Indeed, we use these GitHub actions to build and provide our binaries.

Meilisearch is now only compatible with Ubuntu 20.04 and later, and no longer with Ubuntu 18.04.

Done by [@curquiza](https://togithub.com/curquiza) in [#4783](https://togithub.com/meilisearch/meilisearch/issues/4783).
#### Other improvements - Search speed optimization: implement intersection at the end of the search pipeline by [@Kerollmops](https://togithub.com/Kerollmops) in [#4717](https://togithub.com/meilisearch/meilisearch/issues/4717) - Indexing speed optimization: stop opening indexes to only check if they exist by [@Karribalu](https://togithub.com/Karribalu) in [#4787](https://togithub.com/meilisearch/meilisearch/issues/4787) - Improve tenant token error messages by [@irevoire](https://togithub.com/irevoire) in [#4724](https://togithub.com/meilisearch/meilisearch/issues/4724) - Add null byte as hard context separator by [@LukasKalbertodt](https://togithub.com/LukasKalbertodt) in [meilisearch/charabia#295](https://togithub.com/meilisearch/charabia/issues/295) - Adds all [math symbols](https://www.compart.com/en/unicode/category/Sm) to the default separator list by [@phillitrOSU](https://togithub.com/phillitrOSU) in [meilisearch/charabia#301](https://togithub.com/meilisearch/charabia/issues/301) - Errors emitted at the main level of the Meilisearch binary are now logged with level `ERROR` by [@dureuill](https://togithub.com/dureuill) in [#4835](https://togithub.com/meilisearch/meilisearch/issues/4835) ### Fixes 🐞 - Fix invalid primary key for big numbers [@JWSong](https://togithub.com/JWSong) in [#4725](https://togithub.com/meilisearch/meilisearch/issues/4725) - Fix wrong HTTP status and confusing error message on wrong payload by [@Karribalu](https://togithub.com/Karribalu) in [#4716](https://togithub.com/meilisearch/meilisearch/issues/4716) - Fix the missing geo distance when one or both of the lat/lng are string by [@irevoire](https://togithub.com/irevoire) in [#4731](https://togithub.com/meilisearch/meilisearch/issues/4731) - Fix errors related to `OffsetDateTime`: use a fixed date format regardless of features by [@dureuill](https://togithub.com/dureuill) in [#4850](https://togithub.com/meilisearch/meilisearch/issues/4850) - Fix filter that doesn't return valid 
documents by [@dureuill](https://togithub.com/dureuill) in [#4864](https://togithub.com/meilisearch/meilisearch/issues/4864) & [#4858](https://togithub.com/meilisearch/meilisearch/issues/4858) ### Misc - Dependencies updates - Update most of the dependencies by [@irevoire](https://togithub.com/irevoire) in [#4786](https://togithub.com/meilisearch/meilisearch/issues/4786) - Update yaup by [@irevoire](https://togithub.com/irevoire) in [#4703](https://togithub.com/meilisearch/meilisearch/issues/4703) - Bump docker/build-push-action from 5 to 6 by [@dependabot](https://togithub.com/dependabot) in [#4758](https://togithub.com/meilisearch/meilisearch/issues/4758) - Bump zerovec from 0.10.1 to 0.10.4 by [@dependabot](https://togithub.com/dependabot) in [#4785](https://togithub.com/meilisearch/meilisearch/issues/4785) - Update rustls as much as possible by [@irevoire](https://togithub.com/irevoire) in [#4806](https://togithub.com/meilisearch/meilisearch/issues/4806) - CIs and tests - Fix CI with Rust v1.79 by [@dureuill](https://togithub.com/dureuill) in [#4723](https://togithub.com/meilisearch/meilisearch/issues/4723) - Fix flaky test by [@irevoire](https://togithub.com/irevoire) in [#4730](https://togithub.com/meilisearch/meilisearch/issues/4730) - Specify the rust toolchain by [@irevoire](https://togithub.com/irevoire) in [#4706](https://togithub.com/meilisearch/meilisearch/issues/4706) - Add `vX` Docker tag when publishing Docker image by [@curquiza](https://togithub.com/curquiza) in [#4761](https://togithub.com/meilisearch/meilisearch/issues/4761) - Add search benchmarks by [@dureuill](https://togithub.com/dureuill) in [#4762](https://togithub.com/meilisearch/meilisearch/issues/4762) - Add tests on the rest embedder by [@irevoire](https://togithub.com/irevoire) and [@dureuill](https://togithub.com/dureuill) in [#4755](https://togithub.com/meilisearch/meilisearch/issues/4755) - Add OpenAI tests by [@dureuill](https://togithub.com/dureuill) in 
[#4846](https://togithub.com/meilisearch/meilisearch/issues/4846) - Documentation - Add june 11th webinar banner by [@Strift](https://togithub.com/Strift) in [#4691](https://togithub.com/meilisearch/meilisearch/issues/4691) - Revert "Add june 11th webinar banner" by [@curquiza](https://togithub.com/curquiza) in [#4705](https://togithub.com/meilisearch/meilisearch/issues/4705) - Update the README to link more demos by [@Kerollmops](https://togithub.com/Kerollmops) in [#4711](https://togithub.com/meilisearch/meilisearch/issues/4711) - Update README.md by [@Strift](https://togithub.com/Strift) in [#4721](https://togithub.com/meilisearch/meilisearch/issues/4721) - Change the Meilisearch logo to the kawaii version by [@Kerollmops](https://togithub.com/Kerollmops) in [#4778](https://togithub.com/meilisearch/meilisearch/issues/4778) - Misc - New workload to ignore the initial compression phase by [@Kerollmops](https://togithub.com/Kerollmops) in [#4773](https://togithub.com/meilisearch/meilisearch/issues/4773) - Rename the sortable into the filterable movies workload by [@Kerollmops](https://togithub.com/Kerollmops) in [#4774](https://togithub.com/meilisearch/meilisearch/issues/4774) - Correct apk usages in Dockerfile by [@PeterDaveHello](https://togithub.com/PeterDaveHello) in [#4781](https://togithub.com/meilisearch/meilisearch/issues/4781) - Make milli use edition 2021 by [@hanbings](https://togithub.com/hanbings) in [#4770](https://togithub.com/meilisearch/meilisearch/issues/4770) - Allow `MEILI_NO_VERGEN` env var to skip vergen by [@dureuill](https://togithub.com/dureuill) in [#4812](https://togithub.com/meilisearch/meilisearch/issues/4812) ❤️ Thanks again to our external contributors: - [Meilisearch](https://togithub.com/meilisearch/meilisearch): [@Karribalu](https://togithub.com/Karribalu), [@hanbings](https://togithub.com/hanbings), [@junhochoi](https://togithub.com/junhochoi), [@JWSong](https://togithub.com/JWSong), 
[@PeterDaveHello](https://togithub.com/PeterDaveHello).
- [Charabia](https://togithub.com/meilisearch/charabia): [@LukasKalbertodt](https://togithub.com/LukasKalbertodt), [@phillitrOSU](https://togithub.com/phillitrOSU).

### [`v1.9.1`](https://togithub.com/meilisearch/meilisearch/releases/tag/v1.9.1): 🦎

[Compare Source](https://togithub.com/meilisearch/meilisearch/compare/v1.9.0...v1.9.1)

#### Fixes 🪲

- Return an empty list of embeddings for embedders that have no document for an embedder, by [@dureuill](https://togithub.com/dureuill) in [https://github.com/meilisearch/meilisearch/pull/4889](https://togithub.com/meilisearch/meilisearch/pull/4889)

  This fixes an issue where dumps created for indexes with:

  1. A user-provided embedder
  2. At least one document that opts out of vectors for that user-provided embedder

  would fail to import correctly.

#### Upgrade path to v1.10.0 🚀

If you are a Cloud user affected by the above issue, please contact customer support so we can perform the upgrade for you.

If you are an OSS user affected by the above, perform the following operations:

1. Upgrade from v1.9.0 to v1.9.1 without using a dump
2. Upgrade to v1.10.0 using a dump created from v1.9.1

[**Full Changelog**](https://togithub.com/meilisearch/meilisearch/compare/v1.9.0...v1.9.1)

### [`v1.9.0`](https://togithub.com/meilisearch/meilisearch/releases/tag/v1.9.0): 🦎

[Compare Source](https://togithub.com/meilisearch/meilisearch/compare/v1.8.4...v1.9.0)

Meilisearch v1.9 includes performance improvements for hybrid search and the addition/updating of settings. This version benefits from multiple requested features, such as the new `frequency` matching strategy and the ability to retrieve similar documents.

🧰 All official Meilisearch integrations (including SDKs, clients, and other tools) are compatible with this Meilisearch release. Integration deployment happens between 4 and 48 hours after a new version becomes available.

Some SDKs might not include all new features.
Consult the project repository for detailed information. Is a feature you need missing from your chosen SDK? Create an issue letting us know you need it, or, for open-source karma points, open a PR implementing it (we'll love you for that ❤️). ### New features and updates 🔥 #### Hybrid search updates This release introduces multiple [hybrid search updates](https://meilisearch.notion.site/v1-9-AI-search-changes-e90d6803eca8417aa70a1ac5d0225697#38e6d3adf40e4ef1be14a3c4be39df94). Done by [@dureuill](https://togithub.com/dureuill) and [@irevoire](https://togithub.com/irevoire) in [#4633](https://togithub.com/meilisearch/meilisearch/issues/4633) and [#4649](https://togithub.com/meilisearch/meilisearch/issues/4649) ##### ⚠️ Breaking change: Empty `_vectors.embedder` arrays Empty `_vectors.embedder` arrays are now interpreted as having no vector embedding. Before v1.9, Meilisearch interpreted these as a single embedding of dimension 0. This change follows user feedback that the previous behavior was unexpected and unhelpful. ##### ⚠️ Breaking change: `_vectors` field no longer present in search results When the experimental `vectorStore` feature is enabled, Meilisearch no longer includes `_vectors` in returned search results by default. This will considerably improve performance. Use the new `retrieveVectors` search parameter to display the `_vectors` field: ```sh curl \ -X POST 'http://localhost:7700/indexes/INDEX_NAME/search' \ -H 'Content-Type: application/json' \ --data-binary '{ "q": "SEARCH QUERY", "retrieveVectors": true }' ``` ##### ⚠️ Breaking change: Meilisearch no longer preserves the exact representation of embeddings appearing in `_vectors` In order to save storage and run faster, Meilisearch is no longer storing your vector "as-is". Meilisearch now returns the float in a canonicalized representation rather than the user-provided representation. 
For example, `3` may be represented as `3.0` ##### Document `_vectors` accepts object values The document `_vectors` field now accepts objects in addition to embedding arrays: ```json { "id": 42, "_vectors": { "default": [0.1, 0.2 ], "text": { "embeddings": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]], "regenerate": false }, "translation": { "embeddings": [0.1, 0.2, 0.3, 0.4], "regenerate": true } } } ``` The `_vectors` object may contain two fields: `embeddings` and `regenerate`. If present, `embeddings` will replace this document's embeddings. `regenerate` must be either `true` or `false`. If `regenerate: true`, Meilisearch will overwrite the document embeddings each time the document is updated in the future. If `regenerate: false`, Meilisearch will keep the last provided or generated embeddings even if the document is updated in the future. This change allows importing embeddings to autoembedders as a one-shot process, by setting them as `regenerate: true`. This change also ensures embeddings are not regenerated when importing a dump created with Meilisearch v1.9. Meilisearch v1.9.0 also improves performance when indexing and using hybrid search, avoiding useless operations and optimizing the important ones. #### New feature: Ranking score threshold Use `rankingScoreThreshold` to exclude search results with low ranking scores: ```bash curl \ -X POST 'http://localhost:7700/indexes/movies/search' \ -H 'Content-Type: application/json' \ --data-binary '{ "q": "Badman dark returns 1", "showRankingScore": true, "limit": 5, "rankingScoreThreshold": 0.2 }' ``` Meilisearch does not return any documents below the configured threshold. Excluded results do not count towards `estimatedTotalHits`, `totalHits`, and facet distribution. ⚠️ For performance reasons, if the number of documents above `rankingScoreThreshold` is higher than `limit`, Meilisearch does not evaluate the ranking score of the remaining documents. 
Results ranking below the threshold are not immediately removed from the set of candidates. In this case, Meilisearch may overestimate the count of `estimatedTotalHits`, `totalHits` and facet distribution. Done by [@dureuill](https://togithub.com/dureuill) in [#4666](https://togithub.com/meilisearch/meilisearch/issues/4666) #### New feature: Get similar documents endpoint This release introduces a new AI-powered search feature allowing you to send a document to Meilisearch and receive a list of similar documents in return. Use the `/indexes/{indexUid}/similar` endpoint to query Meilisearch for related documents: ```sh curl \ -X POST /indexes/:indexUid/similar -H 'Content-Type: application/json' \ --data-binary '{ "id": "23", "offset": 0, "limit": 2, "filter": "release_date > 1521763199", "embedder": "default", "attributesToRetrieve": [], "showRankingScore": false, "showRankingScoreDetails": false }' ``` - `id`: string indicating the document needing similar results, required - `offset`: number of results to skip when paginating, optional, defaults to `0` - `limit`: number of results to display, optional, defaults to `20` - `filter`: string with a filter expression Meilisearch should apply to the results, optional, defaults to `null` - `embedder`: string indicating the embedder Meilisearch should use to retrieve similar documents, optional, defaults to `"default"` - `attributesToRetrieve`: array of strings ind </details> --- ### Configuration 📅 **Schedule**: Branch creation - "before 4am on the first day of the month" (UTC), Automerge - At any time (no schedule defined). 🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied. ♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox. 🔕 **Ignore**: Close this PR and you won't be reminded about this update again. 
--- - [ ] If you want to rebase/retry this PR, check this box --- This PR has been generated by [Renovate Bot](https://togithub.com/renovatebot/renovate).

feat: Adds German composition words decompound

434edde

feat: Adds some long sample words, change min suffix length to 4

98dbb6a

luflow added 4 commits August 10, 2024 17:06

feat: Adds some more words to german dictionary

6461aa4

fix: Allow max 3 characters remaining for suffix

323a47f

feat: Updates german benchmark

251b1e8

Adds ambiguous german city names to dictionary

f27587b

curquiza requested a review from ManyTheFish August 12, 2024 08:08

luflow added 3 commits August 12, 2024 14:07

Fixes clippy issues

c5e755b

Adds another dictionary word

fe43163

Fixes rust fmt issues

e534194

ManyTheFish requested changes Aug 27, 2024

charabia/src/segmenter/mod.rs (resolved)

charabia/src/segmenter/mod.rs (resolved)

charabia/src/segmenter/mod.rs (outdated, resolved)

luflow and others added 5 commits August 28, 2024 09:09

Adds german-segmentation feature flag

3882ec4

Introduces new options in fst segmenter to allow character splitting and min lemma length definition

83089b9

Uses fst segmenter instead of hash map for higher efficiency

257972f

Fixes rust fmt issues

f6999c6

Merge branch 'main' into feature/german-compound-words

61634c9

luflow requested a review from ManyTheFish August 28, 2024 11:16

ManyTheFish previously approved these changes Sep 9, 2024

ManyTheFish reviewed Sep 9, 2024

charabia/src/segmenter/utils.rs (outdated, resolved)

Update charabia/src/segmenter/utils.rs

d65941f

Co-authored-by: Many the fish <many@meilisearch.com>

luflow dismissed ManyTheFish’s stale review via d65941f September 9, 2024 09:53

luflow requested a review from ManyTheFish September 9, 2024 09:54

Fixes clippy issues

8523fa8

ManyTheFish approved these changes Sep 10, 2024

meili-bors bot merged commit 38b8529 into meilisearch:main Sep 10, 2024
4 checks passed

luflow deleted the feature/german-compound-words branch September 10, 2024 20:54


luflow commented Aug 9, 2024

luflow commented Aug 9, 2024

luflow commented Aug 10, 2024

luflow commented Aug 12, 2024 (edited)

ManyTheFish left a comment

luflow commented Aug 27, 2024

ManyTheFish commented Aug 27, 2024

luflow commented Aug 28, 2024

luflow commented Sep 7, 2024

ManyTheFish left a comment

meili-bors bot commented Sep 9, 2024

luflow commented Sep 9, 2024

ManyTheFish commented Sep 9, 2024

luflow commented Sep 9, 2024

ManyTheFish left a comment

meili-bors bot commented Sep 10, 2024
