Latin camelcase wrong segmentation #317
Merged
Pull Request
All interesting changes are in charabia/src/segmenter/latin/camel_case.rs; the other changes are import reordering caused by cargo fmt.
This change definitely looks like a small speed regression, but I am opening it for review nonetheless.
First time contributor to any meilisearch repo.
Related issue
Fixes #289
What does this PR do?
This PR changes the Latin camelCase segmentation to address the following cases:
This is done by introducing a helper iterator that lets the main iterator "peek" at the next char. As far as I know, this is needed in the current solution with StrGroupBy::linear_group_by, since the segmentation sometimes depends on whether a lowercase letter follows the one currently being analyzed (as in openSSLError).
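To illustrate why the lookahead matters, here is a minimal standalone sketch. It is not the code from camel_case.rs: the function name `split_camel_case` and its two boundary rules are assumptions made for illustration, and it uses the standard library's `Peekable` rather than the helper iterator from this PR.

```rust
/// Splits `s` at camelCase boundaries using one char of lookahead.
/// Hypothetical helper for illustration only; not the PR's implementation.
fn split_camel_case(s: &str) -> Vec<&str> {
    let mut parts = Vec::new();
    let mut start = 0; // byte offset where the current segment begins
    let mut prev: Option<char> = None;
    let mut chars = s.char_indices().peekable();

    while let Some((idx, curr)) = chars.next() {
        // Look at the following char without consuming it.
        let next = chars.peek().map(|&(_, c)| c);

        let boundary = match prev {
            // "camelCase": a lowercase letter followed by an uppercase one.
            Some(p) if p.is_lowercase() && curr.is_uppercase() => true,
            // "SSLError": inside an uppercase run, the current letter starts
            // a new segment only if a lowercase letter follows it.
            Some(p) if p.is_uppercase() && curr.is_uppercase() => {
                next.map_or(false, |n| n.is_lowercase())
            }
            _ => false,
        };

        if boundary {
            parts.push(&s[start..idx]);
            start = idx;
        }
        prev = Some(curr);
    }

    // Push the trailing segment, if any.
    if start < s.len() {
        parts.push(&s[start..]);
    }
    parts
}
```

With these assumed rules, `split_camel_case("openSSLError")` yields `["open", "SSL", "Error"]`. The split before `E` cannot be decided from the pair `(L, E)` alone, because both are uppercase; it only becomes a boundary once the following `r` is seen.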
Benchmarks
The benchmarks were run twice: once with the changes, and once more after commenting the changes out. The output here is from the second run, with the changes removed.
I believe the relevant output is the following:
PR checklist
Please check if your PR fulfills the following requirements: