Numbers are not segmented the same way depending on the Script/Language #271

ManyTheFish · 2024-02-15T12:39:10Z

Description

The string hello 95532387 is segmented as "hello", " ", "95532387", whereas the string สปริงเกียร์ร่วม 95532387 is segmented as "สปร\u{e34}ง", "เก\u{e35}ยร\u{e4c}", "ร\u{e48}วม", " ", "9", "5", "5", "3", "2", "3", "8", "7".

Expected behavior

The number 95532387 should always be segmented as "95532387".

How to solve the issue

Before calling a specialized segmenter during the segmentation process, we should check whether the interleaved string contains only numbers. If it does, we return Some(s) directly; otherwise, we call the specialized segmenter.
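A minimal sketch of the guard described above, not Charabia's actual code. The names `segment_with_guard` and `split_per_char` are hypothetical, and `split_per_char` only imitates a specialized segmenter that breaks digits apart, as in the Thai example from the description.

```rust
/// Hypothetical stand-in for a specialized segmenter: it splits its input
/// into one segment per character, which is what happens to the digits in
/// the Thai example.
fn split_per_char(s: &str) -> Vec<&str> {
    s.char_indices()
        .map(|(i, c)| &s[i..i + c.len_utf8()])
        .collect()
}

/// Returns the chunk untouched when it contains only numeric characters;
/// otherwise hands it to the specialized segmenter.
fn segment_with_guard<'a>(
    s: &'a str,
    specialized: impl Fn(&'a str) -> Vec<&'a str>,
) -> Vec<&'a str> {
    // The emptiness check is an assumption of this sketch: `all` is
    // vacuously true on an empty string.
    if !s.is_empty() && s.chars().all(|c| c.is_numeric()) {
        vec![s]
    } else {
        specialized(s)
    }
}
```

With this stand-in, `segment_with_guard("95532387", split_per_char)` keeps the number as a single segment, while a non-numeric chunk such as `"ab"` is still passed to the specialized segmenter.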

---- segmenter::japanese::test::segmenter_segment_str stdout ----
thread 'segmenter::japanese::test::segmenter_segment_str' panicked at charabia/src/segmenter/japanese.rs:144:5:
assertion `left == right` failed: 
Segmenter JapaneseSegmenter didn't segment the text as expected.

help: the `segmented` text provided to `test_segmenter!` does not corresponds to the output of the tested segmenter, it's probably due to a bug in the segmenter or a mistake in the provided segmented text.

  left: ["関西", "国際", "空港", "限定", "トート", "バッグ", " ", "1", "2", "3", "4", " ", "すもも", "も", "もも", "も", "もも", "の", "うち"]
 right: ["関西", "国際", "空港", "限定", "トート", "バッグ", " ", "1234", " ", "すもも", "も", "もも", "も", "もも", "の", "うち"]

The number "1234" is being split into individual digits. Could you suggest what additional changes are needed? I tried a few things, but none of them worked.

ManyTheFish · 2024-04-03T08:25:52Z

Hello @239yash,
Why don't you go for something simpler, like:

if s.chars().all(|c| c.is_numeric() || c.is_ascii_punctuation()) {

You may know that the separators are customizable in Charabia, meaning that they have already been processed before calling this part of the code.
Let's say the . is part of the separators, then the text 123.456 will be preprocessed as ["123", ".", "456"].

For the test case, it's expected that the digit characters will now be joined together.
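To illustrate the separator preprocessing described above, here is a rough sketch. It is not how Charabia implements separators: `split_on_separators` is a hypothetical helper, and the separator lists passed to it are examples rather than Charabia's default set.

```rust
/// Splits `text` on every character listed in `separators`, keeping each
/// separator as its own segment.
fn split_on_separators<'a>(text: &'a str, separators: &[char]) -> Vec<&'a str> {
    let mut segments = Vec::new();
    let mut start = 0;
    for (i, c) in text.char_indices() {
        if separators.contains(&c) {
            // Push the pending chunk before the separator, if any.
            if start < i {
                segments.push(&text[start..i]);
            }
            // Push the separator itself as a segment.
            segments.push(&text[i..i + c.len_utf8()]);
            start = i + c.len_utf8();
        }
    }
    // Push whatever remains after the last separator.
    if start < text.len() {
        segments.push(&text[start..]);
    }
    segments
}
```

When `.` is in the separator list, `split_on_separators("123.456", &['.', ' '])` gives `["123", ".", "456"]`, so every chunk that reaches the numeric check contains digits only. When `.` is left out, `"123.456"` stays in one piece, and the punctuation branch of the check is what keeps it together.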

42plamusse · 2024-05-02T07:36:52Z

Hello @ManyTheFish,
I'm trying my luck on this as my first issue, and it led me to two questions:

1. Are floating point numbers currently supported? In the Latin segmenter tests, 32.3 is expected to become ["32", ".", "3"].
2. The segmenter_segment_str test uses AhoSegmentedStrIter directly and not SegmentedStrIter, so languages like Thai that use the FST_SEGMENTER won't pass the integer test with the fix you proposed. Is the problem in the fix or in the test?

It looks like I got to the same point as @239yash. I will be happy to collaborate if they are still working on this issue.

ManyTheFish · 2024-05-02T14:02:21Z

Hello @42plamusse,

> Are floating point numbers currently supported? In the Latin segmenter tests, 32.3 is expected to become ["32", ".", "3"].

Yes, you're right, but the test uses the default separator set, including . which separates the lemmas linked by it. However, this set is customizable, and . could be removed from the list, meaning that it shouldn't be separated anymore, so it's possible to receive ["32.3"] but not by default.

> The segmenter_segment_str test is using AhoSegmentedStrIter directly and not SegmentedStrIter, so languages like Thai using the FST_SEGMENTER won't pass the integer test with the fix you proposed. Is it a fix issue or is it a test issue?

Yes you're right, the test macro should be updated to separate segmenter_segment_str expected output from segment. 🤔

On the other hand, we could remove numbers from the tests in the specialized tokenizer.

dqkqd · 2024-10-06T12:01:41Z

@ManyTheFish
Hi, I have created a PR for this issue.

Instead of checking numbers in SegmentedStrIter, I check them in AhoSegmentedStrIter and return a Match. Doing it this way reduces the amount of code changes, because not every test goes through SegmentedStrIter.
I also added two more number tests, one for the segmenter and one for the tokenizer.

To check for numbers, I used the method proposed above:

s.chars().all(|c| c.is_numeric() || c.is_ascii_punctuation())

However, I don't think this is a very elegant approach, because it can fail or give incorrect results in some cases (e.g. 1e5, 1.2.3).
At first, I intended to use parse::64 but this can fail if the string is too long.
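The edge cases mentioned above can be reproduced with a small sketch. `looks_like_number` is a hypothetical wrapper around the quoted check. The numeric type in the `parse::64` remark is not recoverable from the text, so the `u64` used in `fits_in_u64` is an assumption chosen only to show how parsing a long digit string can fail.

```rust
/// The check quoted above, wrapped in a hypothetical helper.
fn looks_like_number(s: &str) -> bool {
    s.chars().all(|c| c.is_numeric() || c.is_ascii_punctuation())
}

/// Parse-based alternative. `u64` is an assumption of this sketch; the
/// type used in the original comment is unknown.
fn fits_in_u64(s: &str) -> bool {
    s.parse::<u64>().is_ok()
}
```

Under these assumptions, `looks_like_number("1e5")` is false because `e` is neither numeric nor ASCII punctuation, so the string would still go to the specialized segmenter. `looks_like_number("1.2.3")` is true even though the string is not a valid number. `fits_in_u64` accepts `"95532387"` but rejects a 30-digit string, because the value overflows `u64`.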

zaira-bibi · 2024-10-10T08:49:40Z

Hi! If @dqkqd's PR gets merged, I'll be happy to take it up further to handle the cases that they mentioned give incorrect results.

ManyTheFish added the bug and good first issue labels Feb 15, 2024

ManyTheFish mentioned this issue Feb 15, 2024

Can't search documents with number in the string even separated white space meilisearch/meilisearch#4412

Open

dqkqd added a commit to dqkqd/charabia that referenced this issue Oct 6, 2024

fix: Segment number into word instead of chars (meilisearch#271)

588439c

dqkqd mentioned this issue Oct 6, 2024

fix: Segment number into word instead of chars (#271) #311

Merged

3 tasks

meili-bors bot closed this as completed in 1b48ada Oct 14, 2024

ManyTheFish mentioned this issue Nov 27, 2024

Update Charabia on Meilisearch v1.12.0 meilisearch/meilisearch#5097

Closed


ManyTheFish commented Feb 15, 2024

Abastien1734 commented Feb 28, 2024

irevoire commented Feb 28, 2024

muffpy commented Mar 18, 2024

Abastien1734 commented Mar 18, 2024

239yash commented Mar 27, 2024

ManyTheFish commented Mar 28, 2024

239yash commented Mar 28, 2024

239yash commented Mar 31, 2024 (edited)

ManyTheFish commented Apr 3, 2024

42plamusse commented May 2, 2024

ManyTheFish commented May 2, 2024

dqkqd commented Oct 6, 2024

zaira-bibi commented Oct 10, 2024
