-
Notifications
You must be signed in to change notification settings - Fork 89
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Numbers are not segmented the same way depending on the Script/Language #271
Comments
If i want to work on this, does it need to be assigned to me or just fork and start working? |
Hey @Abastien1734, |
@Abastien1734 are you still working on this or can I take over? 🙂 |
@muffpy you can take over |
@ManyTheFish Hi, I was checking for the implementation part in the code, which you pointed out in the description of this issue. I was thinking of changes, something like this in the mod.rs file-
Please help me, if I am going in the right direction. |
Hello @239yash, |
@ManyTheFish I will add the case for handling floating point numbers too, If the current approach doesn't work out. Let me add that, Will add the test cases also after implementation. Thanks! |
I have tested for floating point numbers, but it's not working out as the string "1234.5" is already broken in two pieces when passed to the above code block i.e. - "1234" and "5". This logic is working fine for pure integer strings. Can you help me with this? Should I raise a PR for this, so that you can have a look into this? One more thing, I noticed while running test cases for the given below languages, the test segmenter_segment_str is also failing.
The error printed is -
The number "1234" is getting split into individual digit strings. Can you help me with what extra changes could be done for this? I tried a few things but failed in that. |
Hello @239yash,
You may know that the separators are customizable in Charabia, meaning that they have already been processed before calling this part of the code. For the test case, it's expected that the digit characters will now be joined together. |
Hello @ManyTheFish,
It looks like I got to the same point than @239yash, I will be happy to collaborate if they are still on this issue. |
Hello @42plamusse,
Yes, you're right, but the test uses the default separator set, including
Yes you're right, the test macro should be updated to separate In an another hand, we could remove numbers from the tests in the specialized tokenizer. |
@ManyTheFish Instead of checking numbers in For checking number, I used the method proposed above. s.chars().all(|c| c.is_numeric() || c.is_ascii_punctuation()) However, I don't think this is a very elegant way to do, because it can fail or give incorrect result in some cases (e.g 1e5, 1.2.3) |
Hi! If @dqkqd's PR gets merged, I'll be happy to take it up further to handle the cases that they mentioned give incorrect results. |
Description
The string
hello 95532387
is segmented as"hello", " ", "95532387"
where the stringสปริงเกียร์ร่วม 95532387
will be segmented as"สปร\u{e34}ง", "เก\u{e35}ยร\u{e4c}", "ร\u{e48}วม", " ", "9", "5", "5", "3", "2", "3", "8", "7"
.Expected behavior
The number
95532387
should always be segmented as"95532387"
.How to solve the issue
Before calling a specialized segmenter during the segmentation process, we should check if the interleaved string only contains numbers, if not we call the specialized segmenter else we return directly
Some(s)
.Related
The text was updated successfully, but these errors were encountered: