-
Notifications
You must be signed in to change notification settings - Fork 184
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add LocaleData parameter for word/sentence segmenter #5318
Conversation
@@ -15,5 +15,7 @@ segmenter/lstm/wl_auto@1, und/Burmese_codepoints_exclusive_model4_heavy, 91369B, | |||
segmenter/lstm/wl_auto@1, und/Khmer_codepoints_exclusive_model4_heavy, 74669B, b25f5219c4b970f2 | |||
segmenter/lstm/wl_auto@1, und/Lao_codepoints_exclusive_model4_heavy, 72164B, 7e0c3ea7801791bd | |||
segmenter/lstm/wl_auto@1, und/Thai_codepoints_exclusive_model4_heavy, 72331B, c46e2e0c098c1fc1 | |||
segmenter/sentence/override@1, <singleton>, 332B, 745d858a06d47385 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
praise: these sizes look much better :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, I approve of this architecturally, despite the extra branch at runtime. I think this is the right data model, and if it turns out to be a performance issue, we can tweak the runtime algorithm a bit.
components/segmenter/src/sentence.rs
Outdated
); | ||
|
||
#[doc = icu_provider::gen_any_buffer_unstable_docs!(UNSTABLE, Self::new)] | ||
pub fn try_new_unstable<D>(provider: &D) -> Result<Self, DataError> | ||
pub fn try_new_unstable<D>(provider: &D, locale: &DataLocale) -> Result<Self, DataError> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: This function signature is still not what we agreed in the issue, but it can be fixed later (but before 2.0)
components/segmenter/src/word.rs
Outdated
|
||
fn is_default_rule(locale: &DataLocale) -> bool { | ||
let lang = locale.language(); | ||
lang != language!("fi") && lang != language!("sv") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here: please just try to load from data.
It would be okay to skip und
, but please note that in the issue we agreed that the function should take an optional locale, and you should skip loading when the locale is None.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Please just always check the data provider. Don't rely on a hard-coded list of locales.
da9e62f
to
f041805
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My comments were not fixed.
Example code for how to optionally load the locale data: if let Some(locale) = options.content_locale {
let request = // ...
match provider.load(request) {
Ok(response) => {
self.locale_specific_data = response.payload;
}
Err(DataError {
kind: DataErrorKind::IdentifierNotFound,
..
}) => {
// fall through
}
Err(e) => return Err(e),
}
} |
My understand is that current macro (
|
#3284 (comment) explains about |
df15806
to
21ecf90
Compare
components/segmenter/src/word.rs
Outdated
|
||
fn is_default_rule(locale: &DataLocale) -> bool { | ||
let lang = locale.language(); | ||
lang != language!("fi") && lang != language!("sv") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Please just always check the data provider. Don't rely on a hard-coded list of locales.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good work!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems all the previous review comments have been addressed.
Add LocaleData parameter for word/sentence segmenter
This is a part of #3284.
ICU4C has some language break rules for word and sentence segmenter, so this fix adds some rules to ICU4X per locale.
This adds LocaleData argument to all constructors. Also, locale difference is small and 2 data only, I add the override table data marker for machine state property.