Tweak how locales are generated in icu_segmenter datagen #3408
Comments
So don't filter at all? Or make segmentation languages a new option separate from locales?
Discuss with: Optional: |
I think it is reasonable to decouple segmenter languages from locales. That is, always generate all segmenter languages regardless of the locales. If some users want to reduce the data size (e.g. they only ever use LSTM and never the dictionary), we should probably provide another datagen option that generates only the partial data matching the segmenter constructor. For example, the value for the option can be
Proposal:
Updated proposal:
In the API, this is passed in as a
Note: we can support `Thai_graphclust_model4_heavy`, as well as other models such as model5 and model7, through an auxiliary key.
Currently, the segmentation locales are filtered with the same machinery as languages. We should find a better way to filter them: you usually want to carry segmentation models for all languages, because you might encounter text in any of them.
For example, the `th` model should be included even if only building for English, because we need to be able to segment `th` text, which can occur even in English documents.
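To make the difference concrete, here is a minimal sketch contrasting the current locale-based filtering with the decoupled selection discussed in the comments. It is not the ICU4X datagen API: `ModelKind`, `SegmenterModel`, `ModelSelection`, `available_models`, `filter_by_locale`, and `select_decoupled` are all hypothetical names, and the model list is illustrative only.

```rust
/// Hypothetical model families; these are not ICU4X type names.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ModelKind {
    Lstm,
    Dictionary,
}

/// Hypothetical description of one segmentation model.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SegmenterModel {
    pub language: &'static str,
    pub kind: ModelKind,
}

/// Hypothetical datagen option, kept separate from the locale list.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ModelSelection {
    All,
    Only(ModelKind),
}

/// Illustrative model list (an assumption, not the real data set).
pub fn available_models() -> Vec<SegmenterModel> {
    vec![
        SegmenterModel { language: "th", kind: ModelKind::Lstm },
        SegmenterModel { language: "th", kind: ModelKind::Dictionary },
        SegmenterModel { language: "km", kind: ModelKind::Lstm },
        SegmenterModel { language: "km", kind: ModelKind::Dictionary },
    ]
}

/// Behavior described in the issue: models are dropped unless their
/// language appears in the requested locales.
pub fn filter_by_locale(locales: &[&str]) -> Vec<SegmenterModel> {
    available_models()
        .into_iter()
        .filter(|m| locales.contains(&m.language))
        .collect()
}

/// Decoupled behavior: the locale list is deliberately ignored, and only
/// the separate model selection narrows the output.
pub fn select_decoupled(_locales: &[&str], selection: ModelSelection) -> Vec<SegmenterModel> {
    available_models()
        .into_iter()
        .filter(|m| match selection {
            ModelSelection::All => true,
            ModelSelection::Only(kind) => m.kind == kind,
        })
        .collect()
}
```

In this sketch, building for `["en"]` with `filter_by_locale` yields no models at all, while `select_decoupled` still returns the `th` models, and `ModelSelection::Only(ModelKind::Lstm)` is one possible way to reduce data size without tying the choice to locales.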