Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Final crate locations for unicodeset_parser #3849

Closed
skius opened this issue Aug 11, 2023 · 6 comments · Fixed by #4054
Closed

Final crate locations for unicodeset_parser #3849

skius opened this issue Aug 11, 2023 · 6 comments · Fixed by #4054
Assignees
Labels
S-small Size: One afternoon (small bug fix or enhancement)

Comments

@skius
Copy link
Member

skius commented Aug 11, 2023

Where should the parsing crates for UnicodeSets (icu_unicodeset_parser) and transliterators (icu_transliterator_parser) end up?

Suggestion: The UnicodeSet parser could live as a utility crate under icu_collections, as it parses into CodePointInversionListAndStringList.

@skius skius added the discuss Discuss at a future ICU4X-SC meeting label Aug 11, 2023
@skius skius added this to the 1.3 Blocking ⟨P1⟩ milestone Aug 11, 2023
@sffc
Copy link
Member

sffc commented Aug 29, 2023

Decisions:

  • Transliterator parser behind a datagen feature
  • Note that the whole icu_transliterate crate is experimental for now
  • The icu_transliteration crate should be renamed to experimental/icu_transliterate
  • UnicodeSet Parser goes to experimental/unicodeset_parse
  • Release icu_transliterate under the metacrate as an experimental component but not unicodeset_parse at this time

LGTM: @sffc @skius @robertbastian @eggrobin @Manishearth

Notes:

  • @sffc - Should it be icu_translit?
  • @skius - Looks a lot like "translate"
  • @robertbastian - We say icu_properties not icu_props
  • @eggrobin - We're not restricted by the 8-character rule in Apple C
  • @eggrobin - Some sort of regular set expression library may be a dependency of the unicodeset spec, and it is implemented by ICU4C/ICU4J
  • @Manishearth - Why not do icu_unicodeset_parser a separate crate like icu_codepointtrie_builder?
  • @sffc - icu_codepointtrie_builder is separate because of WASM/C build issues. That doesn't apply to unicodeset parser.
  • @robertbastian - unicodeset parser requires properties, like codepointtrie builder requires wasm.
  • @robertbastian - It shouldn't be a utils crate because we don't want utils depending on components
  • @sffc - I think it should be experimental/unicodeset_parse right now and eventually components/unicodeset_parse

@sffc sffc removed the discuss Discuss at a future ICU4X-SC meeting label Aug 29, 2023
@sffc sffc added the S-small Size: One afternoon (small bug fix or enhancement) label Aug 29, 2023
@skius
Copy link
Member Author

skius commented Aug 29, 2023

On the topic of icu_transliterate vs icu_transliterator - there's icu_normalizer, icu_collator, icu_segmenter (noun-based) - is icu_casemap (i.e., verb-based) the example we want to follow?

@Manishearth
Copy link
Member

I would lean towards more consistency (and we still have time to rename casemap) but I do kinda like the shorter name

@eggrobin
Copy link
Member

I prefer transliterator regardless of consistency.

@sffc
Copy link
Member

sffc commented Aug 29, 2023

Oh, yeah, I would consider it a mistake that we did noun-based names for the older crates. I think using "er"/"or" for the crate name is wrong because the crate contains functionality and doesn't declare exactly how that functionality is surfaced. The icu_transliterate crate helps you transliterate things, and the Transliterator is the way you do it.

Go English making half of these end with "er" and half with "or".

We don't have the suffix for the formatter crates like icu_decimal, icu_list, and icu_datetime.

If we were starting from scratch I'd like to see:

  • icu_collate::Collator
  • icu_transliterate::Transliterator
  • icu_casemap::CaseMapper
  • icu_normalize::Normalizer
  • icu_segments::Segmenter

I think casemap is different enough that we should stick with icu_casemap.

But I'm torn between icu_transliterate and icu_transliterator between consistency and correctness.

Note that googling for "transliterate" gives much more useful results than "transliterator". So "transliterate" is better from an education point of view.

How much are people going to really notice the difference between "collator" and "transliterate"? They serve different users with different use cases.

So I think I lean toward icu_transliterate.

@robertbastian
Copy link
Member

icu_segment::Segmenter, no? icu_<verb>::<verb>+[eo]r

@robertbastian robertbastian changed the title Final crate locations for unicodeset_parser, transliterator_parser Final crate locations for unicodeset_parser Sep 13, 2023
@robertbastian robertbastian removed their assignment Sep 14, 2023
@robertbastian robertbastian self-assigned this Sep 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-small Size: One afternoon (small bug fix or enhancement)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants