API for accessing Unicode properties #148

Closed
6 tasks done
sffc opened this issue Jun 24, 2020 · 8 comments · Fixed by #1204
Assignees
Labels
A-design Area: Architecture or design C-unicode Component: Props, sets, tries S-epic Size: Major project (create smaller child issues) T-core Type: Required functionality
Milestone

Comments

@sffc
Member

sffc commented Jun 24, 2020

Sub-issues:


With UnicodeSet (#91) and UCPTrie (#132) coming along, we should start thinking about what the API will look like for accessing Unicode properties.

A simple and clean solution would be a bunch of functions returning either UnicodeSet or UCPTrie, such as:

// Binary Unicode property
pub fn get_whitespace_set() -> Result<UnicodeSet, Error>;

// Enumerated Unicode property
pub fn get_digits_trie() -> Result<UCPTrie, Error>;

These functions would pull from the data provider. The data provider produces serialized sets or tries, and these functions are pretty thin wrappers that convert the serialized format to a Rust UnicodeSet or UCPTrie.
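
To make the "thin wrapper" idea concrete, here is a minimal sketch, assuming the provider hands back a binary property as an inversion list of code points. Everything in it other than the `get_whitespace_set` signature is hypothetical: the `Error` variants, the `UnicodeSet` internals, `from_inversion_list`, `load_serialized_set`, and the sample data (a small illustrative subset of White_Space) are stand-ins, not the actual ICU4X types or data.

```rust
// Hypothetical stand-ins for illustration; these are not the real ICU4X types.

#[derive(Debug, PartialEq)]
pub enum Error {
    MissingData,
    InvalidSet,
}

/// Code point set stored as an inversion list: [start0, end0, start1, end1, ...],
/// where each pair describes a half-open range [start, end).
#[derive(Debug)]
pub struct UnicodeSet {
    inv_list: Vec<u32>,
}

impl UnicodeSet {
    /// Validates the serialized form: even length and strictly increasing boundaries.
    pub fn from_inversion_list(inv_list: Vec<u32>) -> Result<Self, Error> {
        let strictly_increasing = inv_list.windows(2).all(|w| w[0] < w[1]);
        if inv_list.len() % 2 != 0 || !strictly_increasing {
            return Err(Error::InvalidSet);
        }
        Ok(Self { inv_list })
    }

    /// A code point is in the set when an odd number of boundaries are <= it.
    pub fn contains(&self, c: char) -> bool {
        let cp = c as u32;
        let boundaries_passed = self.inv_list.partition_point(|&b| b <= cp);
        boundaries_passed % 2 == 1
    }
}

// Assumed sample data: only a subset of White_Space, hard-coded for the sketch.
const WHITESPACE_SAMPLE: &[u32] = &[0x09, 0x0E, 0x20, 0x21, 0xA0, 0xA1, 0x3000, 0x3001];

/// Stand-in for the data provider: returns the serialized set for a key.
fn load_serialized_set(key: &str) -> Option<&'static [u32]> {
    match key {
        "whitespace" => Some(WHITESPACE_SAMPLE),
        _ => None,
    }
}

/// Thin wrapper: fetch the serialized form and convert it to a `UnicodeSet`.
pub fn get_whitespace_set() -> Result<UnicodeSet, Error> {
    let data = load_serialized_set("whitespace").ok_or(Error::MissingData)?;
    UnicodeSet::from_inversion_list(data.to_vec())
}
```

A caller would then write something like `get_whitespace_set()?.contains(' ')`. The wrapper itself stays small because all validation lives in the set constructor.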

Thoughts?

@markusicu @macchiati @srl295 @EvanJP

@sffc sffc added T-core Type: Required functionality A-design Area: Architecture or design C-unicode Component: Props, sets, tries labels Jun 24, 2020
@sffc sffc added this to the 2020 Q3 milestone Jun 24, 2020
@sffc sffc added the discuss Discuss at a future ICU4X-SC meeting label Jun 25, 2020
@sffc
Member Author

sffc commented Jun 25, 2020

@kpozin suggested that we could also have an API on char, similar to toLocaleString(). It might be more convenient when getting a property for a single character.

@zbraniecki suggested looking at prior art like "is_ascii_whitespace" in the standard library.
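
One possible shape for the per-character API is an extension trait on `char`, in the spirit of `is_ascii_whitespace`. This is only a sketch: the trait name `UnicodeProperties` and the method `is_unicode_whitespace` are hypothetical, and the body simply delegates to the standard library's `char::is_whitespace` (documented as following the White_Space property) rather than to a data provider.

```rust
/// Hypothetical extension trait; the name and method are assumptions for illustration.
pub trait UnicodeProperties {
    /// Returns true if the character has the Unicode White_Space property.
    fn is_unicode_whitespace(self) -> bool;
}

impl UnicodeProperties for char {
    fn is_unicode_whitespace(self) -> bool {
        // Delegates to std for the sketch; a real implementation would
        // presumably consult data supplied by the data provider.
        self.is_whitespace()
    }
}
```

With the trait in scope, `'\u{3000}'.is_unicode_whitespace()` reads much like the existing `'\u{3000}'.is_ascii_whitespace()`, which is the convenience the single-character case is after.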

@sffc sffc removed the discuss Discuss at a future ICU4X-SC meeting label Jun 25, 2020
@sffc sffc assigned EvanJP and unassigned echeran Jul 23, 2020
@sffc sffc modified the milestones: 2020 Q3, ICU4X 0.1 Sep 11, 2020
@echeran echeran self-assigned this Oct 9, 2020
@echeran echeran modified the milestones: ICU4X 0.1, ICU4X 0.2 Oct 9, 2020
@sffc sffc unassigned EvanJP Mar 12, 2021
@sffc sffc modified the milestones: ICU4X 0.2, ICU4X 0.3 Apr 1, 2021
@sffc sffc added the S-epic Size: Major project (create smaller child issues) label Jul 21, 2021
@aethanyc
Contributor

@echeran I'm looking for an API to query the line break property value for a given code point, e.g. f(codepoint, line_break_property) -> line_break_property_value, and investigating whether it could replace the API currently used in the line breaker:

#[inline]
fn get_linebreak_property_latin1(codepoint: u8) -> u8 {
    let codepoint = codepoint as usize;
    UAX14_PROPERTY_TABLE[codepoint / 1024][codepoint & 0x3ff]
}

Is this issue tracking the implementation of such an API?
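
For discussion, the requested lookup could look roughly like the sketch below: a two-stage table with 1024-entry blocks, mirroring the `codepoint / 1024` and `codepoint & 0x3ff` split in the snippet above. The `LineBreak` enum (a tiny subset of values), `PropertyTable`, and `from_ranges` are hypothetical names introduced for illustration; this is not the CodePointTrie design.

```rust
/// Hypothetical subset of Line_Break values, for illustration only.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum LineBreak {
    Unknown,
    Alphabetic,
    Numeric,
    Space,
}

const BLOCK_BITS: u32 = 10; // 1024 code points per block
const BLOCK_SIZE: u32 = 1 << BLOCK_BITS;
const BLOCK_COUNT: u32 = 0x110000 >> BLOCK_BITS; // covers U+0000..=U+10FFFF

/// Two-stage table: `index` maps a block number to a deduplicated block.
pub struct PropertyTable {
    index: Vec<u16>,
    blocks: Vec<Vec<LineBreak>>,
}

impl PropertyTable {
    /// Builds the table from inclusive (first, last, value) ranges.
    /// Code points not covered by any range map to `LineBreak::Unknown`.
    pub fn from_ranges(ranges: &[(u32, u32, LineBreak)]) -> Self {
        let mut blocks: Vec<Vec<LineBreak>> = Vec::new();
        let mut index = Vec::with_capacity(BLOCK_COUNT as usize);
        for b in 0..BLOCK_COUNT {
            let base = b << BLOCK_BITS;
            let block: Vec<LineBreak> = (0..BLOCK_SIZE)
                .map(|i| {
                    let cp = base + i;
                    ranges
                        .iter()
                        .find(|&&(lo, hi, _)| lo <= cp && cp <= hi)
                        .map_or(LineBreak::Unknown, |&(_, _, v)| v)
                })
                .collect();
            // Reuse an identical block if one already exists.
            let pos = match blocks.iter().position(|existing| *existing == block) {
                Some(p) => p,
                None => {
                    blocks.push(block);
                    blocks.len() - 1
                }
            };
            index.push(pos as u16);
        }
        Self { index, blocks }
    }

    /// Looks up the property value for a character with two array reads.
    #[inline]
    pub fn get(&self, c: char) -> LineBreak {
        let cp = c as u32;
        let block = self.index[(cp >> BLOCK_BITS) as usize] as usize;
        self.blocks[block][(cp & (BLOCK_SIZE - 1)) as usize]
    }
}
```

Taking a `char` rather than a `u8` keeps every lookup in bounds for the whole code point range, and deduplicating blocks keeps the table small when most blocks share a single value.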

@sffc
Member Author

sffc commented Jul 28, 2021

> Is this issue tracking the implementation of such an API?

Yes.

@aethanyc
Contributor

Does this issue depend on #883?

@sffc
Member Author

sffc commented Aug 12, 2021

> Does this issue depend on #883?

In part, but the non-binary enumerated property API you are requesting above needs additional work. In particular, it cannot be done until CodePointTrie is done.

@sffc sffc modified the milestones: ICU4X 0.4, 2021 Q3 0.4 Sprint B Aug 26, 2021
@sffc sffc modified the milestones: 2021 Q3 0.4 Sprint B, ICU4X 0.4 Sep 16, 2021
@aethanyc
Contributor

Besides the line break property, #943 also needs this API to map code points to other Unicode properties such as Word_Break and Grapheme_Cluster_Break.

@sffc
Member Author

sffc commented Sep 21, 2021

I added a list of sub issues to the OP.

@sffc
Member Author

sffc commented Nov 1, 2021

All six parts of this issue are done! Closing as fixed.

@sffc sffc closed this as completed Nov 1, 2021
@sffc sffc linked a pull request Nov 1, 2021 that will close this issue
4 participants