Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Design doc of code point tries for properties #559

Merged
merged 7 commits into from
Apr 9, 2021

Conversation

echeran
Copy link
Contributor

@echeran echeran commented Mar 18, 2021

Want to do this before beginning on #508 / #132 (which should effectively summarizing our previous meetings, discussions, and conclusions).

@coveralls
Copy link

coveralls commented Mar 18, 2021

Pull Request Test Coverage Report for Build 502bb83974d5473bd996b5666b60f7eb16971871-PR-559

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 993 unchanged lines in 86 files lost coverage.
  • Overall coverage increased (+0.2%) to 73.077%

Files with Coverage Reduction New Missed Lines %
components/locale_canonicalizer/src/locale_canonicalizer.rs 1 87.5%
components/locid/macros/src/token_stream.rs 1 97.83%
components/locid/src/extensions/private/key.rs 1 88.24%
components/locid/src/extensions/transform/fields.rs 1 83.33%
components/locid/src/extensions/transform/key.rs 1 88.89%
components/locid/src/extensions/unicode/attributes.rs 1 83.33%
components/locid/src/extensions/unicode/key.rs 1 90.0%
components/locid/src/extensions/unicode/keywords.rs 1 80.0%
components/locid/src/parser/locale.rs 1 80.0%
components/locid/src/serde/langid.rs 1 88.24%
Totals Coverage Status
Change from base Build b6ed6f058a0c3b6566eac78a58f47303bf48830f: 0.2%
Covered Lines: 7182
Relevant Lines: 9828

💛 - Coveralls

Copy link
Contributor

@dminor dminor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good so far!


## Background

[Unicode Properties](https://unicode-org.github.io/icu/userguide/strings/properties.html) represent attributes of code points in the Unicode specification.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a few different types of properties referred to in that link. I think it would be helpful to have a sentence explaining why we're considering just binary and enumerated properties in the rest of the document. Are they more important? Is it because they are suitable for storing in a Code Point Trie?


Before considering the design of APIs and efficient data structures, we first have to consider the shape of the data. In the binary properties case, there are two dimensions being associated: the binary property and the code point. In enumerated properties, there are three dimensions: the enumerated property, the enumerated property value, and the code point.

The use cases, or manner of data access, inform the design(s) of APIs and data structures. For regular expression parsers (regex), we need to support a text description of a set of code points sharing a property. In this case, returning a [`UnicodeSet`](https://unicode-org.github.io/icu/userguide/strings/unicodeset.html) (a set of Unicode code points) would provide the most efficient usable data. For binary properties, the property name is enough for input. For enumerated properties, the property name and a specific property value are required to uniquely determine a set of code points. In these cases, all dimensions except the code point dimension are fixed by the input value.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The use cases, or manner of data access, inform the design(s) of APIs and data structures. For regular expression parsers (regex), we need to support a text description of a set of code points sharing a property. In this case, returning a [`UnicodeSet`](https://unicode-org.github.io/icu/userguide/strings/unicodeset.html) (a set of Unicode code points) would provide the most efficient usable data. For binary properties, the property name is enough for input. For enumerated properties, the property name and a specific property value are required to uniquely determine a set of code points. In these cases, all dimensions except the code point dimension are fixed by the input value.
The use cases, or manner of data access, inform the design of APIs and data structures. For regular expression parsers (regex), we need to support a text description of a set of code points sharing a property. In this case, returning a [`UnicodeSet`](https://unicode-org.github.io/icu/userguide/strings/unicodeset.html) (a set of Unicode code points) provides the most efficient usable data. For binary properties, the property name is enough for input. For enumerated properties, the property name and a specific property value are required to uniquely determine a set of code points. In these cases, all dimensions except the code point dimension are fixed by the input value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@codecov-io
Copy link

Codecov Report

❗ No coverage uploaded for pull request base (main@a4a8e4a). Click here to learn what that means.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##             main     #559   +/-   ##
=======================================
  Coverage        ?   74.22%           
=======================================
  Files           ?      128           
  Lines           ?     7840           
  Branches        ?        0           
=======================================
  Hits            ?     5819           
  Misses          ?     2021           
  Partials        ?        0           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a4a8e4a...0e677af. Read the comment docs.

@echeran echeran marked this pull request as ready for review March 25, 2021 22:44
@echeran echeran requested a review from a team as a code owner March 25, 2021 22:44
dminor
dminor previously approved these changes Mar 29, 2021
Copy link
Contributor

@dminor dminor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! Just a few suggestions.

docs/design/properties_code_point_trie.md Outdated Show resolved Hide resolved
docs/design/properties_code_point_trie.md Outdated Show resolved Hide resolved
docs/design/properties_code_point_trie.md Outdated Show resolved Hide resolved
docs/design/properties_code_point_trie.md Show resolved Hide resolved

### Notes on Implementation

`UnicodeSet` represents a set of Unicode code points. The combination of those 2 aspects -- Unicode code point values fill the entire integer range from 0 to 0x10FFFF, and that a set has only 2 values -- together allow for an inversion list implementation that is optimally efficient. An inversion list stores the boundaries of each range (contiguous stretch of code points) that are included in the set. This makes the size of the inversion list range from O(1) to O(n) (and oftentimes O(1)) even when the cardinality of the values logically represented is O(n). Checking for inclusion is just a matter of running binary search on the boundary values and checking if the corresponding inversion list index value is even or odd.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: please linkify inversion list to point to Wikipedia or another source to read more about inversion lists.


The use cases, or manner of data access, inform the designs of APIs and data structures. For regular expression parsers (regex), we need to support a text description of a set of code points sharing a property. In this case, returning a [`UnicodeSet`](https://unicode-org.github.io/icu/userguide/strings/unicodeset.html) (a set of Unicode code points) provides the most efficient usable data. For binary properties, the property name is enough for input. For enumerated properties, the property name and a specific property value are required to uniquely determine a set of code points. In these cases, all dimensions except the code point dimension are fixed (given as inputs).

In other cases, such as the implementation of internationalization algorithms, iteration through code points is a typical implementation strategy. During such iteration, the value of a code point property -- usually, an enumerated property -- can inform the algorithm in question. In such cases, the code point value and enumerated property name dimensions must be fixed (provided as inputs), and the return value is the remaining dimension -- the enumerated property value. To support this use case, the [`CodePointTrie`](https://sites.google.com/site/icusite/design/struct/utrie) data structure is an optimal implementation.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: optimal in what sense?


A `CodePointTrie` optimizes over a generic inversion map in different ways. One example is that for code point values in the BMP range (16 bits), the 16 bits can be split into the high-order 10 bits and the low-order 6 bits, where the low-order 6 bits can be used as an index into a table/index array. Also, `CodePointTrie`s can be created where the binary data values are serialized with 8-bit, 16-bit, and 32-bit encoding, to make lookups more efficient without making encoding conversions.

`CodePointTrie` code in ICU4C is implemented with a mutable builder, a method to convert the mutable builder to an immutable version, and code to read from the immutable version. The immutable version is stored in memory the same as it is serialized to persistent storage.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: add a link to the ICU4C documentation and/or implementation.

Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-blocking comments

docs/design/properties_code_point_trie.md Outdated Show resolved Hide resolved
docs/design/properties_code_point_trie.md Outdated Show resolved Hide resolved
docs/design/properties_code_point_trie.md Outdated Show resolved Hide resolved

#### Option 2: Implement a reader for the ICU4C `CodePointTrie` binary data directly in Rust in ICU4X

This option entails writing Rust code that can interpret the binary serialization of the `CodePointTrie` and navigate it directly. It would require also creating an "offline" step (relative to ICU4X) in which ICU4C binary data is exported as a companion package of data in the data downloads for new each ICU release.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a reference to #509

@sffc sffc added the waiting-on-author PRs waiting for action from the author for >7 days label Apr 1, 2021
Co-authored-by: Shane F. Carr <shane@unicode.org>
Co-authored-by: Dan Minor <dminor@mozilla.com>
@echeran echeran requested a review from dminor April 9, 2021 04:27
Copy link
Contributor

@dminor dminor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks for adding the extra explanations!

@echeran echeran requested a review from sffc April 9, 2021 16:45
@sffc sffc removed the waiting-on-author PRs waiting for action from the author for >7 days label Apr 9, 2021
@echeran echeran merged commit 3927561 into unicode-org:main Apr 9, 2021
@echeran echeran deleted the docs-code-point-trie branch April 9, 2021 21:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants