-
Notifications
You must be signed in to change notification settings - Fork 183
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Design doc of code point tries for properties #559
Conversation
Pull Request Test Coverage Report for Build 502bb83974d5473bd996b5666b60f7eb16971871-PR-559
💛 - Coveralls |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good so far!
|
||
## Background | ||
|
||
[Unicode Properties](https://unicode-org.github.io/icu/userguide/strings/properties.html) represent attributes of code points in the Unicode specification. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a few different types of properties referred to in that link. I think it would be helpful to have a sentence explaining why we're considering just binary and enumerated properties in the rest of the document. Are they more important? Is it because they are suitable for storing in a Code Point Trie?
|
||
Before considering the design of APIs and efficient data structures, we first have to consider the shape of the data. In the binary properties case, there are two dimensions being associated: the binary property and the code point. In enumerated properties, there are three dimensions: the enumerated property, the enumerated property value, and the code point. | ||
|
||
The use cases, or manner of data access, inform the design(s) of APIs and data structures. For regular expression parsers (regex), we need to support a text description of a set of code points sharing a property. In this case, returning a [`UnicodeSet`](https://unicode-org.github.io/icu/userguide/strings/unicodeset.html) (a set of Unicode code points) would provide the most efficient usable data. For binary properties, the property name is enough for input. For enumerated properties, the property name and a specific property value are required to uniquely determine a set of code points. In these cases, all dimensions except the code point dimension are fixed by the input value. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The use cases, or manner of data access, inform the design(s) of APIs and data structures. For regular expression parsers (regex), we need to support a text description of a set of code points sharing a property. In this case, returning a [`UnicodeSet`](https://unicode-org.github.io/icu/userguide/strings/unicodeset.html) (a set of Unicode code points) would provide the most efficient usable data. For binary properties, the property name is enough for input. For enumerated properties, the property name and a specific property value are required to uniquely determine a set of code points. In these cases, all dimensions except the code point dimension are fixed by the input value. | |
The use cases, or manner of data access, inform the design of APIs and data structures. For regular expression parsers (regex), we need to support a text description of a set of code points sharing a property. In this case, returning a [`UnicodeSet`](https://unicode-org.github.io/icu/userguide/strings/unicodeset.html) (a set of Unicode code points) provides the most efficient usable data. For binary properties, the property name is enough for input. For enumerated properties, the property name and a specific property value are required to uniquely determine a set of code points. In these cases, all dimensions except the code point dimension are fixed by the input value. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Codecov Report
@@ Coverage Diff @@
## main #559 +/- ##
=======================================
Coverage ? 74.22%
=======================================
Files ? 128
Lines ? 7840
Branches ? 0
=======================================
Hits ? 5819
Misses ? 2021
Partials ? 0 Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me! Just a few suggestions.
|
||
### Notes on Implementation | ||
|
||
`UnicodeSet` represents a set of Unicode code points. The combination of those 2 aspects -- Unicode code point values fill the entire integer range from 0 to 0x10FFFF, and that a set has only 2 values -- together allow for an inversion list implementation that is optimally efficient. An inversion list stores the boundaries of each range (contiguous stretch of code points) that are included in the set. This makes the size of the inversion list range from O(1) to O(n) (and oftentimes O(1)) even when the cardinality of the values logically represented is O(n). Checking for inclusion is just a matter of running binary search on the boundary values and checking if the corresponding inversion list index value is even or odd. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: please linkify inversion list to point to Wikipedia or another source to read more about inversion lists.
|
||
The use cases, or manner of data access, inform the designs of APIs and data structures. For regular expression parsers (regex), we need to support a text description of a set of code points sharing a property. In this case, returning a [`UnicodeSet`](https://unicode-org.github.io/icu/userguide/strings/unicodeset.html) (a set of Unicode code points) provides the most efficient usable data. For binary properties, the property name is enough for input. For enumerated properties, the property name and a specific property value are required to uniquely determine a set of code points. In these cases, all dimensions except the code point dimension are fixed (given as inputs). | ||
|
||
In other cases, such as the implementation of internationalization algorithms, iteration through code points is a typical implementation strategy. During such iteration, the value of a code point property -- usually, an enumerated property -- can inform the algorithm in question. In such cases, the code point value and enumerated property name dimensions must be fixed (provided as inputs), and the return value is the remaining dimension -- the enumerated property value. To support this use case, the [`CodePointTrie`](https://sites.google.com/site/icusite/design/struct/utrie) data structure is an optimal implementation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: optimal in what sense?
|
||
A `CodePointTrie` optimizes over a generic inversion map in different ways. One example is that for code point values in the BMP range (16 bits), the 16 bits can be split into the high-order 10 bits and the low-order 6 bits, where the low-order 6 bits can be used as an index into a table/index array. Also, `CodePointTrie`s can be created where the binary data values are serialized with 8-bit, 16-bit, and 32-bit encoding, to make lookups more efficient without making encoding conversions. | ||
|
||
`CodePointTrie` code in ICU4C is implemented with a mutable builder, a method to convert the mutable builder to an immutable version, and code to read from the immutable version. The immutable version is stored in memory the same as it is serialized to persistent storage. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: add a link to the ICU4C documentation and/or implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Non-blocking comments
|
||
#### Option 2: Implement a reader for the ICU4C `CodePointTrie` binary data directly in Rust in ICU4X | ||
|
||
This option entails writing Rust code that can interpret the binary serialization of the `CodePointTrie` and navigate it directly. It would require also creating an "offline" step (relative to ICU4X) in which ICU4C binary data is exported as a companion package of data in the data downloads for new each ICU release. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a reference to #509
Co-authored-by: Shane F. Carr <shane@unicode.org> Co-authored-by: Dan Minor <dminor@mozilla.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, thanks for adding the extra explanations!
Want to do this before beginning on #508 / #132 (which should effectively summarizing our previous meetings, discussions, and conclusions).