Design doc of code point tries for properties #559

echeran · 2021-03-18T17:05:40Z

Want to do this before beginning on #508 / #132 (which should effectively summarize our previous meetings, discussions, and conclusions).

coveralls · 2021-03-18T17:18:27Z

Pull Request Test Coverage Report for Build 502bb83974d5473bd996b5666b60f7eb16971871-PR-559

0 of 0 changed or added relevant lines in 0 files are covered.
993 unchanged lines in 86 files lost coverage.
Overall coverage increased (+0.2%) to 73.077%

Files with Coverage Reduction	New Missed Lines	%
components/locale_canonicalizer/src/locale_canonicalizer.rs	1	87.5%
components/locid/macros/src/token_stream.rs	1	97.83%
components/locid/src/extensions/private/key.rs	1	88.24%
components/locid/src/extensions/transform/fields.rs	1	83.33%
components/locid/src/extensions/transform/key.rs	1	88.89%
components/locid/src/extensions/unicode/attributes.rs	1	83.33%
components/locid/src/extensions/unicode/key.rs	1	90.0%
components/locid/src/extensions/unicode/keywords.rs	1	80.0%
components/locid/src/parser/locale.rs	1	80.0%
components/locid/src/serde/langid.rs	1	88.24%

Totals
Change from base Build b6ed6f058a0c3b6566eac78a58f47303bf48830f:	0.2%
Covered Lines:	7182
Relevant Lines:	9828

💛 - Coveralls

dminor

Looking good so far!

dminor · 2021-03-19T13:30:27Z

docs/design/properties_code_point_trie.md

+
+## Background
+
+[Unicode Properties](https://unicode-org.github.io/icu/userguide/strings/properties.html) represent attributes of code points in the Unicode specification. 


There are a few different types of properties referred to in that link. I think it would be helpful to have a sentence explaining why we're considering just binary and enumerated properties in the rest of the document. Are they more important? Is it because they are suitable for storing in a Code Point Trie?

dminor · 2021-03-19T13:32:40Z

docs/design/properties_code_point_trie.md

+
+Before considering the design of APIs and efficient data structures, we first have to consider the shape of the data. In the binary properties case, there are two dimensions being associated: the binary property and the code point. In enumerated properties, there are three dimensions: the enumerated property, the enumerated property value, and the code point.
+
+The use cases, or manner of data access, inform the design(s) of APIs and data structures. For regular expression parsers (regex), we need to support a text description of a set of code points sharing a property. In this case, returning a [`UnicodeSet`](https://unicode-org.github.io/icu/userguide/strings/unicodeset.html) (a set of Unicode code points) would provide the most efficient usable data. For binary properties, the property name is enough for input. For enumerated properties, the property name and a specific property value are required to uniquely determine a set of code points. In these cases, all dimensions except the code point dimension are fixed by the input value.


Suggested change

The use cases, or manner of data access, inform the design(s) of APIs and data structures. For regular expression parsers (regex), we need to support a text description of a set of code points sharing a property. In this case, returning a [`UnicodeSet`](https://unicode-org.github.io/icu/userguide/strings/unicodeset.html) (a set of Unicode code points) would provide the most efficient usable data. For binary properties, the property name is enough for input. For enumerated properties, the property name and a specific property value are required to uniquely determine a set of code points. In these cases, all dimensions except the code point dimension are fixed by the input value.

The use cases, or manner of data access, inform the design of APIs and data structures. For regular expression parsers (regex), we need to support a text description of a set of code points sharing a property. In this case, returning a [`UnicodeSet`](https://unicode-org.github.io/icu/userguide/strings/unicodeset.html) (a set of Unicode code points) provides the most efficient usable data. For binary properties, the property name is enough for input. For enumerated properties, the property name and a specific property value are required to uniquely determine a set of code points. In these cases, all dimensions except the code point dimension are fixed by the input value.
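To make the "fixed dimensions" in the quoted paragraph concrete, here is a minimal sketch of the regex-style query for an enumerated property. Everything in it is an assumption for illustration: the `Script` enum, the `TOY_SCRIPT_DATA` table, and `ranges_with_value` are hypothetical names, and the table covers only a handful of code points. Choosing the table fixes the property, the argument fixes the property value, and the result is the remaining code point dimension.

```rust
use std::ops::RangeInclusive;

/// Hypothetical stand-in for a few values of an enumerated property.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Script {
    Common,
    Latin,
    Greek,
}

/// Toy property data: (first code point, last code point, value),
/// sorted and non-overlapping. Not real property data.
const TOY_SCRIPT_DATA: &[(u32, u32, Script)] = &[
    (0x0030, 0x0039, Script::Common), // ASCII digits
    (0x0041, 0x005A, Script::Latin),  // A-Z
    (0x0061, 0x007A, Script::Latin),  // a-z
    (0x0391, 0x03A1, Script::Greek),  // Alpha..Rho
];

/// The table fixes the property and `wanted` fixes the value;
/// the result is the set of code points, expressed as ranges.
fn ranges_with_value(
    data: &[(u32, u32, Script)],
    wanted: Script,
) -> Vec<RangeInclusive<u32>> {
    data.iter()
        .filter(|(_, _, value)| *value == wanted)
        .map(|&(start, end, _)| start..=end)
        .collect()
}
```

Calling `ranges_with_value(TOY_SCRIPT_DATA, Script::Latin)` yields the two Latin ranges of the toy table. A binary property would need only the table, since there is no value to choose.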

codecov-io · 2021-03-25T17:12:36Z

Codecov Report

❗ No coverage uploaded for pull request base (main@a4a8e4a).
The diff coverage is n/a.

@@           Coverage Diff           @@
##             main     #559   +/-   ##
=======================================
  Coverage        ?   74.22%           
=======================================
  Files           ?      128           
  Lines           ?     7840           
  Branches        ?        0           
=======================================
  Hits            ?     5819           
  Misses          ?     2021           
  Partials        ?        0

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a4a8e4a...0e677af.

dminor

Looks good to me! Just a few suggestions.

dminor · 2021-03-29T15:25:26Z

docs/design/properties_code_point_trie.md

+
+### Notes on Implementation
+
+`UnicodeSet` represents a set of Unicode code points. The combination of those 2 aspects -- Unicode code point values fill the entire integer range from 0 to 0x10FFFF, and that a set has only 2 values -- together allow for an inversion list implementation that is optimally efficient. An inversion list stores the boundaries of each range (contiguous stretch of code points) that are included in the set. This makes the size of the inversion list range from O(1) to O(n) (and oftentimes O(1)) even when the cardinality of the values logically represented is O(n). Checking for inclusion is just a matter of running binary search on the boundary values and checking if the corresponding inversion list index value is even or odd.


suggestion: please linkify inversion list to point to Wikipedia or another source to read more about inversion lists.
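For readers unfamiliar with the structure, a minimal sketch of the lookup the quoted paragraph describes. The `InversionList` type and its methods are hypothetical names for illustration, not the ICU or ICU4X API. The list stores only the boundaries where membership flips, and a binary search plus a parity check answers the inclusion question.

```rust
/// Hypothetical minimal inversion list. `boundaries` is sorted; the entry at
/// an even index starts an included range and the entry at an odd index is
/// the first code point after that range.
struct InversionList {
    boundaries: Vec<u32>,
}

impl InversionList {
    /// Builds the list from sorted, non-adjacent inclusive ranges.
    fn from_inclusive_ranges(ranges: &[(u32, u32)]) -> Self {
        let mut boundaries = Vec::with_capacity(ranges.len() * 2);
        for &(start, end) in ranges {
            boundaries.push(start);
            boundaries.push(end + 1);
        }
        InversionList { boundaries }
    }

    /// Binary search for how many boundaries are <= `code_point`.
    /// An odd count means the last boundary passed opened a range
    /// (it sits at an even index), so the code point is in the set.
    fn contains(&self, code_point: u32) -> bool {
        let passed = self.boundaries.partition_point(|&b| b <= code_point);
        passed % 2 == 1
    }
}
```

With `InversionList::from_inclusive_ranges(&[(0x41, 0x5A), (0x61, 0x7A)])`, the list holds four integers regardless of how many code points the two ranges cover, which is the size advantage the paragraph refers to.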

dminor · 2021-03-29T15:27:26Z

docs/design/properties_code_point_trie.md

+
+The use cases, or manner of data access, inform the designs of APIs and data structures. For regular expression parsers (regex), we need to support a text description of a set of code points sharing a property. In this case, returning a [`UnicodeSet`](https://unicode-org.github.io/icu/userguide/strings/unicodeset.html) (a set of Unicode code points) provides the most efficient usable data. For binary properties, the property name is enough for input. For enumerated properties, the property name and a specific property value are required to uniquely determine a set of code points. In these cases, all dimensions except the code point dimension are fixed (given as inputs).
+
+In other cases, such as the implementation of internationalization algorithms, iteration through code points is a typical implementation strategy. During such iteration, the value of a code point property -- usually, an enumerated property -- can inform the algorithm in question. In such cases, the code point value and enumerated property name dimensions must be fixed (provided as inputs), and the return value is the remaining dimension -- the enumerated property value. To support this use case, the [`CodePointTrie`](https://sites.google.com/site/icusite/design/struct/utrie) data structure is an optimal implementation.


suggestion: optimal in what sense?
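As a point of comparison for this question, here is a sketch of the per-code-point lookup done with a generic inversion map, the baseline the design doc says a `CodePointTrie` improves on. All names (`ToyCategory`, `TOY_MAP`, `lookup`, `categories`) and the table contents are assumptions for illustration. Each lookup is a binary search, so it costs O(log n) in the number of ranges.

```rust
/// Hypothetical stand-in for an enumerated property's values.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ToyCategory {
    Other,
    Digit,
    Letter,
}

/// Toy inversion map: each entry is (first code point of a range, value).
/// A range extends until the next entry's start. Not real property data.
const TOY_MAP: &[(u32, ToyCategory)] = &[
    (0x0000, ToyCategory::Other),
    (0x0030, ToyCategory::Digit),
    (0x003A, ToyCategory::Other),
    (0x0041, ToyCategory::Letter),
    (0x005B, ToyCategory::Other),
    (0x0061, ToyCategory::Letter),
    (0x007B, ToyCategory::Other),
];

/// Code point and property (the table) are the inputs; the property value
/// is the output. Falls back to `Other` if no range covers the code point.
fn lookup(map: &[(u32, ToyCategory)], code_point: u32) -> ToyCategory {
    let passed = map.partition_point(|&(start, _)| start <= code_point);
    passed
        .checked_sub(1)
        .map(|i| map[i].1)
        .unwrap_or(ToyCategory::Other)
}

/// The iteration pattern from the quoted paragraph: walk the code points
/// of a string and fetch the property value for each one.
fn categories(map: &[(u32, ToyCategory)], text: &str) -> Vec<ToyCategory> {
    text.chars().map(|c| lookup(map, c as u32)).collect()
}
```

An algorithm that calls `lookup` once per code point of its input pays the binary search every time, which is why a structure with array-indexed lookups is attractive for this access pattern.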

dminor · 2021-03-29T15:28:09Z

docs/design/properties_code_point_trie.md

+
+A `CodePointTrie` optimizes over a generic inversion map in different ways. One example is that for code point values in the BMP range (16 bits), the 16 bits can be split into the high-order 10 bits and the low-order 6 bits, where the low-order 6 bits can be used as an index into a table/index array. Also, `CodePointTrie`s can be created where the binary data values are serialized with 8-bit, 16-bit, and 32-bit encoding, to make lookups more efficient without making encoding conversions.
+
+`CodePointTrie` code in ICU4C is implemented with a mutable builder, a method to convert the mutable builder to an immutable version, and code to read from the immutable version. The immutable version is stored in memory the same as it is serialized to persistent storage.


suggestion: add a link to the ICU4C documentation and/or implementation.
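To illustrate the 10-bit/6-bit split for BMP code points, here is a simplified two-stage table. It is a sketch under stated assumptions, not the ICU4C layout: `ToyBmpTrie`, its fields, and its builder are hypothetical, values are fixed at 8 bits, and supplementary code points are ignored. In this sketch the high-order 10 bits select an entry in the index array, that entry gives the start of a 64-value block, and the low-order 6 bits are the offset within the block.

```rust
const SHIFT: u32 = 6;
const BLOCK_LEN: usize = 1 << SHIFT; // 64 values per block
const BLOCK_MASK: u32 = (1 << SHIFT) - 1; // low-order 6 bits

/// Hypothetical two-stage lookup table for the BMP only.
struct ToyBmpTrie {
    /// 1024 entries, one per block of 64 code points: block start in `data`.
    index: Vec<u16>,
    /// Blocks of 8-bit values; identical blocks are stored once.
    data: Vec<u8>,
}

impl ToyBmpTrie {
    /// Builds the table from a per-code-point function, sharing equal blocks.
    fn build(value_of: impl Fn(u32) -> u8) -> Self {
        let mut index = Vec::with_capacity(1024);
        let mut data: Vec<u8> = Vec::new();
        for block in 0..1024u32 {
            let values: Vec<u8> = (0..BLOCK_LEN as u32)
                .map(|i| value_of((block << SHIFT) | i))
                .collect();
            // Reuse an identical block if one was already written.
            let existing = data
                .chunks(BLOCK_LEN)
                .position(|chunk| chunk == values.as_slice());
            let start = match existing {
                Some(pos) => pos * BLOCK_LEN,
                None => {
                    data.extend_from_slice(&values);
                    data.len() - BLOCK_LEN
                }
            };
            // At most 1024 blocks of 64 values, so the offset fits in u16.
            index.push(start as u16);
        }
        ToyBmpTrie { index, data }
    }

    /// Two array reads and no search: index by high bits, offset by low bits.
    fn get(&self, code_point: u16) -> u8 {
        let c = u32::from(code_point);
        let block_start = usize::from(self.index[(c >> SHIFT) as usize]);
        self.data[block_start + (c & BLOCK_MASK) as usize]
    }
}
```

Because most blocks of a typical property are uniform, block sharing keeps `data` small while `get` stays a constant-time operation, in contrast to the binary search an inversion map needs.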

sffc

Non-blocking comments

sffc · 2021-03-29T18:19:04Z

docs/design/properties_code_point_trie.md

+
+#### Option 2: Implement a reader for the ICU4C `CodePointTrie` binary data directly in Rust in ICU4X
+
+This option entails writing Rust code that can interpret the binary serialization of the `CodePointTrie` and navigate it directly. It would also require creating an "offline" step (relative to ICU4X) in which ICU4C binary data is exported as a companion package of data in the data downloads for each new ICU release.


Please add a reference to #509
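A rough sketch of what "navigating the binary serialization directly" could look like in Rust. The byte layout below is invented for illustration and is NOT the ICU4C `CodePointTrie` format; `ToyTrieReader`, `read_u16_le`, and the `TOY1` magic are all hypothetical. The point it shows is that the reader borrows the serialized bytes and looks values up in place, without building a separate in-memory structure.

```rust
/// Hypothetical serialized layout (NOT the ICU4C format):
///   bytes 0..4 : magic b"TOY1"
///   bytes 4..6 : number of index entries, u16 little-endian
///   bytes 6..8 : number of data bytes, u16 little-endian
///   then the index entries (u16 little-endian each), then the data bytes.
struct ToyTrieReader<'a> {
    index: &'a [u8],
    data: &'a [u8],
}

/// Reads a little-endian u16 at `offset`, or None if out of bounds.
fn read_u16_le(bytes: &[u8], offset: usize) -> Option<u16> {
    let pair = bytes.get(offset..offset + 2)?;
    Some(u16::from_le_bytes([pair[0], pair[1]]))
}

impl<'a> ToyTrieReader<'a> {
    /// Validates the header and borrows the index and data sections.
    fn parse(bytes: &'a [u8]) -> Option<Self> {
        if bytes.get(0..4)? != &b"TOY1"[..] {
            return None;
        }
        let index_len = usize::from(read_u16_le(bytes, 4)?);
        let data_len = usize::from(read_u16_le(bytes, 6)?);
        let index_end = 8 + index_len * 2;
        let index = bytes.get(8..index_end)?;
        let data = bytes.get(index_end..index_end + data_len)?;
        Some(ToyTrieReader { index, data })
    }

    /// Looks up a BMP code point straight from the borrowed bytes:
    /// high-order 10 bits pick the index entry, low-order 6 bits the offset.
    fn get(&self, code_point: u16) -> Option<u8> {
        let block = usize::from(code_point >> 6);
        let start = usize::from(read_u16_le(self.index, block * 2)?);
        self.data
            .get(start + usize::from(code_point & 0x3F))
            .copied()
    }
}
```

A real reader would follow the header and index structure documented for the ICU4C format and would need to account for byte order and value width, which this sketch fixes at little-endian and 8 bits.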

Co-authored-by: Shane F. Carr <shane@unicode.org> Co-authored-by: Dan Minor <dminor@mozilla.com>

dminor

Looks good, thanks for adding the extra explanations!

Beginnings of design doc (summary) of code point trie for properties

f8a417e

dminor reviewed Mar 19, 2021

Add more background info

0e677af

echeran added 2 commits March 25, 2021 15:33

Finish draft of design doc

eec647c

Apply PR feedback and formatting

fe52471

echeran marked this pull request as ready for review March 25, 2021 22:44

echeran requested a review from a team as a code owner March 25, 2021 22:44

dminor previously approved these changes Mar 29, 2021

sffc reviewed Mar 29, 2021

sffc added the waiting-on-author label (PRs waiting for action from the author for >7 days) Apr 1, 2021

Apply suggestions from code review

5f4e965

Co-authored-by: Shane F. Carr <shane@unicode.org> Co-authored-by: Dan Minor <dminor@mozilla.com>

echeran dismissed dminor’s stale review via 5f4e965 April 5, 2021 20:56

echeran added 2 commits April 5, 2021 15:31

Add more info on properties, respond to more review feedback

32bc22d

Add link to relevant GH issue

4114446

echeran requested a review from dminor April 9, 2021 04:27

dminor approved these changes Apr 9, 2021

echeran requested a review from sffc April 9, 2021 16:45

sffc removed the waiting-on-author label (PRs waiting for action from the author for >7 days) Apr 9, 2021

sffc approved these changes Apr 9, 2021

echeran merged commit 3927561 into unicode-org:main Apr 9, 2021

echeran deleted the docs-code-point-trie branch April 9, 2021 21:00
