Load ICU4C data export into ICU4X #578
Comments
CC @iainireland, @echeran, @zbraniecki, @Manishearth
I implemented the ICU4C side of this in unicode-org/icu#1741. It produces TOML files such as:
Binary property example:
# White_Space.toml
[unicode_set.data]
long_name = "White_Space"
name = "WSpace"
serialized = [
0x14,9,0xe,0x20,0x21,0x85,0x86,0xa0,0xa1,0x1680,0x1681,0x2000,0x200b,0x2028,0x202a,0x202f,
0x2030,0x205f,0x2060,0x3000,0x3001
]
ranges = [
[0x9, 0xd],
[0x20, 0x20],
[0x85, 0x85],
[0xa0, 0xa0],
[0x1680, 0x1680],
[0x2000, 0x200a],
[0x2028, 0x2029],
[0x202f, 0x202f],
[0x205f, 0x205f],
[0x3000, 0x3000],
]
Enumerated property example:
# General_Category.toml
[code_point_map.data]
long_name = "General_Category"
name = "gc"
ranges = [
[0x0, 0x1f, 15, "Control", "Cc"],
[0x20, 0x20, 12, "Space_Separator", "Zs"],
[0x21, 0x23, 23, "Other_Punctuation", "Po"],
[0x24, 0x24, 25, "Currency_Symbol", "Sc"],
# ...
[0x100000, 0x10fffd, 17, "Private_Use", "Co"],
[0x10fffe, 0x10ffff, 0, "Unassigned", "Cn"],
]
[code_point_trie.struct]
long_name = "General_Category"
name = "gc"
index = [
# ...
]
data_32 = [
# ...
]
indexLength = 3366
dataLength = 9743
highStart = 0x110000
shifted12HighStart = 0x110
type = 1
valueWidth = 1
index3NullOffset = 0x787
dataNullOffset = 0x74a
nullValue = 0x0
The next steps are as follows:
I do not need to do this, and it would be good for team education if someone else did it. I will of course be available to advise. Volunteers? If there are no volunteers, I will try to do this myself before the end of the quarter.
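To make the two representations in the White_Space example concrete: `serialized` appears to be a length followed by an inversion list (alternating inclusive starts and exclusive ends), while `ranges` lists inclusive pairs. The following is a minimal Python sketch under that reading of the data. The function names are hypothetical, and this is neither ICU4C nor ICU4X code.

```python
from bisect import bisect_right

# Values copied from the White_Space.toml example above.
SERIALIZED = [
    0x14, 9, 0xe, 0x20, 0x21, 0x85, 0x86, 0xa0, 0xa1, 0x1680, 0x1681,
    0x2000, 0x200b, 0x2028, 0x202a, 0x202f, 0x2030, 0x205f, 0x2060,
    0x3000, 0x3001,
]


def inversion_list_to_ranges(serialized):
    """Convert [length, b0, b1, ...] into inclusive [start, end] pairs.

    Assumption: serialized[0] is the number of boundaries that follow.
    """
    length, boundaries = serialized[0], serialized[1:]
    if length != len(boundaries) or length % 2 != 0:
        raise ValueError("unexpected inversion list length")
    # Boundaries alternate: inclusive start, then exclusive end.
    return [[boundaries[i], boundaries[i + 1] - 1] for i in range(0, length, 2)]


def contains(serialized, code_point):
    """Membership test by binary search over the boundaries."""
    boundaries = serialized[1:]
    # An odd count of boundaries <= code_point means we are inside a range.
    return bisect_right(boundaries, code_point) % 2 == 1


ranges = inversion_list_to_ranges(SERIALIZED)
```

Under this interpretation, `ranges` reproduces the ten `[start, end]` pairs listed in the TOML file, which suggests the two fields carry the same set in two encodings.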
This looks really good and useful! Ashwini will need this for her irregexp work. Edit: specifically, Ashwini will need the
Closing this issue as a duplicate of #148
After the tracking issue #509 is finished, we should pull the new ICU4C data export file into ICU4X. The data should live alongside the CLDR JSON data in the testdata project. It should leverage the same download mechanism (#414). This should be used as a data source both for the CodePointTries and for ppucd.txt (split from #576).