-
-
Notifications
You must be signed in to change notification settings - Fork 767
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ICU-21545 Add icuwriteuprops tool #1741
Conversation
This PR adds a new tool, which I named "upropdump", that dumps TOML files of binary Unicode properties. Reviewers:
I can run the tool like this: $ LD_LIBRARY_PATH=lib ./bin/upropdump whitespace and I get whitespace.toml created in return: # Copyright (C) 2021 and later: Unicode, Inc. and others.
# License & terms of use: http://www.unicode.org/copyright.html
#
# file name: whitespace
#
# machine-generated by: upropdump.cpp
[unicode_set.data]
name = "whitespace"
serialized = [
0x14,9,0xe,0x20,0x21,0x85,0x86,0xa0,0xa1,0x1680,0x1681,0x2000,0x200b,0x2028,0x202a,0x202f,
0x2030,0x205f,0x2060,0x3000,0x3001
]
ranges = [
[0x9, 0xd],
[0x20, 0x20],
[0x85, 0x85],
[0xa0, 0xa0],
[0x1680, 0x1680],
[0x2000, 0x200a],
[0x2028, 0x2029],
[0x202f, 0x202f],
[0x205f, 0x205f],
[0x3000, 0x3000],
] |
Just a quick question; the ranges field is just a different format for the serialized field, right? |
Correct. The two fields are intended to contain the same information in two different formats. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM for the usefulness of the data for my purposes
const UCPMap* umap = u_getIntPropertyMap(uproperty, status); | ||
handleError(status, fullPropName); | ||
|
||
fputs("[code_point_map.data]\n", f); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: I think "code point map" is the interface, while code point trie and inversion map are specific concrete implementations / representations of that interface. Since this header [code_point_map.data]
is a sibling to the header [code_point_trie.struct]
, I think it should be renamed to [inversion_map.data]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would use [binary_property.data]
and [enum_property.data]
-- naming it for what it is rather than what data structure we use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed the schema a bit. Now we have:
[[enum_property]]
long_name = "General_Category"
short_name = "gc"
# Code points `a` through `b` have value `v`, corresponding to `name`.
ranges = [
{a=0x0, b=0x1f, v=15, name="Cc"},
{a=0x20, b=0x20, v=12, name="Zs"},
]
[enum_property.code_point_trie]
index = []
data_8 = []
indexLength = 3365
# ...
This becomes a file with a single property, enum_property
, which is an array of objects with four fields: short_name, long_name, ranges, and code_point_trie.
This seems mildly useful because it could mean you could concatenate these files together and get a well-formed array of properties out of it.
CC @iainireland
Thanks. The vcxproj files look fine to me. Thanks for changing the GUIDs.
FWIW, I can build and run the tool on Windows. 👍 However, the output I get in the
#
# file name: whitespace
#
# machine-generated by: upropdump.cpp
[unicode_set.data]
long_name = "White_Space"
name = "WSpace"
serialized = [
0x14,9,0xe,0x20,0x21,0x85,0x86,0xa0,0xa1,0x1680,0x1681,0x2000,0x200b,0x2028,0x202a,0x202f,
0x2030,0x205f,0x2060,0x3000,0x3001
]
ranges = [
[0x9, 0xd],
[0x20, 0x20],
[0x85, 0x85],
[0xa0, 0xa0],
[0x1680, 0x1680],
[0x2000, 0x200a],
[0x2028, 0x2029],
[0x202f, 0x202f],
[0x205f, 0x205f],
[0x3000, 0x3000],
] |
Ah, sorry I missed reviewing the changes in the I think we'll want/need to add the following changes to the solution file: diff --git a/icu4c/source/allinone/allinone.sln b/icu4c/source/allinone/allinone.sln
index 40e8fe6a74..7e8eef1a93 100644
--- a/icu4c/source/allinone/allinone.sln
+++ b/icu4c/source/allinone/allinone.sln
@@ -476,6 +476,22 @@ Global
{4C8454FE-81D3-4CA3-9927-29BA96F03DAC}.Release|Win32.Build.0 = Release|Win32
{4C8454FE-81D3-4CA3-9927-29BA96F03DAC}.Release|x64.ActiveCfg = Release|x64
{4C8454FE-81D3-4CA3-9927-29BA96F03DAC}.Release|x64.Build.0 = Release|x64
+ {C5185F6D-BC0A-4DF7-A63C-B107D1C9C82F}.Debug|ARM.ActiveCfg = Debug|ARM
+ {C5185F6D-BC0A-4DF7-A63C-B107D1C9C82F}.Debug|ARM.Build.0 = Debug|ARM
+ {C5185F6D-BC0A-4DF7-A63C-B107D1C9C82F}.Debug|ARM64.ActiveCfg = Debug|ARM64
+ {C5185F6D-BC0A-4DF7-A63C-B107D1C9C82F}.Debug|ARM64.Build.0 = Debug|ARM64
+ {C5185F6D-BC0A-4DF7-A63C-B107D1C9C82F}.Debug|Win32.ActiveCfg = Debug|Win32
+ {C5185F6D-BC0A-4DF7-A63C-B107D1C9C82F}.Debug|Win32.Build.0 = Debug|Win32
+ {C5185F6D-BC0A-4DF7-A63C-B107D1C9C82F}.Debug|x64.ActiveCfg = Debug|x64
+ {C5185F6D-BC0A-4DF7-A63C-B107D1C9C82F}.Debug|x64.Build.0 = Debug|x64
+ {C5185F6D-BC0A-4DF7-A63C-B107D1C9C82F}.Release|ARM.ActiveCfg = Release|ARM
+ {C5185F6D-BC0A-4DF7-A63C-B107D1C9C82F}.Release|ARM.Build.0 = Release|ARM
+ {C5185F6D-BC0A-4DF7-A63C-B107D1C9C82F}.Release|ARM64.ActiveCfg = Release|ARM64
+ {C5185F6D-BC0A-4DF7-A63C-B107D1C9C82F}.Release|ARM64.Build.0 = Release|ARM64
+ {C5185F6D-BC0A-4DF7-A63C-B107D1C9C82F}.Release|Win32.ActiveCfg = Release|Win32
+ {C5185F6D-BC0A-4DF7-A63C-B107D1C9C82F}.Release|Win32.Build.0 = Release|Win32
+ {C5185F6D-BC0A-4DF7-A63C-B107D1C9C82F}.Release|x64.ActiveCfg = Release|x64
+ {C5185F6D-BC0A-4DF7-A63C-B107D1C9C82F}.Release|x64.Build.0 = Release|x64
{203EC78A-0531-43F0-A636-285439BDE025}.Debug|ARM.ActiveCfg = Debug|ARM
{203EC78A-0531-43F0-A636-285439BDE025}.Debug|ARM.Build.0 = Debug|ARM
{203EC78A-0531-43F0-A636-285439BDE025}.Debug|ARM64.ActiveCfg = Debug|ARM64 |
const UCPMap* umap = u_getIntPropertyMap(uproperty, status); | ||
handleError(status, fullPropName); | ||
|
||
fputs("[code_point_map.data]\n", f); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would use [binary_property.data]
and [enum_property.data]
-- naming it for what it is rather than what data structure we use.
Yes, in principle. The "serialized" field in the current version contains the uint16_t-serialized form that ICU4C uses in some places. The first value has multiple fields & flags. Any range limit above 0xffff is split into two: 0x103456 --> 0x10, 0x3456. See https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classicu_1_1UnicodeSet.html#a9d26697666c30ec74d5955ac735d04d7 |
This comment has been minimized.
This comment has been minimized.
What additional work is needed to support Script_Extensions? |
Good question. We will likely need to add another code path for Script_Extensions that makes use of the specialized @markusicu -- what is a good data structure to use to export Script_Extensions? I may prefer to do this in a follow-up PR. |
Is there an icu4x issue, or email thread, where we can discuss this?
yes! |
(Ignore if you don't have/add On |
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested the new code against emoji and it works as expected. Thanks for catching that issue!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changes lgtm pse squash
fd9bcb1
to
619b4a3
Compare
I didn't see the helpful "force-pushed with no changes" notification (nor one like "force-pushed, look at diffs here"). Is that bot down? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm tnx -- assuming that the squash didn't change anything (hard to tell)
Now that PR #1848 is merged you might want to run the tool again and look at the output for the emoji properties of strings :-) |
Hmm, I don't know why the bot didn't post an update as it usually does. However, when I performed the squash, I used the web UI, and I did a sanity check, and all looks OK.
Cool! I would like to merge this PR first, to get it in, and if any changes are needed for emoji properties of strings, I'll cover them in a follow-up. |
Checklist