Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ICU-21545 Add icuwriteuprops tool #1741

Merged
merged 1 commit into from
Sep 8, 2021
Merged

Conversation

sffc
Copy link
Member

@sffc sffc commented Jun 12, 2021

Checklist
  • Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-21545
  • Required: The PR title must be prefixed with a JIRA Issue number.
  • Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
  • Required: Each commit message must be prefixed with a JIRA Issue number.
  • Issue accepted (done by Technical Committee after discussion)
  • Tests included, if applicable
  • API docs and/or User Guide docs changed or added, if applicable

@sffc sffc requested review from markusicu, echeran and jefgen June 12, 2021 08:04
@sffc sffc marked this pull request as ready for review June 12, 2021 08:04
@sffc
Copy link
Member Author

sffc commented Jun 12, 2021

This PR adds a new tool, which I named "upropdump", that dumps TOML files of binary Unicode properties.

Reviewers:

  • @markusicu for the correctness of the tooling
  • @echeran for the usefulness of information the tool produces
  • @jefgen to let me know if I did the vcxproj files correctly (note: I copied the ones from makeconv, and then generated random GUIDs where it looked like I needed them)

I can run the tool like this:

$ LD_LIBRARY_PATH=lib ./bin/upropdump whitespace

and I get whitespace.toml created in return:

# Copyright (C) 2021 and later: Unicode, Inc. and others.
# License & terms of use: http://www.unicode.org/copyright.html
#
# file name: whitespace
#
# machine-generated by: upropdump.cpp

[unicode_set.data]
name = "whitespace"
serialized = [
  0x14,9,0xe,0x20,0x21,0x85,0x86,0xa0,0xa1,0x1680,0x1681,0x2000,0x200b,0x2028,0x202a,0x202f,
  0x2030,0x205f,0x2060,0x3000,0x3001
]
ranges = [
  [0x9, 0xd],
  [0x20, 0x20],
  [0x85, 0x85],
  [0xa0, 0xa0],
  [0x1680, 0x1680],
  [0x2000, 0x200a],
  [0x2028, 0x2029],
  [0x202f, 0x202f],
  [0x205f, 0x205f],
  [0x3000, 0x3000],
]

@macchiati
Copy link
Member

Just a quick question; the ranges field is just a different format for the serialized field, right?

@sffc
Copy link
Member Author

sffc commented Jun 15, 2021

Just a quick question; the ranges field is just a different format for the serialized field, right?

Correct. The two fields are intended to contain the same information in two different formats.

echeran
echeran previously approved these changes Jun 16, 2021
Copy link
Contributor

@echeran echeran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for the usefulness of the data for my purposes

icu4c/source/tools/toolutil/writesrc.cpp Outdated Show resolved Hide resolved
const UCPMap* umap = u_getIntPropertyMap(uproperty, status);
handleError(status, fullPropName);

fputs("[code_point_map.data]\n", f);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: I think "code point map" is the interface, while code point trie and inversion map are specific concrete implementations / representations of that interface. Since this header [code_point_map.data] is a sibling to the header [code_point_trie.struct], I think it should be renamed to [inversion_map.data]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would use [binary_property.data] and [enum_property.data] -- naming it for what it is rather than what data structure we use.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the schema a bit. Now we have:

[[enum_property]]
long_name = "General_Category"
short_name = "gc"
# Code points `a` through `b` have value `v`, corresponding to `name`.
ranges = [
  {a=0x0, b=0x1f, v=15, name="Cc"},
  {a=0x20, b=0x20, v=12, name="Zs"},
]

[enum_property.code_point_trie]
index = []
data_8 = []
indexLength = 3365
# ...

This becomes a file with a single property, enum_property, which is an array of objects with four fields: short_name, long_name, ranges, and code_point_trie.

This seems mildly useful because it could mean you could concatenate these files together and get a well-formed array of properties out of it.

CC @iainireland

06859bf

@jefgen
Copy link
Member

jefgen commented Jun 16, 2021

  • @jefgen to let me know if I did the vcxproj files correctly (note: I copied the ones from makeconv, and then generated random GUIDs where it looked like I needed them)

Thanks. The vcxproj files look fine to me. Thanks for changing the GUIDs.

I can run the tool like this:

$ LD_LIBRARY_PATH=lib ./bin/upropdump whitespace

and I get whitespace.toml created in return:

# Copyright (C) 2021 and later: Unicode, Inc. and others.
# License & terms of use: http://www.unicode.org/copyright.html
#
# file name: whitespace
#
# machine-generated by: upropdump.cpp

[unicode_set.data]
name = "whitespace"
serialized = [
  0x14,9,0xe,0x20,0x21,0x85,0x86,0xa0,0xa1,0x1680,0x1681,0x2000,0x200b,0x2028,0x202a,0x202f,
  0x2030,0x205f,0x2060,0x3000,0x3001
]
ranges = [
  [0x9, 0xd],
  [0x20, 0x20],
  [0x85, 0x85],
  [0xa0, 0xa0],
  [0x1680, 0x1680],
  [0x2000, 0x200a],
  [0x2028, 0x2029],
  [0x202f, 0x202f],
  [0x205f, 0x205f],
  [0x3000, 0x3000],
]

FWIW, I can build and run the tool on Windows. 👍

However, the output I get in the whitespace.toml file doesn't exactly match what you got though (the copyright is missing, and names are different).

C:\icu4c\bin64>upropdump.exe whitespace
#
# file name: whitespace
#
# machine-generated by: upropdump.cpp

[unicode_set.data]
long_name = "White_Space"
name = "WSpace"
serialized = [
  0x14,9,0xe,0x20,0x21,0x85,0x86,0xa0,0xa1,0x1680,0x1681,0x2000,0x200b,0x2028,0x202a,0x202f,
  0x2030,0x205f,0x2060,0x3000,0x3001
]
ranges = [
  [0x9, 0xd],
  [0x20, 0x20],
  [0x85, 0x85],
  [0xa0, 0xa0],
  [0x1680, 0x1680],
  [0x2000, 0x200a],
  [0x2028, 0x2029],
  [0x202f, 0x202f],
  [0x205f, 0x205f],
  [0x3000, 0x3000],
]

@jefgen
Copy link
Member

jefgen commented Jun 16, 2021

Ah, sorry I missed reviewing the changes in the allinone.sln file...

I think we'll want/need to add the following changes to the solution file:

diff --git a/icu4c/source/allinone/allinone.sln b/icu4c/source/allinone/allinone.sln
index 40e8fe6a74..7e8eef1a93 100644
--- a/icu4c/source/allinone/allinone.sln
+++ b/icu4c/source/allinone/allinone.sln
@@ -476,6 +476,22 @@ Global
                {4C8454FE-81D3-4CA3-9927-29BA96F03DAC}.Release|Win32.Build.0 = Release|Win32
                {4C8454FE-81D3-4CA3-9927-29BA96F03DAC}.Release|x64.ActiveCfg = Release|x64
                {4C8454FE-81D3-4CA3-9927-29BA96F03DAC}.Release|x64.Build.0 = Release|x64
+               {C5185F6D-BC0A-4DF7-A63C-B107D1C9C82F}.Debug|ARM.ActiveCfg = Debug|ARM
+               {C5185F6D-BC0A-4DF7-A63C-B107D1C9C82F}.Debug|ARM.Build.0 = Debug|ARM
+               {C5185F6D-BC0A-4DF7-A63C-B107D1C9C82F}.Debug|ARM64.ActiveCfg = Debug|ARM64
+               {C5185F6D-BC0A-4DF7-A63C-B107D1C9C82F}.Debug|ARM64.Build.0 = Debug|ARM64
+               {C5185F6D-BC0A-4DF7-A63C-B107D1C9C82F}.Debug|Win32.ActiveCfg = Debug|Win32
+               {C5185F6D-BC0A-4DF7-A63C-B107D1C9C82F}.Debug|Win32.Build.0 = Debug|Win32
+               {C5185F6D-BC0A-4DF7-A63C-B107D1C9C82F}.Debug|x64.ActiveCfg = Debug|x64
+               {C5185F6D-BC0A-4DF7-A63C-B107D1C9C82F}.Debug|x64.Build.0 = Debug|x64
+               {C5185F6D-BC0A-4DF7-A63C-B107D1C9C82F}.Release|ARM.ActiveCfg = Release|ARM
+               {C5185F6D-BC0A-4DF7-A63C-B107D1C9C82F}.Release|ARM.Build.0 = Release|ARM
+               {C5185F6D-BC0A-4DF7-A63C-B107D1C9C82F}.Release|ARM64.ActiveCfg = Release|ARM64
+               {C5185F6D-BC0A-4DF7-A63C-B107D1C9C82F}.Release|ARM64.Build.0 = Release|ARM64
+               {C5185F6D-BC0A-4DF7-A63C-B107D1C9C82F}.Release|Win32.ActiveCfg = Release|Win32
+               {C5185F6D-BC0A-4DF7-A63C-B107D1C9C82F}.Release|Win32.Build.0 = Release|Win32
+               {C5185F6D-BC0A-4DF7-A63C-B107D1C9C82F}.Release|x64.ActiveCfg = Release|x64
+               {C5185F6D-BC0A-4DF7-A63C-B107D1C9C82F}.Release|x64.Build.0 = Release|x64
                {203EC78A-0531-43F0-A636-285439BDE025}.Debug|ARM.ActiveCfg = Debug|ARM
                {203EC78A-0531-43F0-A636-285439BDE025}.Debug|ARM.Build.0 = Debug|ARM
                {203EC78A-0531-43F0-A636-285439BDE025}.Debug|ARM64.ActiveCfg = Debug|ARM64

@markusicu markusicu self-assigned this Jun 16, 2021
.gitignore Outdated Show resolved Hide resolved
icu4c/source/tools/toolutil/writesrc.cpp Show resolved Hide resolved
icu4c/source/tools/toolutil/writesrc.cpp Outdated Show resolved Hide resolved
icu4c/source/tools/toolutil/writesrc.cpp Outdated Show resolved Hide resolved
icu4c/source/tools/toolutil/writesrc.cpp Outdated Show resolved Hide resolved
icu4c/source/tools/upropdump/upropdump.cpp Outdated Show resolved Hide resolved
icu4c/source/tools/upropdump/upropdump.cpp Outdated Show resolved Hide resolved
const UCPMap* umap = u_getIntPropertyMap(uproperty, status);
handleError(status, fullPropName);

fputs("[code_point_map.data]\n", f);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would use [binary_property.data] and [enum_property.data] -- naming it for what it is rather than what data structure we use.

icu4c/source/tools/toolutil/writesrc.cpp Outdated Show resolved Hide resolved
icu4c/source/tools/upropdump/upropdump.cpp Outdated Show resolved Hide resolved
@markusicu
Copy link
Member

Just a quick question; the ranges field is just a different format for the serialized field, right?

Correct. The two fields are intended to contain the same information in two different formats.

Yes, in principle. The "serialized" field in the current version contains the uint16_t-serialized form that ICU4C uses in some places. The first value has multiple fields & flags. Any range limit above 0xffff is split into two: 0x103456 --> 0x10, 0x3456. See https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classicu_1_1UnicodeSet.html#a9d26697666c30ec74d5955ac735d04d7

@sffc

This comment has been minimized.

@iainireland
Copy link
Contributor

What additional work is needed to support Script_Extensions?

@sffc
Copy link
Member Author

sffc commented Aug 30, 2021

What additional work is needed to support Script_Extensions?

Good question. We will likely need to add another code path for Script_Extensions that makes use of the specialized uscript_getScriptExtensions function.

@markusicu -- what is a good data structure to use to export Script_Extensions?

I may prefer to do this in a follow-up PR.

@markusicu
Copy link
Member

What additional work is needed to support Script_Extensions?

Good question. We will likely need to add another code path for Script_Extensions that makes use of the specialized uscript_getScriptExtensions function.

@markusicu -- what is a good data structure to use to export Script_Extensions?

Is there an icu4x issue, or email thread, where we can discuss this?

I may prefer to do this in a follow-up PR.

yes!

@markusicu markusicu changed the title ICU-21545 Add upropdump tool ICU-21545 Add icuwriteuprops tool Aug 30, 2021
@markusicu
Copy link
Member

markusicu commented Aug 30, 2021

(Ignore if you don't have/add %lu here...)

On %lu vs. cast to long: fprintf() assumes that arguments are put on the stack in accordance with the string format. It's not type-safe at runtime. chars and shorts are always cast to int before they are pushed onto the stack, but I don't think that arguments necessarily go up to long. So while it may happen to work, I don't think it's quite portable to use %lu when the type may be a narrower int. Please cast to long just to be safe.

icu4c/source/tools/icuwriteuprops/Makefile.in Outdated Show resolved Hide resolved
icu4c/source/tools/icuwriteuprops/icuwriteuprops.1.in Outdated Show resolved Hide resolved
icu4c/source/tools/toolutil/writesrc.cpp Outdated Show resolved Hide resolved
icu4c/source/tools/toolutil/writesrc.cpp Outdated Show resolved Hide resolved
icu4c/source/tools/toolutil/writesrc.cpp Show resolved Hide resolved
icu4c/source/tools/icuwriteuprops/icuwriteuprops.cpp Outdated Show resolved Hide resolved
icu4c/source/tools/icuwriteuprops/icuwriteuprops.cpp Outdated Show resolved Hide resolved
icu4c/source/tools/icuwriteuprops/icuwriteuprops.cpp Outdated Show resolved Hide resolved
icu4c/source/tools/icuwriteuprops/icuwriteuprops.cpp Outdated Show resolved Hide resolved
@sffc
Copy link
Member Author

sffc commented Aug 31, 2021

What additional work is needed to support Script_Extensions?

Good question. We will likely need to add another code path for Script_Extensions that makes use of the specialized uscript_getScriptExtensions function.
@markusicu -- what is a good data structure to use to export Script_Extensions?

Is there an icu4x issue, or email thread, where we can discuss this?

https://unicode-org.atlassian.net/browse/ICU-21545

Copy link
Member Author

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested the new code against emoji and it works as expected. Thanks for catching that issue!

icu4c/source/tools/icuwriteuprops/icuwriteuprops.cpp Outdated Show resolved Hide resolved
icu4c/source/tools/toolutil/writesrc.cpp Outdated Show resolved Hide resolved
icu4c/source/tools/icuwriteuprops/icuwriteuprops.cpp Outdated Show resolved Hide resolved
icu4c/source/tools/toolutil/writesrc.cpp Show resolved Hide resolved
icu4c/source/tools/icuwriteuprops/Makefile.in Outdated Show resolved Hide resolved
icu4c/source/tools/toolutil/writesrc.h Outdated Show resolved Hide resolved
@sffc sffc requested a review from markusicu September 2, 2021 01:27
Copy link
Member

@markusicu markusicu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changes lgtm pse squash

@sffc sffc force-pushed the ICU-21545-propdump branch from fd9bcb1 to 619b4a3 Compare September 7, 2021 22:16
@sffc sffc requested a review from markusicu September 7, 2021 22:16
@markusicu
Copy link
Member

I didn't see the helpful "force-pushed with no changes" notification (nor one like "force-pushed, look at diffs here"). Is that bot down?

Copy link
Member

@markusicu markusicu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm tnx -- assuming that the squash didn't change anything (hard to tell)

@markusicu
Copy link
Member

Now that PR #1848 is merged you might want to run the tool again and look at the output for the emoji properties of strings :-)

@sffc
Copy link
Member Author

sffc commented Sep 8, 2021

Hmm, I don't know why the bot didn't post an update as it usually does. However, when I performed the squash, I used the web UI, and I did a sanity check, and all looks OK.

Now that PR #1848 is merged you might want to run the tool again and look at the output for the emoji properties of strings :-)

Cool! I would like to merge this PR first, to get it in, and if any changes are needed for emoji properties of strings, I'll cover them in a follow-up.

@sffc sffc merged commit 92db251 into unicode-org:main Sep 8, 2021
@sffc sffc deleted the ICU-21545-propdump branch September 8, 2021 20:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants