ICU-21545 Add icuwriteuprops tool #1741

sffc · 2021-06-12T06:46:25Z

Checklist

Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-21545
Required: The PR title must be prefixed with a JIRA Issue number.
Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
Required: Each commit message must be prefixed with a JIRA Issue number.
Issue accepted (done by Technical Committee after discussion)
Tests included, if applicable
API docs and/or User Guide docs changed or added, if applicable

sffc · 2021-06-12T08:08:51Z

This PR adds a new tool, which I named "upropdump", that dumps TOML files of binary Unicode properties.

Reviewers:

@markusicu for the correctness of the tooling
@echeran for the usefulness of information the tool produces
@jefgen to let me know if I did the vcxproj files correctly (note: I copied the ones from makeconv, and then generated random GUIDs where it looked like I needed them)

I can run the tool like this:

$ LD_LIBRARY_PATH=lib ./bin/upropdump whitespace

and I get whitespace.toml created in return:

# Copyright (C) 2021 and later: Unicode, Inc. and others.
# License & terms of use: http://www.unicode.org/copyright.html
#
# file name: whitespace
#
# machine-generated by: upropdump.cpp

[unicode_set.data]
name = "whitespace"
serialized = [
  0x14,9,0xe,0x20,0x21,0x85,0x86,0xa0,0xa1,0x1680,0x1681,0x2000,0x200b,0x2028,0x202a,0x202f,
  0x2030,0x205f,0x2060,0x3000,0x3001
]
ranges = [
  [0x9, 0xd],
  [0x20, 0x20],
  [0x85, 0x85],
  [0xa0, 0xa0],
  [0x1680, 0x1680],
  [0x2000, 0x200a],
  [0x2028, 0x2029],
  [0x202f, 0x202f],
  [0x205f, 0x205f],
  [0x3000, 0x3000],
]

macchiati · 2021-06-15T17:02:54Z

Just a quick question; the ranges field is just a different format for the serialized field, right?

sffc · 2021-06-15T17:05:48Z

> Just a quick question; the ranges field is just a different format for the serialized field, right?

Correct. The two fields are intended to contain the same information in two different formats.

echeran

LGTM for the usefulness of the data for my purposes

icu4c/source/tools/toolutil/writesrc.cpp

echeran · 2021-06-16T17:19:30Z

icu4c/source/tools/upropdump/upropdump.cpp

+    const UCPMap* umap = u_getIntPropertyMap(uproperty, status);
+    handleError(status, fullPropName);
+
+    fputs("[code_point_map.data]\n", f);


suggestion: I think "code point map" is the interface, while code point trie and inversion map are specific concrete implementations / representations of that interface. Since this header [code_point_map.data] is a sibling to the header [code_point_trie.struct], I think it should be renamed to [inversion_map.data]

markusicu · 2021-06-22T00:36:24Z

I would use [binary_property.data] and [enum_property.data] -- naming it for what it is rather than what data structure we use.

sffc · 2021-08-19

I changed the schema a bit. Now we have:

[[enum_property]]
long_name = "General_Category"
short_name = "gc"
# Code points `a` through `b` have value `v`, corresponding to `name`.
ranges = [
  {a=0x0, b=0x1f, v=15, name="Cc"},
  {a=0x20, b=0x20, v=12, name="Zs"},
]
[enum_property.code_point_trie]
index = []
data_8 = []
indexLength = 3365
# ...

This becomes a file with a single top-level key, enum_property, which is an array of objects with four fields: short_name, long_name, ranges, and code_point_trie.

This seems mildly useful because you could concatenate these files together and get a well-formed array of properties out of it.
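To illustrate the `ranges` entries in the schema above, here is a minimal, self-contained sketch of how per-code-point values might be coalesced into `{a, b, v, name}` rows. It is not the tool's actual code: `ValueRange`, `coalesceRanges`, and `formatRange` are hypothetical names, and the input is a plain vector indexed by code point rather than ICU's `UCPMap`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical row type: code points a..b (inclusive) all have value v.
struct ValueRange {
    uint32_t a;
    uint32_t b;
    uint32_t v;
};

// Groups consecutive code points with equal values into ranges.
// values[cp] is the property value of code point cp (illustrative input only).
std::vector<ValueRange> coalesceRanges(const std::vector<uint32_t>& values) {
    std::vector<ValueRange> ranges;
    for (size_t cp = 0; cp < values.size(); ++cp) {
        if (!ranges.empty() && ranges.back().v == values[cp]) {
            // Same value as the current range: extend its end.
            ranges.back().b = static_cast<uint32_t>(cp);
        } else {
            // Value changed: start a new single-code-point range.
            ranges.push_back({static_cast<uint32_t>(cp),
                              static_cast<uint32_t>(cp), values[cp]});
        }
    }
    assert(values.empty() == ranges.empty());
    return ranges;
}

// Formats one row in the inline-table style shown in the schema above.
std::string formatRange(const ValueRange& r, const std::string& name) {
    char buf[64];
    int n = std::snprintf(buf, sizeof(buf), "  {a=0x%lx, b=0x%lx, v=%lu, name=\"",
                          static_cast<unsigned long>(r.a),
                          static_cast<unsigned long>(r.b),
                          static_cast<unsigned long>(r.v));
    assert(n > 0 && static_cast<size_t>(n) < sizeof(buf));
    return std::string(buf, static_cast<size_t>(n)) + name + "\"},";
}
```

In a real exporter the value names would presumably come from ICU's property-name APIs; here the caller passes them in to keep the sketch free of dependencies.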

CC @iainireland

06859bf

jefgen · 2021-06-16T18:11:00Z

> @jefgen to let me know if I did the vcxproj files correctly (note: I copied the ones from makeconv, and then generated random GUIDs where it looked like I needed them)

Thanks. The vcxproj files look fine to me. Thanks for changing the GUIDs.

FWIW, I can build and run the tool on Windows. 👍

However, the output I get in the whitespace.toml file doesn't exactly match what you got (the copyright is missing, and the names are different).

C:\icu4c\bin64>upropdump.exe whitespace

#
# file name: whitespace
#
# machine-generated by: upropdump.cpp

[unicode_set.data]
long_name = "White_Space"
name = "WSpace"
serialized = [
  0x14,9,0xe,0x20,0x21,0x85,0x86,0xa0,0xa1,0x1680,0x1681,0x2000,0x200b,0x2028,0x202a,0x202f,
  0x2030,0x205f,0x2060,0x3000,0x3001
]
ranges = [
  [0x9, 0xd],
  [0x20, 0x20],
  [0x85, 0x85],
  [0xa0, 0xa0],
  [0x1680, 0x1680],
  [0x2000, 0x200a],
  [0x2028, 0x2029],
  [0x202f, 0x202f],
  [0x205f, 0x205f],
  [0x3000, 0x3000],
]

jefgen · 2021-06-16T18:15:32Z

Ah, sorry I missed reviewing the changes in the allinone.sln file...

I think we'll want/need to add the following changes to the solution file:

diff --git a/icu4c/source/allinone/allinone.sln b/icu4c/source/allinone/allinone.sln
index 40e8fe6a74..7e8eef1a93 100644
--- a/icu4c/source/allinone/allinone.sln
+++ b/icu4c/source/allinone/allinone.sln
@@ -476,6 +476,22 @@ Global
                {4C8454FE-81D3-4CA3-9927-29BA96F03DAC}.Release|Win32.Build.0 = Release|Win32
                {4C8454FE-81D3-4CA3-9927-29BA96F03DAC}.Release|x64.ActiveCfg = Release|x64
                {4C8454FE-81D3-4CA3-9927-29BA96F03DAC}.Release|x64.Build.0 = Release|x64
+               {C5185F6D-BC0A-4DF7-A63C-B107D1C9C82F}.Debug|ARM.ActiveCfg = Debug|ARM
+               {C5185F6D-BC0A-4DF7-A63C-B107D1C9C82F}.Debug|ARM.Build.0 = Debug|ARM
+               {C5185F6D-BC0A-4DF7-A63C-B107D1C9C82F}.Debug|ARM64.ActiveCfg = Debug|ARM64
+               {C5185F6D-BC0A-4DF7-A63C-B107D1C9C82F}.Debug|ARM64.Build.0 = Debug|ARM64
+               {C5185F6D-BC0A-4DF7-A63C-B107D1C9C82F}.Debug|Win32.ActiveCfg = Debug|Win32
+               {C5185F6D-BC0A-4DF7-A63C-B107D1C9C82F}.Debug|Win32.Build.0 = Debug|Win32
+               {C5185F6D-BC0A-4DF7-A63C-B107D1C9C82F}.Debug|x64.ActiveCfg = Debug|x64
+               {C5185F6D-BC0A-4DF7-A63C-B107D1C9C82F}.Debug|x64.Build.0 = Debug|x64
+               {C5185F6D-BC0A-4DF7-A63C-B107D1C9C82F}.Release|ARM.ActiveCfg = Release|ARM
+               {C5185F6D-BC0A-4DF7-A63C-B107D1C9C82F}.Release|ARM.Build.0 = Release|ARM
+               {C5185F6D-BC0A-4DF7-A63C-B107D1C9C82F}.Release|ARM64.ActiveCfg = Release|ARM64
+               {C5185F6D-BC0A-4DF7-A63C-B107D1C9C82F}.Release|ARM64.Build.0 = Release|ARM64
+               {C5185F6D-BC0A-4DF7-A63C-B107D1C9C82F}.Release|Win32.ActiveCfg = Release|Win32
+               {C5185F6D-BC0A-4DF7-A63C-B107D1C9C82F}.Release|Win32.Build.0 = Release|Win32
+               {C5185F6D-BC0A-4DF7-A63C-B107D1C9C82F}.Release|x64.ActiveCfg = Release|x64
+               {C5185F6D-BC0A-4DF7-A63C-B107D1C9C82F}.Release|x64.Build.0 = Release|x64
                {203EC78A-0531-43F0-A636-285439BDE025}.Debug|ARM.ActiveCfg = Debug|ARM
                {203EC78A-0531-43F0-A636-285439BDE025}.Debug|ARM.Build.0 = Debug|ARM
                {203EC78A-0531-43F0-A636-285439BDE025}.Debug|ARM64.ActiveCfg = Debug|ARM64

.gitignore

icu4c/source/tools/toolutil/writesrc.cpp

icu4c/source/tools/upropdump/upropdump.cpp

markusicu · 2021-06-22T00:36:24Z

icu4c/source/tools/upropdump/upropdump.cpp

+    const UCPMap* umap = u_getIntPropertyMap(uproperty, status);
+    handleError(status, fullPropName);
+
+    fputs("[code_point_map.data]\n", f);


I would use [binary_property.data] and [enum_property.data] -- naming it for what it is rather than what data structure we use.

icu4c/source/tools/toolutil/writesrc.cpp

icu4c/source/tools/upropdump/upropdump.cpp

markusicu · 2021-06-22T00:55:06Z

> > Just a quick question; the ranges field is just a different format for the serialized field, right?
>
> Correct. The two fields are intended to contain the same information in two different formats.

Yes, in principle. The "serialized" field in the current version contains the uint16_t-serialized form that ICU4C uses in some places. The first value has multiple fields & flags. Any range limit above 0xffff is split into two: 0x103456 --> 0x10, 0x3456. See https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classicu_1_1UnicodeSet.html#a9d26697666c30ec74d5955ac735d04d7
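As a rough illustration of how the two fields relate, the following sketch decodes a "serialized" array into inclusive ranges. It assumes the layout described in the ICU4C documentation for UnicodeSet::serialize as I understand it: the low 15 bits of the first unit give the number of units that follow the header, a set high bit means a second header unit holds the count of BMP units, and each supplementary value is stored as two units (high 16 bits, then low 16 bits). `Range` and `decodeSerializedSet` are hypothetical names, and the code assumes well-formed input.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Inclusive code point range [first, second].
using Range = std::pair<uint32_t, uint32_t>;

// Decodes a uint16_t-serialized set into inclusive ranges.
// Returns an empty vector for empty or truncated input.
std::vector<Range> decodeSerializedSet(const std::vector<uint16_t>& s) {
    std::vector<Range> ranges;
    if (s.empty()) return ranges;

    // Header (assumed layout): low 15 bits = units after the header;
    // high bit set = a second header unit gives the number of BMP units.
    size_t length = s[0] & 0x7fff;
    size_t bmpLength = length;
    size_t pos = 1;
    if (s[0] & 0x8000) {
        if (s.size() < 2) return ranges;
        bmpLength = s[1];
        pos = 2;
    }
    if (bmpLength > length || s.size() < pos + length) return ranges;

    // Rebuild the inversion list: alternating range starts and exclusive limits.
    std::vector<uint32_t> limits;
    for (size_t i = 0; i < bmpLength; ++i) {
        limits.push_back(s[pos + i]);
    }
    for (size_t i = bmpLength; i + 1 < length; i += 2) {
        // A value above 0xffff is split in two, e.g. 0x103456 --> 0x10, 0x3456.
        limits.push_back((static_cast<uint32_t>(s[pos + i]) << 16) | s[pos + i + 1]);
    }

    // Pair up start/limit values; an unpaired final start runs to U+10FFFF.
    for (size_t i = 0; i < limits.size(); i += 2) {
        uint32_t start = limits[i];
        uint32_t end = (i + 1 < limits.size()) ? limits[i + 1] - 1 : 0x10ffff;
        ranges.emplace_back(start, end);
    }
    assert(ranges.size() == (limits.size() + 1) / 2);
    return ranges;
}
```

Applied to the whitespace example from the PR description, the first unit 0x14 says that 20 units follow, and the pairs (9, 0xe), (0x20, 0x21), and so on map to the ranges [0x9, 0xd], [0x20, 0x20], and so on.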

iainireland · 2021-08-30T17:21:53Z

What additional work is needed to support Script_Extensions?

sffc · 2021-08-30T18:29:00Z

> What additional work is needed to support Script_Extensions?

Good question. We will likely need to add another code path for Script_Extensions that makes use of the specialized uscript_getScriptExtensions function.

@markusicu -- what is a good data structure to use to export Script_Extensions?

I may prefer to do this in a follow-up PR.

markusicu · 2021-08-30T20:57:20Z

> > What additional work is needed to support Script_Extensions?
>
> Good question. We will likely need to add another code path for Script_Extensions that makes use of the specialized uscript_getScriptExtensions function.
>
> @markusicu -- what is a good data structure to use to export Script_Extensions?

Is there an icu4x issue, or email thread, where we can discuss this?

> I may prefer to do this in a follow-up PR.

yes!

markusicu · 2021-08-30T21:04:18Z

(Ignore if you don't have/add %lu here...)

On %lu vs. cast to long: fprintf() assumes that arguments are put on the stack in accordance with the string format. It's not type-safe at runtime. chars and shorts are always cast to int before they are pushed onto the stack, but I don't think that arguments necessarily go up to long. So while it may happen to work, I don't think it's quite portable to use %lu when the type may be a narrower int. Please cast to long just to be safe.
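A minimal sketch of the advice above, not taken from the PR: when a value of a fixed-width type such as uint32_t is passed to a printf-style function, cast it explicitly so the argument type matches the format specifier. Since %lu expects an unsigned long, the cast here is to unsigned long. `formatCount` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Formats a 32-bit count with %lu. uint32_t may be narrower than unsigned long,
// and variadic arguments are not converted to match the format string,
// so the value is cast explicitly to the type that %lu expects.
std::string formatCount(uint32_t value) {
    char buf[32];
    int n = std::snprintf(buf, sizeof(buf), "%lu",
                          static_cast<unsigned long>(value));
    assert(n > 0 && static_cast<size_t>(n) < sizeof(buf));
    return std::string(buf, static_cast<size_t>(n));
}
```

For example, `formatCount(3365)` yields the string "3365" regardless of whether unsigned long is 32 or 64 bits wide on the platform.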

icu4c/source/tools/icuwriteuprops/Makefile.in

icu4c/source/tools/icuwriteuprops/icuwriteuprops.1.in

icu4c/source/tools/toolutil/writesrc.cpp

icu4c/source/tools/icuwriteuprops/icuwriteuprops.cpp

sffc · 2021-08-31T00:26:50Z

> > > What additional work is needed to support Script_Extensions?
> >
> > Good question. We will likely need to add another code path for Script_Extensions that makes use of the specialized uscript_getScriptExtensions function.
> > @markusicu -- what is a good data structure to use to export Script_Extensions?
>
> Is there an icu4x issue, or email thread, where we can discuss this?

https://unicode-org.atlassian.net/browse/ICU-21545

sffc

I tested the new code against emoji and it works as expected. Thanks for catching that issue!

icu4c/source/tools/icuwriteuprops/icuwriteuprops.cpp

icu4c/source/tools/toolutil/writesrc.cpp

icu4c/source/tools/icuwriteuprops/icuwriteuprops.cpp

icu4c/source/tools/toolutil/writesrc.cpp

icu4c/source/tools/icuwriteuprops/Makefile.in

icu4c/source/tools/icuwriteuprops/icuwriteuprops.cpp

icu4c/source/tools/toolutil/writesrc.h

markusicu

changes lgtm pse squash

markusicu · 2021-09-08T19:09:34Z

I didn't see the helpful "force-pushed with no changes" notification (nor one like "force-pushed, look at diffs here"). Is that bot down?

markusicu

lgtm tnx -- assuming that the squash didn't change anything (hard to tell)

markusicu · 2021-09-08T19:17:30Z

Now that PR #1848 is merged you might want to run the tool again and look at the output for the emoji properties of strings :-)

sffc · 2021-09-08T20:27:36Z

Hmm, I don't know why the bot didn't post an update as it usually does. However, when I performed the squash, I used the web UI, and I did a sanity check, and all looks OK.

> Now that PR #1848 is merged you might want to run the tool again and look at the output for the emoji properties of strings :-)

Cool! I would like to merge this PR first, to get it in, and if any changes are needed for emoji properties of strings, I'll cover them in a follow-up.

sffc requested review from markusicu, echeran and jefgen June 12, 2021 08:04

sffc marked this pull request as ready for review June 12, 2021 08:04

sffc mentioned this pull request Jun 13, 2021

Load ICU4C data export into ICU4X unicode-org/icu4x#578

Closed

echeran previously approved these changes Jun 16, 2021

markusicu self-assigned this Jun 16, 2021

markusicu requested changes Jun 22, 2021

This was referenced Jul 22, 2021

Import ICU4C binary properties data into ICU4X unicode-org/icu4x#882

Closed

Import ICU4C enumerated properties data into ICU4X unicode-org/icu4x#883

Closed

Minimal uprops provider unicode-org/icu4x#885

Merged

sffc dismissed echeran’s stale review via 617f6c5 August 19, 2021 20:10

sffc added a commit to sffc/icu that referenced this pull request Aug 19, 2021

Refactor usrc_writeUnicodeSet

13ee49a

Resolves: - unicode-org#1741 (comment) - unicode-org#1741 (comment) - unicode-org#1741 (comment)

sffc requested a review from markusicu August 19, 2021 23:37

sffc mentioned this pull request Aug 26, 2021

Tracking issue: ICU-21545 Export UCPTrie data to array buffers unicode-org/icu4x#509

Closed

sffc added the waiting-on-reviewer label Aug 30, 2021

markusicu changed the title from "ICU-21545 Add upropdump tool" to "ICU-21545 Add icuwriteuprops tool" Aug 30, 2021

markusicu reviewed Aug 30, 2021

markusicu removed the waiting-on-reviewer label Aug 30, 2021

sffc commented Sep 2, 2021

sffc requested a review from markusicu September 2, 2021 01:27

markusicu reviewed Sep 2, 2021

ICU-21545 Add icuwriteuprops tool

619b4a3

See unicode-org#1741

sffc force-pushed the ICU-21545-propdump branch from fd9bcb1 to 619b4a3 Compare September 7, 2021 22:16

sffc requested a review from markusicu September 7, 2021 22:16

markusicu approved these changes Sep 8, 2021

sffc merged commit 92db251 into unicode-org:main Sep 8, 2021

sffc deleted the ICU-21545-propdump branch September 8, 2021 20:27

markusicu mentioned this pull request Sep 27, 2021

ICU-21545 fix Unicode properties Bazel build #1883

Merged

7 tasks

yumaoka mentioned this pull request Mar 14, 2022

ICU-21900 BRS71 Updated serialization test data for 71.1 #2032

Merged

7 tasks
