Load ICU4C data export into ICU4X #578

Closed
sffc opened this issue Mar 27, 2021 · 5 comments
Labels:
  C-data-infra (Component: provider, datagen, fallback, adapters)
  duplicate (This issue or pull request already exists)
  S-medium (Size: Less than a week (larger bug fix or enhancement))
  T-core (Type: Required functionality)

@sffc
Member

sffc commented Mar 27, 2021

After the tracking issue #509 is finished, we should pull the new ICU4C data export file into ICU4X. The data should live alongside the CLDR JSON data in the testdata project and use the same download mechanism (#414). It should serve as a data source both for the CodePointTries and for ppucd.txt (split from #576).

@sffc sffc added T-core Type: Required functionality C-data-infra Component: provider, datagen, fallback, adapters blocked A dependency must be resolved before this is actionable labels Mar 27, 2021
@sffc sffc self-assigned this Apr 19, 2021
@sffc sffc added this to the ICU4X 0.3 milestone Apr 19, 2021
@sffc sffc added the S-medium Size: Less than a week (larger bug fix or enhancement) label May 14, 2021
@sffc
Member Author

sffc commented Jun 13, 2021

CC @iainireland, @echeran, @zbraniecki, @Manishearth

I implemented the ICU4C side of this in unicode-org/icu#1741. It produces TOML files such as:

Binary property example:

# White_Space.toml

[unicode_set.data]
long_name = "White_Space"
name = "WSpace"
serialized = [
  0x14,9,0xe,0x20,0x21,0x85,0x86,0xa0,0xa1,0x1680,0x1681,0x2000,0x200b,0x2028,0x202a,0x202f,
  0x2030,0x205f,0x2060,0x3000,0x3001
]
ranges = [
  [0x9, 0xd],
  [0x20, 0x20],
  [0x85, 0x85],
  [0xa0, 0xa0],
  [0x1680, 0x1680],
  [0x2000, 0x200a],
  [0x2028, 0x2029],
  [0x202f, 0x202f],
  [0x205f, 0x205f],
  [0x3000, 0x3000],
]
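
One way to read the binary property example: each inclusive `[start, end]` row in `ranges` seems to contribute the pair `start, end + 1` to `serialized`, and the leading `0x14` (20) matches the number of boundaries that follow. Below is a minimal Rust sketch of that relationship, assuming this reading of the format is correct. `ranges_to_serialized` and `WHITE_SPACE_RANGES` are hypothetical names for illustration, not ICU4C or ICU4X APIs.

```rust
/// The `ranges` rows from White_Space.toml, as inclusive (start, end) pairs.
const WHITE_SPACE_RANGES: [(u32, u32); 10] = [
    (0x9, 0xd),
    (0x20, 0x20),
    (0x85, 0x85),
    (0xa0, 0xa0),
    (0x1680, 0x1680),
    (0x2000, 0x200a),
    (0x2028, 0x2029),
    (0x202f, 0x202f),
    (0x205f, 0x205f),
    (0x3000, 0x3000),
];

/// Converts sorted, non-adjacent inclusive ranges into a length-prefixed
/// inversion list (assumed layout: count, then start / exclusive-end pairs).
fn ranges_to_serialized(ranges: &[(u32, u32)]) -> Vec<u32> {
    let mut out = Vec::with_capacity(1 + ranges.len() * 2);
    // Assumed length prefix: the number of boundaries that follow.
    out.push((ranges.len() * 2) as u32);
    for &(start, end) in ranges {
        out.push(start); // first code point inside the range
        out.push(end + 1); // first code point after the range
    }
    out
}
```

Under that assumption, `ranges_to_serialized(&WHITE_SPACE_RANGES)` reproduces the 21-element `serialized` array listed in White_Space.toml.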

Enumerated property example:

# General_Category.toml

[code_point_map.data]
long_name = "General_Category"
name = "gc"
ranges = [
  [0x0, 0x1f, 15, "Control", "Cc"],
  [0x20, 0x20, 12, "Space_Separator", "Zs"],
  [0x21, 0x23, 23, "Other_Punctuation", "Po"],
  [0x24, 0x24, 25, "Currency_Symbol", "Sc"],
  # ...
  [0x100000, 0x10fffd, 17, "Private_Use", "Co"],
  [0x10fffe, 0x10ffff, 0, "Unassigned", "Cn"],
]

[code_point_trie.struct]
long_name = "General_Category"
name = "gc"
index = [
  # ...
]
data_32 = [
  # ...
]
indexLength = 3366
dataLength = 9743
highStart = 0x110000
shifted12HighStart = 0x110
type = 1
valueWidth = 1
index3NullOffset = 0x787
dataNullOffset = 0x74a
nullValue = 0x0
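
To illustrate how the `[code_point_map.data]` rows could be consumed, here is a small Rust sketch that looks up a code point by binary search over the inclusive ranges. It uses only the six rows shown above (the elided rows stay elided, so most code points return `None`), and `RangeEntry`, `GC_SAMPLE`, and `lookup` are hypothetical names, not ICU4X types. It does not attempt to model the `[code_point_trie.struct]` lookup.

```rust
use std::cmp::Ordering;

/// One row of `ranges`: inclusive start, inclusive end, numeric value, short name.
struct RangeEntry {
    start: u32,
    end: u32,
    value: u8,
    short_name: &'static str,
}

/// Only the rows visible in the General_Category.toml excerpt.
const GC_SAMPLE: [RangeEntry; 6] = [
    RangeEntry { start: 0x0, end: 0x1f, value: 15, short_name: "Cc" },
    RangeEntry { start: 0x20, end: 0x20, value: 12, short_name: "Zs" },
    RangeEntry { start: 0x21, end: 0x23, value: 23, short_name: "Po" },
    RangeEntry { start: 0x24, end: 0x24, value: 25, short_name: "Sc" },
    RangeEntry { start: 0x100000, end: 0x10fffd, value: 17, short_name: "Co" },
    RangeEntry { start: 0x10fffe, end: 0x10ffff, value: 0, short_name: "Cn" },
];

/// Finds the row whose inclusive range contains `cp`, if any.
/// Assumes `entries` is sorted by `start` and non-overlapping.
fn lookup(entries: &[RangeEntry], cp: u32) -> Option<&RangeEntry> {
    entries
        .binary_search_by(|e| {
            if cp < e.start {
                Ordering::Greater // this row lies above the target
            } else if cp > e.end {
                Ordering::Less // this row lies below the target
            } else {
                Ordering::Equal // the target falls inside this row
            }
        })
        .ok()
        .map(|i| &entries[i])
}
```

For example, `lookup(&GC_SAMPLE, 0x24)` finds the `Sc` row with value 25, while `lookup(&GC_SAMPLE, 0x41)` returns `None` because that range is not part of the excerpt.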

The next steps are as follows:

  1. Add the data files needed for irregexp to the testdata directory, perhaps at /provider/testdata/data/uprops/...
  2. Create a new DataProvider transformer that reads from these TOML files and produces PropertiesV1 data structs for them. Note that this step will involve building an ICU4X UnicodeSet by consuming the ranges in the TOML files.
  3. Plug that transformer into icu4x-datagen.
  4. Ensure that the data is coming through the uniset::props APIs correctly by adding more unit tests.
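
Steps 2 and 4 can be sketched roughly as follows: validate the `ranges` rows, build an inversion list from them, and check membership the way a unit test might. This is a sketch under stated assumptions, not the real ICU4X code: `SketchSet`, `from_ranges`, and `contains` are hypothetical stand-ins for the actual UnicodeSet and uniset::props APIs, and TOML parsing is left out.

```rust
/// Hypothetical stand-in for an ICU4X UnicodeSet; not the real type.
#[derive(Debug, PartialEq)]
struct SketchSet {
    /// Sorted boundaries: even indices open a range, odd indices close it (exclusive).
    inv_list: Vec<u32>,
}

impl SketchSet {
    /// Builds the set from inclusive `[start, end]` rows, rejecting rows that
    /// are reversed, out of order, overlapping, or beyond U+10FFFF.
    fn from_ranges(ranges: &[[u32; 2]]) -> Result<Self, String> {
        let mut inv_list: Vec<u32> = Vec::with_capacity(ranges.len() * 2);
        for &[start, end] in ranges {
            if start > end || end > 0x10FFFF {
                return Err(format!("invalid range {:#x}..={:#x}", start, end));
            }
            match inv_list.last().copied() {
                Some(prev_end) if start < prev_end => {
                    return Err(format!("range starting at {:#x} overlaps or is unsorted", start));
                }
                // Adjacent to the previous range: drop its end so the two merge.
                Some(prev_end) if start == prev_end => {
                    inv_list.pop();
                }
                _ => inv_list.push(start),
            }
            inv_list.push(end + 1);
        }
        Ok(SketchSet { inv_list })
    }

    /// Returns true if `cp` is in the set.
    fn contains(&self, cp: u32) -> bool {
        // Count the boundaries <= cp; an odd count means `cp` is inside a range.
        self.inv_list.partition_point(|&b| b <= cp) % 2 == 1
    }
}
```

A unit test in the spirit of step 4 could build the set from the White_Space rows and assert, for instance, that U+0020 and U+200A are members while U+0041 and U+200B are not.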

I do not need to do this, and it would be good for team education if someone else did it. I will of course be available to advise. Volunteers? If there are no volunteers, I will try to do this myself before the end of the quarter.

@iainireland
Contributor

iainireland commented Jun 16, 2021

This looks really good and useful! Ashwini will need this for her irregexp work.

Edit: specifically, Ashwini will need the uniset::props API, and (unless I'm completely misunderstanding how cbindgen works) FFI bindings to use that API from C++.

@sffc sffc removed the blocked A dependency must be resolved before this is actionable label Jun 24, 2021
@sffc sffc assigned iainireland and unassigned sffc Jun 24, 2021
@sffc sffc modified the milestones: ICU4X 0.3, ICU4X 0.4 Jul 22, 2021
@sffc
Member Author

sffc commented Oct 19, 2021

Closing this issue as a duplicate of #148

@sffc sffc closed this as completed Oct 19, 2021
@sffc sffc added the duplicate This issue or pull request already exists label Oct 19, 2021
3 participants