Skip to content

Commit

Permalink
lazysource
Browse files Browse the repository at this point in the history
  • Loading branch information
robertbastian committed Jan 4, 2023
1 parent d686c87 commit 117b4ff
Show file tree
Hide file tree
Showing 8 changed files with 256 additions and 171 deletions.
21 changes: 8 additions & 13 deletions docs/tutorials/data_management.md
Original file line number Diff line number Diff line change
Expand Up @@ -23,18 +23,13 @@ Get a coffee, this might take a while ☕.
Once installed, run:

```console
$ icu4x-datagen --cldr-tag latest --icuexport-tag latest --out my_data_blob.postcard --format blob --all-keys --all-locales
$ icu4x-datagen --format blob --out my_data_blob.postcard
```

Let's dissect this invocation:
* `--cldr-tag` selects the CLDR version to use
* `--icuexport-tag` selects the ICU-exported data version to use
* `--out` is the location where we want the generated ICU4X data to be stored
* `--format` sets the format of the output (we'll discuss formats later)
* `--all-keys` `--all-locales` specifies that we want to include all data for all locales

This will generate a `my_data_blob.postcard` file containing the serialized data for all components in all locales. The file is several megabytes large; we will optimize it later in the tutorial!

`icu4x-datagen` has many options, some of which we'll discover below. The default options should work for most purposes, but check out `icu4x-datagen --help` to learn more about fine-tuning your data.

# 3. Using the generated data

Once we have generated some data, it needs to be loaded as a data provider. The blob format we chose can be loaded by `BlobDataProvider` from the `icu_provider_blob` crate.
Expand Down Expand Up @@ -83,13 +78,13 @@ fn main() {

# 4. Data slicing

You might have noticed that the blob we generated is a hefty 13MB. This is no surprise, as we included `--all-keys` `--all-locales`. However, our binary only uses date formatting data in Japanese. There's room for optimization:
You might have noticed that the blob we generated is a hefty 13MB. This is no surprise, as by default all keys and for all locales are included. However, our binary only uses date formatting data in Japanese. There's room for optimization:

```console
$ icu4x-datagen --overwrite --cldr-tag latest --icuexport-tag latest --out my_data_blob.postcard --format blob --keys-for-bin target/debug/myapp --locales ja
$ icu4x-datagen --format blob --out my_data_blob.postcard --overwrite --keys-for-bin target/debug/myapp --locales ja
```

The `--keys-for-bin` argument tells `icu4x-datagen` to analyze the binary and only include keys that are used by its code. In addition, we know that we only need data for the Japanese locale. This significantly reduces the blob's file size, to 54KB, and our program still works. Quite the improvement!
The `--keys-for-bin` argument tells `icu4x-datagen` to analyze the binary and only include keys that are used by its code. In addition, we know that we only need data for the Japanese locale, which we can supply with the `--locales` argument. This significantly reduces the blob's file size, to 54KB, and our program still works. Quite the improvement!

But there is more to optimize. You might have noticed this in the output of the `icu4x-datagen` invocation, which lists 21 keys, including clearly irrelevant ones like `datetime/ethopic/datesymbols@1`. Remember how we had to convert our `DateTime<Gregorian>` into a `DateTime<AnyCalendar>` in order to use the `DateTimeFormatter`? Turns out, as `DateTimeFormatter` contains logic for many different calendars, datagen includes data for all of these as well.

Expand Down Expand Up @@ -140,7 +135,7 @@ The `mod` format will generate a Rust module that contains all the required data
Let's give it a try:

```console
$ icu4x-datagen --cldr-tag latest --icuexport-tag latest --out my-data-mod --format mod --keys-for-bin target/debug/myapp --locales ja
$ icu4x-datagen --format mod --out my-data-mod --keys-for-bin target/debug/myapp --locales ja
```

The output might tell you additional crates that need to be installed. Don't worry, these are transitive dependencies already anyway, but are required directly now to construct our data:
Expand Down Expand Up @@ -197,7 +192,7 @@ The `dir` format will generate a directory tree of data files in JSON (although
Let's give it a try:

```console
$ icu4x-datagen --cldr-tag latest --icuexport-tag latest --out my-data-dir --format dir --keys-for-bin target/debug/myapp --locales ja
$ icu4x-datagen --format dir --out my-data-dir --keys-for-bin target/debug/myapp --locales ja
```

This directory can be read by the `FsDataProvider` from the `icu_provider_fs` crate. You will also need to activate the Cargo feature for the chosen syntax on the `icu_provider` crate.
Expand Down
23 changes: 23 additions & 0 deletions provider/adapters/src/empty.rs
Original file line number Diff line number Diff line change
Expand Up @@ -86,3 +86,26 @@ where
Err(self.error_kind.with_req(M::KEY, base_req))
}
}

#[cfg(feature = "datagen")]
impl<M> icu_provider::datagen::IterableDataProvider<M> for EmptyDataProvider
where
M: KeyedDataMarker,
{
fn supported_locales(&self) -> Result<alloc::vec::Vec<DataLocale>, DataError> {
Ok(vec![])
}
}

#[cfg(feature = "datagen")]
impl<M> icu_provider::datagen::IterableDynamicDataProvider<M> for EmptyDataProvider
where
M: DataMarker,
{
fn supported_locales_for_key(
&self,
_: DataKey,
) -> Result<alloc::vec::Vec<DataLocale>, DataError> {
Ok(vec![])
}
}
129 changes: 50 additions & 79 deletions provider/datagen/src/bin/datagen.rs
Original file line number Diff line number Diff line change
Expand Up @@ -7,8 +7,6 @@ use eyre::WrapErr;

use icu_datagen::*;
use icu_locid::LanguageIdentifier;
use icu_provider::hello_world::HelloWorldV1Marker;
use icu_provider::prelude::*;
use icu_provider_fs::export::serializers::{bincode, json, postcard};
use simple_logger::SimpleLogger;
use std::path::PathBuf;
Expand All @@ -29,11 +27,11 @@ fn main() -> eyre::Result<()> {
Arg::with_name("FORMAT")
.long("format")
.takes_value(true)
.required(true)
.possible_value("dir")
.possible_value("blob")
.possible_value("mod")
.help("Output to a directory on the filesystem, a single blob, or a Rust module.")
.default_value("dir"),
.help("Select the output format: a directory tree of files, a single blob, or a Rust module."),
)
.arg(
Arg::with_name("OVERWRITE")
Expand All @@ -49,26 +47,28 @@ fn main() -> eyre::Result<()> {
.possible_value("json")
.possible_value("bincode")
.possible_value("postcard")
.help("File format syntax for data files."),
.help("--format=dir only: serde serialization format."),
)
.arg(
Arg::with_name("PRETTY")
.short("p")
.long("pretty")
.help("Whether to pretty-print the output files. Only affects JSON and Rust modules."),
.help("--format=mod, --format=dir only: pretty-print the Rust or JSON output files."),
)
.arg(
Arg::with_name("FINGERPRINT")
.long("fingerprint")
.help("Whether to add a fingerprints file to the output. Not compatible with --format=blob")
.help("--format=dir only: whether to add a fingerprints file to the output.")
)
.arg(
Arg::with_name("CLDR_TAG")
.short("t")
.long("cldr-tag")
.value_name("TAG")
.default_value("latest")
.help(
"Download CLDR JSON data from this GitHub tag (https://github.com/unicode-org/cldr-json/tags)\n\
Ignored if '--cldr-root' is present.\n\
Note that some keys do not support versions before 41.0.0.",
)
.takes_value(true),
Expand All @@ -79,7 +79,6 @@ fn main() -> eyre::Result<()> {
.value_name("PATH")
.help(
"Path to a local cldr-{version}-json-full.zip directory (see https://github.com/unicode-org/cldr-json/releases).\n\
Ignored if '--cldr-tag' is present.\n\
Note that some keys do not support versions before 41.0.0.",
)
.takes_value(true),
Expand All @@ -88,8 +87,10 @@ fn main() -> eyre::Result<()> {
Arg::with_name("ICUEXPORT_TAG")
.long("icuexport-tag")
.value_name("TAG")
.default_value("latest")
.help(
"Download Unicode Properties data from this GitHub tag (https://github.com/unicode-org/icu/tags)\n\
Ignored if '--icuexport-root is present.\n\
Note that some keys do not support versions before release-71-1."
)
.takes_value(true),
Expand All @@ -100,7 +101,6 @@ fn main() -> eyre::Result<()> {
.value_name("PATH")
.help(
"Path to a local icuexportdata_uprops_full directory (see https://github.com/unicode-org/icu/releases).\n\
Ignored if '--icuexport-tag is present.\n\
Note that some keys do not support versions before release-71-1.",
)
.takes_value(true),
Expand Down Expand Up @@ -150,7 +150,7 @@ fn main() -> eyre::Result<()> {
.takes_value(true)
.help(
"Include this resource key in the output. Accepts multiple arguments. \
Also see --key-file.",
Omit this option to include all keys, use 'none' to include no keys.",
),
)
.arg(
Expand All @@ -159,7 +159,7 @@ fn main() -> eyre::Result<()> {
.takes_value(true)
.help(
"Path to text file with resource keys to include, one per line. Empty lines \
and lines starting with '#' are ignored. Also see --key.",
and lines starting with '#' are ignored.",
),
)
.arg(
Expand All @@ -170,24 +170,11 @@ fn main() -> eyre::Result<()> {
"Analyzes the binary and only includes keys that are used by the binary."
),
)
.arg(
Arg::with_name("HELLO_WORLD")
.long("hello-world-key")
.help("Whether to include the 'hello world' key."),
)
.arg(
Arg::with_name("ALL_KEYS")
.long("all-keys")
.help("Include all keys known to ICU4X."),
)
.group(
ArgGroup::with_name("KEY_MODE")
.arg("KEYS")
.arg("KEY_FILE")
.arg("HELLO_WORLD")
.arg("ALL_KEYS")
.arg("KEYS_FOR_BIN")
.required(true),
.arg("KEYS_FOR_BIN"),
)
.arg(
Arg::with_name("LOCALES")
Expand All @@ -197,38 +184,28 @@ fn main() -> eyre::Result<()> {
.takes_value(true)
.help(
"Include this locale in the output. Accepts multiple arguments. \
Omit this option to include all locales.",
Omit this option to include all locales, use 'none' to include no locales.",
),
)
.arg(
Arg::with_name("ALL_LOCALES")
.long("all-locales")
.help("Include all locales present in the input CLDR JSON."),
)
.group(
ArgGroup::with_name("LOCALE_MODE")
.arg("LOCALES")
.arg("ALL_LOCALES")
.required(true),
)
.arg(
Arg::with_name("OUTPUT")
.short("o")
.long("out")
.help(
"Path to output directory or file. Must be empty or non-existent, unless \
--overwrite is present, in which case the directory is deleted first. \
For --format blob, omit this option to dump to stdout.",
For --format=blob, omit this option to dump to stdout. \
For --format={dir,mod} defaults to 'data'.",
)
.takes_value(true),
)
.arg(Arg::with_name("INSERT_FEATURE_GATES")
.long("insert-feature-gates")
.help("Module-mode only: Insert per-key feature gates for each key's crate.")
.help("--format=mod only: insert feature gates for individual `icu_*` crates. Requires --use-separate-crates")
)
.arg(Arg::with_name("USE_SEPARATE_CRATES")
.long("use-separate-crates")
.help("Module-mode only: Use types from individual crates instead of the `icu` meta-crate.")
.help("--format=mod only: use types from individual `icu_*` crates instead of the `icu` meta-crate.")
)
.get_matches();

Expand All @@ -245,28 +222,36 @@ fn main() -> eyre::Result<()> {
.unwrap()
}

let selected_locales = matches
.values_of("LOCALES")
.map(|ls| {
ls.map(|s| LanguageIdentifier::from_str(s).with_context(|| s.to_string()))
.collect::<Result<Vec<LanguageIdentifier>, eyre::Error>>()
})
.transpose()?;
let selected_locales = if matches.is_present("LOCALES") {
let locales = matches.values_of("LOCALES").unwrap();
if locales.len() == 1 && matches.value_of("LOCALES") == Some("none") {
Some(vec![])
} else {
Some(
locales
.map(|s| LanguageIdentifier::from_str(s).with_context(|| s.to_string()))
.collect::<Result<Vec<LanguageIdentifier>, eyre::Error>>()?,
)
}
} else {
None
};

let selected_keys = if matches.is_present("ALL_KEYS") {
icu_datagen::all_keys()
} else if matches.is_present("HELLO_WORLD") {
vec![HelloWorldV1Marker::KEY]
} else if let Some(paths) = matches.values_of("KEYS") {
icu_datagen::keys(&paths.collect::<Vec<_>>())
let selected_keys = if let Some(paths) = matches.values_of("KEYS") {
let keys = paths.collect::<Vec<_>>();
if keys == &["none"] {
vec![]
} else {
icu_datagen::keys(&keys)
}
} else if let Some(key_file_path) = matches.value_of_os("KEY_FILE") {
icu_datagen::keys_from_file(key_file_path)
.with_context(|| key_file_path.to_string_lossy().into_owned())?
} else if let Some(bin_path) = matches.value_of_os("KEYS_FOR_BIN") {
icu_datagen::keys_from_bin(bin_path)
.with_context(|| bin_path.to_string_lossy().into_owned())?
} else {
unreachable!();
icu_datagen::all_keys()
};

if selected_keys.is_empty() {
Expand All @@ -279,20 +264,20 @@ fn main() -> eyre::Result<()> {
};

let mut source_data = SourceData::default();
if Some("latest") == matches.value_of("CLDR_TAG") {
if let Some(path) = matches.value_of("CLDR_ROOT") {
source_data = source_data.with_cldr(PathBuf::from(path), cldr_locales)?;
} else if Some("latest") == matches.value_of("CLDR_TAG") {
source_data = source_data.with_cldr_latest(cldr_locales)?;
} else if let Some(tag) = matches.value_of("CLDR_TAG") {
source_data = source_data.with_cldr_for_tag(tag, cldr_locales)?;
} else if let Some(path) = matches.value_of("CLDR_ROOT") {
source_data = source_data.with_cldr(PathBuf::from(path), cldr_locales)?;
}

if Some("latest") == matches.value_of("ICUEXPORT_TAG") {
if let Some(path) = matches.value_of("ICUEXPORT_ROOT") {
source_data = source_data.with_icuexport(PathBuf::from(path))?;
} else if Some("latest") == matches.value_of("ICUEXPORT_TAG") {
source_data = source_data.with_icuexport_latest()?;
} else if let Some(tag) = matches.value_of("ICUEXPORT_TAG") {
source_data = source_data.with_icuexport_for_tag(tag)?;
} else if let Some(path) = matches.value_of("ICUEXPORT_ROOT") {
source_data = source_data.with_icuexport(PathBuf::from(path))?;
}

if matches.value_of("TRIE_TYPE") == Some("fast") {
Expand All @@ -310,15 +295,12 @@ fn main() -> eyre::Result<()> {
source_data.with_collations(collations.into_iter().map(String::from).collect());
}

let out = match matches
.value_of("FORMAT")
.expect("Option has default value")
{
let out = match matches.value_of("FORMAT").expect("required") {
"dir" => icu_datagen::Out::Fs {
output_path: matches
.value_of_os("OUTPUT")
.map(PathBuf::from)
.ok_or_else(|| eyre::eyre!("--out must be specified for --format=dir"))?,
.unwrap_or_else(|| PathBuf::from("data")),
serializer: match matches.value_of("SYNTAX") {
Some("bincode") => Box::new(bincode::Serializer::default()),
Some("postcard") => Box::new(postcard::Serializer::default()),
Expand All @@ -344,10 +326,11 @@ fn main() -> eyre::Result<()> {
mod_directory: matches
.value_of_os("OUTPUT")
.map(PathBuf::from)
.ok_or_else(|| eyre::eyre!("--out must be specified for --format=mod"))?,
.unwrap_or_else(|| PathBuf::from("data")),
pretty: matches.is_present("PRETTY"),
insert_feature_gates: matches.is_present("INSERT_FEATURE_GATES"),
use_separate_crates: matches.is_present("USE_SEPARATE_CRATES"),
overwrite: matches.is_present("OVERWRITE"),
},
_ => unreachable!(),
};
Expand All @@ -358,17 +341,5 @@ fn main() -> eyre::Result<()> {
&source_data,
vec![out],
)
.map_err(|e| -> eyre::ErrReport {
if icu_datagen::is_missing_cldr_error(e) {
eyre::eyre!(
"Either --cldr-tag or --cldr-root or --input-from-testdata must be specified"
)
} else if icu_datagen::is_missing_icuexport_error(e) {
eyre::eyre!(
"Either --icuexport-tag or --icuexport-root or --input-from-testdata must be specified"
)
} else {
e.into()
}
})
.map_err(eyre::ErrReport::from)
}
Loading

0 comments on commit 117b4ff

Please sign in to comment.