Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FFI baked data #2947

Closed
wants to merge 11 commits into from
Closed

FFI baked data #2947

wants to merge 11 commits into from

Conversation

robertbastian
Copy link
Member

@robertbastian robertbastian commented Jan 3, 2023

Fixes #2743

  • Added a COMPLETE option to databake:

    struct Provider;
    include!(...);
    impl_data_provider!(Provider, COMPLETE);

    This will implement all keys on Provider, even those that weren't generated. We need this as otherwise icu_capi won't compile unless it has complete data, but we don't want to rely just on DCE.

  • Added a baked provider to icu_capi. This uses data from the ICU4_FFI_BAKED_ROOT environment variable, or falls back to empty data

  • Split the Empty variant into Empty and Destroyed

    • Destroyed always returns errors (I wonder if we should just make it UB). We should teach Diplomat about ownership so we can let the compatibility layer deal with this
    • Empty now supports fork_by_key, fork_by_locale and enable_fallback (basically no-ops)
  • Updated the fixeddecimal_tiny example to use databake: 26k -> 21k

  • Removed any_provider as a default feature. The baked provider doesn't go through dyn AnyProvider, unless locale fallback is enabled, then we need the feature.

@robertbastian robertbastian force-pushed the ffibake branch 3 times, most recently from 4b77e6f to 0eecb14 Compare January 4, 2023 21:14
@robertbastian robertbastian force-pushed the ffibake branch 2 times, most recently from 029374e to 4c77d6f Compare January 4, 2023 21:55
@robertbastian robertbastian marked this pull request as ready for review January 5, 2023 17:35
@robertbastian robertbastian requested review from Manishearth, sffc and a team as code owners January 5, 2023 17:35
@robertbastian
Copy link
Member Author

Commits before 32afdd1 are from #2950

provider/datagen/src/databake.rs Show resolved Hide resolved
// Empty data generated with
// cargo run -p icu_datagen --features bin,icu_segmenter -- --format mod --use-separate-crates --keys none --out empty_bake
println!(
"cargo:rustc-env=ICU4X_FFI_BAKED_ROOT={}/empty_bake",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: we shouldn't break the ability to build this crate without build scripts working; there are a bunch of build environments that are bad at running build scripts and ideally this should be something completely unnecessary to run 99% of the time.

I guess we can just ask people using strange build systems to just set this env var, but idk.

I was under the impression baked data was going to be behind a feature?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this is why the GN CI is failing. I was hoping we wouldn't need to feature-gate.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't immediately see why we need a build.rs to do this computation? Can we use std::option_env instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

include! requires a string literal argument, which core::env!("FOO") is, and core::option_env!("FOO").unwrap_or_else("bar") isn't.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rust-lang/rust#48952

We could use https://crates.io/crates/default-env, but that's a proc macro. Still better than a build script I guess?

Copy link
Member

@Manishearth Manishearth Jan 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clients would just have to pass in env=empty.rs for that to work. I don't consider that a huge change given that data slicing is anyway a weird cyclic part of the build with baked data.

(Literally can't see a clean way to do it with baked data in build systems like Blaze, for example)

Copy link
Member

@sffc sffc Jan 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like it if we got automatic DCE-based data slicing in Bazel/GN

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bazel/etc really do not like cyclic deps, this would involve a genrule that is able to analyze the build graph, copy it over, and run it once. The compiled binary cannot be an input for generating code

I strongly suggest we make the recommended workflow there to use key files, and provide tools for manually generating that key file (which is fine!)

Anyway, we should have this discussion elsewhere I think (i'm filing an issue for a generic discussion that keeps cropping up, but we need another for how we plan to make keyextract work with such build systems

Copy link
Member

@sffc sffc Jan 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I mean is that it would be nice if in Bazel/GN using baked data that we don't need to run datagen against a key file or the binary; that the dataful DataProvider<M> impls can be DCEd.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. We could, though that's going to become a problem again if combinators are used.

Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should solve #2945 first?

Comment on lines 318 to 320
ICU4XDataProviderInner::Baked => panic!(
"Locale fallback for baked providers requires the 'any_provider' feature"
),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thought/Suggestion: You can do this without any_provider by adding another variant to the enum, BakedFallback or something like that. I think it's a common enough case to warrant that.

Nit: I think panic! might be too strong for this case

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried that. Problem is that fallback doesn't get DCE'd then.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I didn't try exactly that, I tried Baked(Option<LocaleFallbacker>). I guess having a separate variant will still work with DCE, but we get exponential blowup of forking.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to teach diplomat about cfg, then this whole method can be cfg(any(feature = "any_provider", feature = "buffer_provider"))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now we do cfgs inside the methods

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried again with an extra BakedFallback(Vec<LocaleFallbacker>) option, but the compiler cannot figure out that a variant is never used, so it always pulls in the fallback code (which has to be in load).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the variant should also be cfg'd off, yes?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would require a separate baked_fallback feature. I'm trying to reduce feature complexity these days...

ICU4XDataProvider(ICU4XDataProviderInner::Baked)
}
#[cfg(feature = "any_provider")]
(ICU4XDataProviderInner::Any(a), ICU4XDataProviderInner::Baked) => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thought/Issue: I think this means that if the caller invokes fork_by_key, even if they aren't using the baked provider, they get the massive Baked AnyProvider impl, with no way to disable it; even if it is an empty baked provider, that's still a lot of vtable and ZeroFrom

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, but the AnyProvider itself will be the baked provider wrapped in locale fallback or forking, so there's no additional vtable.

// Empty data generated with
// cargo run -p icu_datagen --features bin,icu_segmenter -- --format mod --use-separate-crates --keys none --out empty_bake
println!(
"cargo:rustc-env=ICU4X_FFI_BAKED_ROOT={}/empty_bake",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't immediately see why we need a build.rs to do this computation? Can we use std::option_env instead?

@jira-pull-request-webhook
Copy link

Notice: the branch changed across the force-push!

  • .github/workflows/build-test.yml is no longer changed in the branch
  • docs/tutorials/data_management.md is no longer changed in the branch
  • ffi/diplomat/js/examples/node/build.sh is no longer changed in the branch
  • ffi/tinywasm/build.sh is no longer changed in the branch
  • provider/adapters/src/empty.rs is no longer changed in the branch
  • provider/blob/README.md is no longer changed in the branch
  • provider/blob/src/lib.rs is no longer changed in the branch
  • provider/datagen/Cargo.toml is no longer changed in the branch
  • provider/datagen/README.md is no longer changed in the branch
  • provider/datagen/src/bin/datagen.rs is no longer changed in the branch
  • provider/datagen/src/lib.rs is no longer changed in the branch
  • provider/datagen/src/source.rs is no longer changed in the branch
  • provider/datagen/src/transform/cldr/source.rs is no longer changed in the branch
  • provider/fs/README.md is no longer changed in the branch
  • provider/fs/src/lib.rs is no longer changed in the branch
  • provider/testdata/src/bin/datagen.rs is no longer changed in the branch
  • tools/scripts/data.toml is no longer changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

if (current_cpu == "wasm32") {
deps += [ ":log-v0_4_14" ]
}

rustenv = []
rustenv = [ "ICU4X_FFI_BAKED_ROOT=../empty_bake" ]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'd still like there to be a feature so that this isn't necessary

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@robertbastian
Copy link
Member Author

I'm blocking this on #3005 because it would add another feature to datagen.

@gregtatum gregtatum removed their request for review January 25, 2023 19:39
@sffc
Copy link
Member

sffc commented Feb 22, 2023

As discussed: start with the approach listed in #2992 and maybe revisit this later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

C/C++ FFI for BakedDataProvider
3 participants