Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

C/C++ FFI for BakedDataProvider #2743

Closed
makotokato opened this issue Oct 14, 2022 · 25 comments
Closed

C/C++ FFI for BakedDataProvider #2743

makotokato opened this issue Oct 14, 2022 · 25 comments
Assignees
Labels
C-data-infra Component: provider, datagen, fallback, adapters S-epic Size: Major project (create smaller child issues) U-gecko User: Gecko

Comments

@makotokato
Copy link
Member

icu_capi cannot handle BakedDataProvider directly. Is there a way to handle it without modifying icu_capi?

Of course, we can handle it if we modify icu_capi like the following.

#[cfg(feature = "any_provider")]
include!(concat!(env!("CARGO_MANIFEST_DIR"), "/data/mod.rs"));
#[cfg(feature = "any_provider")]
include!(concat!(env!("CARGO_MANIFEST_DIR"), "/data/any.rs"));

#[diplomat::rust_link(BakedDataProvider, Struct)]
pub fn create_baked() -> Box<ICU4XDataProvider> {
    #[cfg(not(feature = "any_provider"))]
    panic!("Requires feature 'any_provider'");

    #[cfg(feature = "any_provider")]
    convert_any_provider(BakedDataProvider)
@sffc
Copy link
Member

sffc commented Oct 14, 2022

CC @Manishearth @robertbastian. We were discussing this.

@Manishearth
Copy link
Member

Yeah so there are two ways to do this:

  • Edit this into the icu_capi code, as mentioned
  • Write a new crate like icu_capi_staticlib, that contains this code. It does not use Diplomat, instead it uses a manual extern "C" function that returns a *mut ICU4XDataProvider. You can recoup the ICU4XDataProvider C++ object on the other side using the ICU4XDataProvider constructor from a capi::ICU4XDataProvider *.

The second is the preferred option. In general users are likely to want to depend on a shim crate like icu_capi_staticlib instead of directly on icu_capi, and you can put stuff in manually there.

In the long run, we would like datagen to autogenerate this code and the headers for it1, but it doesn't right now

Footnotes

  1. Just for the baked provider construction API; it should assume the existence of the other headers.

@sffc
Copy link
Member

sffc commented Oct 17, 2022

A few sub-issues:

  1. Document status quo
  2. Implement better solution (start with C)
  3. Integrate with Diplomat

Requires knowledge of how our FFI works.

@sffc sffc added this to the ICU4X 1.x Untriaged milestone Oct 17, 2022
@sffc sffc added good first issue Good for newcomers help wanted Issue needs an assignee S-epic Size: Major project (create smaller child issues) and removed good first issue Good for newcomers labels Oct 17, 2022
@sffc
Copy link
Member

sffc commented Dec 2, 2022

@hsivonen had some more ideas here, such as a nullptr data provider indicating that data should instead be loaded from a BakedDataProvider that never itself gets exposed over FFI.

@hsivonen
Copy link
Member

hsivonen commented Dec 2, 2022

I think the key observations are:

  1. If the app knows it's going to only use a BakedDataProvider, the provider reference doesn't need to cross FFI. Instead, the generated part of icu_capi could materialize a reference to the BakedDataProvider at the points where a provider is passed to the Rust side's _unstable constructor.
  2. If the first point happens and also the app uses its own C++ types for spans and strings, the app needs its own C++ header generation run anyway, at which point it would make sense to be able to request that Diplomat omit the provider argument in the generated C++.
  3. Even if the first and second point happen, it might still make sense to keep a meaningless provider argument on the C layer. This way, it would be possible for the app to incorporate third-party code that uses the ICU4X FFI and assumes the upstream ICU4X C ABI. (These uses would go to the BakedDataProvider per the first point regardless of what the third-party code passes as the provider.)

@sffc sffc removed the good first issue Good for newcomers label Dec 2, 2022
@sffc
Copy link
Member

sffc commented Dec 2, 2022

I think either myself or @robertbastian should take this issue.

@robertbastian
Copy link
Member

Without diving into this, I don't think a null pointer is the right API design in 2022.

I would like to explore extern "Rust" to separate icu_capi from the baked data.

@Manishearth
Copy link
Member

  • If the app knows it's going to only use a BakedDataProvider, the provider reference doesn't need to cross FFI. Instead, the generated part of icu_capi could materialize a reference to the BakedDataProvider at the points where a provider is passed to the Rust side's _unstable constructor.

So this seems pretty invasive to me. I think a very important thing to consider is how we can do this in a way that makes sense generally from Diplomat's POV.

Perhaps via a #[diplomat::attr(inject-mode, inject(some rust expression))] where we annotate each provider parameter with it, and when diplomat is run with inject-mode it generates APIs without the parameter, and instead initializes them from the injected code.

Note that from Diplomat's POV it would still need to be an instance of ICU4XDataProvider, we can't get rid of that indirection that easily. I guess it could be inject(some rust expression, some rust type) where the type is also replaced.

You'd also still need to support a way to inject the baked data into icu_capi. I suspect it can be done with some CFGs that require the baked data be found in some folder or the other.

In general I'm wary of jumping towards a solution in Diplomat for stuff like this, because it often feels kinda niche and while I definitely want to add Diplomat features I want them to be generally useful so that we can design them right.


There's another solution which I find to be cleaner: What if we didn't change Diplomat at all, but icu_capi's ICU4XDataProvider had a fourth Baked option, that when enabled requires baked data to exist in some folder (which we need to do anyway!)? Then users can disable any/buffer, and just set up the baked variant. This still involves the baked data provider crossing FFI, however it doesn't matter since if all other features are disabled, the impl of DataProvider for ICU4XDataProviderInner will always hit the baked provider (especially if we get rid of the Empty variant too), and this will all monomorphize correctly.

This can be done purely in CFG magic, and we'd add a configurable ICU4XDataProvider::new_baked()¹. The main tricky thing is figuring out how best to let people inject their own baked code (CFG + include!() + env!()?) but that's a problem inherent to this problem space.

¹ A thing I do want to do is make it possible for Diplomat to also respect CFGs and not generate headers in these cases.

@Manishearth
Copy link
Member

@robertbastian

I would like to explore extern "Rust" to separate icu_capi from the baked data.

What do you mean by this?

@Manishearth
Copy link
Member

Oh, also, a thing that does not solve the monomorphization problem, but does solve the problem of using baked providers, is cross-crate Diplomat support, which I also want to get.

@Manishearth
Copy link
Member

Technically we could implement the injection thing by doing this, btw:

fn try_new(#[cfg(not(baked))] provider: &..., options: ... ) -> ... {
    #[cfg(baked)] let provider = BakedProvider;

}

with the caveat that Diplomat does not yet support CFGs (but I plan to make that happen)

@sffc
Copy link
Member

sffc commented Dec 3, 2022

What if we didn't change Diplomat at all, but icu_capi's ICU4XDataProvider had a fourth Baked option, that when enabled requires baked data to exist in some folder

+1

@sffc sffc added C-data-infra Component: provider, datagen, fallback, adapters and removed help wanted Issue needs an assignee labels Dec 22, 2022
This was referenced Jan 5, 2023
@hsivonen hsivonen added the U-gecko User: Gecko label Feb 14, 2023
@Manishearth
Copy link
Member

@sffc @robertbastian and I decided today to go with the model in #2992 (as opposed to #2947), because it maintains the same model across Rust and FFI.

The basic idea is that we have a globaldata (name TBD) crate that contains a bunch of Rust code in macros, covering all baked testdata. Individual components use a macro to import the Rust code for the relevant keys, and expose try_new_with_globaldata() (name TBD) functions, behind a globaldata feature. FFI does something similar. Ideally components only macro-include the data for their own component, and if they need data from other components we can either doc(hidden) export LocalBakedData from each component and fork the provider, or we can use try_new() functions for that module. It's all internal, the precise strategy can be figured out.

globaldata can be replaced with a custom baked rust source using an env var.


There is one other model that we explored a bit but decided not to go for at the moment. It will be possible to do in the future and we can do it when someone needs it.

The main benefit it has is that it gives FFI the flexibility of having multiple subsets of baked data: with the globaldata model you either have all data, or you have one defined subset that's defined for the entire compilation tree. You can't do multiple.

However, a model we can also have is one where datagen is capable of generating baked data that additionally:

  • depends on icu_capi
  • has an FFI function that returns an ICU4XDataProvider constructed from the baked data
  • has the associated C++ bindings/etc

The final staticlib must depend on all such crates for compilation to work (this is not a problem if you're doing rlib-to-rlib-to-native linking the way Blaze does; you can continue doing that). But this gives FFI the level of flexibility pure Rust code already has when it comes to baked data.

Writing this down so that in the future if this need crops up, we have a path we can take.

@robertbastian

This comment was marked as outdated.

@Manishearth

This comment was marked as outdated.

@robertbastian

This comment was marked as outdated.

@robertbastian

This comment was marked as outdated.

@sffc
Copy link
Member

sffc commented May 24, 2023

Due to crates.io size limits, we need multiple smaller globaldata crates, so I think we should start with one data crate per component crate.

I put minimal value on ergonomics that primarily impact ICU4X engineers.

@robertbastian
Copy link
Member

I don't think the size limit is something that should affect our design, it's straightforward to get an exception. I don't want to end up with 15 data crates that all need to be replaced individually if they are vendored.

@sffc
Copy link
Member

sffc commented May 24, 2023

It's not just the crates.io limit. Smaller globaldata crates means that people who only want to build individual components of ICU4X don't need to download/vendor all data for all components, which is going to get quite large. It might also have a marginal improvement on build times, since the compiler doesn't need to tokenize all the macros we're not using.

(I thought we were already aligned on the idea of multiple smaller crates)

@robertbastian

This comment was marked as outdated.

@robertbastian
Copy link
Member

Name bikeshed!

  • We've used "globaldata" so far, but I think that's confusing. I actually thought it meant global as in global state, whereas Manish pointed out that it's meant to be global as in pertaining to all countries. "global" is a bit too overloaded in programming for my liking.
  • In Macro-based baked data #3449 I just called the feature data, which Manish pointed out is very generic.
  • builtin-data is a bit more specific, but isn't actually correct if the user provides their own data by environment variable
  • modern-data is another suggestion, as it contains the modern CLDR set, but has the same problem with BYOD
  • no-provider somewhat describes the type of constructors it unlocks, but it's a negative so weird as a feature name
  • baked-data is my favourite I think, it's a more precise version of builtin-data that conveys only the format, not where the data actually comes from

@Manishearth
Copy link
Member

It's also global as in global state I think 😄

@robertbastian
Copy link
Member

Note: All modern locales in one crate is 5.1MB for stable keys, 6.3MB for stable + experimental.

@robertbastian
Copy link
Member

I'm going to close this in favour of #2945 as it's basically discussing the same things, and the solution we're working on there trivially extends to FFI.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-data-infra Component: provider, datagen, fallback, adapters S-epic Size: Major project (create smaller child issues) U-gecko User: Gecko
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants