Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

First iteration of documentation for API boundary string representation #49

Merged
merged 5 commits into from
May 8, 2020

Conversation

hsivonen
Copy link
Member

Closes #14.

@CLAassistant
Copy link

CLAassistant commented Apr 20, 2020

CLA assistant check
All committers have signed the CLA.

@hsivonen hsivonen force-pushed the stringrepr branch 4 times, most recently from 3d3c5f7 to 0ea2195 Compare April 20, 2020 14:28
@hsivonen
Copy link
Member Author

Hmm. Maybe I should replace the appender function last resort with an ICU4X-allocated repr(C) string object after all.

@hsivonen
Copy link
Member Author

Hmm. Or maybe there isn't even a need for a repr(C) object but just the buffer taken from a Rust Vec/String and its length in an outparam. I'll update the doc accordingly.

@hsivonen
Copy link
Member Author

Updated with guidance that aims to allocate once in the best case, with example of the case where there's no good worst-case estimation, and with a C++ wrapper sample.

@sffc
Copy link
Member

sffc commented Apr 24, 2020

Thanks for this! Looks great overall.

One thought to start: It's common that ICU and 402 functions either accept string input (collation, segmentation, parsing) or return string output (formatting), but not necessarily both. Although there are cases of functions that take strings as both input and output (you have case mapping as an example), ICU functions frequently map between strings and machine types (doubles, integers, etc). Therefore, I'm curious how this affects the memory allocation section where you discuss the need to make a guess at the worst-case output string size.

@nciric
Copy link
Contributor

nciric commented Apr 24, 2020 via email

@sffc
Copy link
Member

sffc commented Apr 25, 2020

I think it would be relevant to this doc to discuss a scenario such as the formatting APIs where the length of the output string is known after some logic is performed. For example, could ICU4X "ask" the host for a buffer of a certain size via a function pointer? Is that conventional?

Here's an example:

type FnAlloc<T> = dyn Fn(usize) -> Box<[T]>;

fn format(_input: f64, alloc: &FnAlloc<u8>) -> Box<[u8]> {
    // Dummy implementation
    let mut result = alloc(3);
    result[0] = 0x41;
    result[1] = 0x42;
    result[2] = 0x43;
    return result;
}

fn alloc_impl(len: usize) -> Box<[u8]> {
    vec![0u8; len].into_boxed_slice()
}

fn main() {
    let result_box = format(1.0, &alloc_impl);
    let result_string = String::from_utf8(result_box.to_vec()).unwrap();
    println!("{}", result_string);
}

https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=eb2248890ed1a746ba3155b7a31c62c9

@macchiati
Copy link
Member

macchiati commented Apr 25, 2020 via email

@hsivonen
Copy link
Member Author

Although there are cases of functions that take strings as both input and output (you have case mapping as an example), ICU functions frequently map between strings and machine types (doubles, integers, etc).

Per discussion on the call, I'll add clarifying notes that we'd end up using what the document now positions as the last resort case way more often than the document now implies.

For example, could ICU4X "ask" the host for a buffer of a certain size via a function pointer? Is that conventional?

So ICU4X would call the allocator, but what the allocator is would be caller-customatizable instead of being a matter of customizing the global allocator for Rust code at build time.

Is that use case something that would actually be useful compared to either

  1. Using the same allocator for Rust code and the caller, in which case it's OK to use C free() to free buffers returned by ICU4X.
  2. Building all Rust code in the app with the specific allocator that you want instead of passing its entry point to each ICU4X call.

That is, is there a use case worth supporting where you'd want to have a different allocator for buffers ICU4X and for other Rust code?

However, it is often not too much work to compute a maximum size, and report the size actually used afterwards. What about that?

This is what the slice-writing APIs would do. Additionally, over the weekend, I figured out how to use Rust String/Vec<u16> in a way that moves the shrink-to-fit policy to the caller. This would expose both the used length and the capacity to the FFI caller.

I'll add a new changeset to this PR.

@hsivonen
Copy link
Member Author

Additionally, over the weekend, I figured out how to use Rust String/Vec<u16> in a way that moves the shrink-to-fit policy to the caller.

Unfortunately, the API I had in mind is nightly-only still.

@sffc
Copy link
Member

sffc commented Apr 27, 2020

So ICU4X would call the allocator, but what the allocator is would be caller-customatizable instead of being a matter of customizing the global allocator for Rust code at build time.

Is that use case something that would actually be useful compared to either

  1. Using the same allocator for Rust code and the caller, in which case it's OK to use C free() to free buffers returned by ICU4X.
  2. Building all Rust code in the app with the specific allocator that you want instead of passing its entry point to each ICU4X call.

That is, is there a use case worth supporting where you'd want to have a different allocator for buffers ICU4X and for other Rust code?

I was mainly thinking about the FFI use case. This would be a way for the FFI client to allocate a host-side buffer and have ICU4X fill it in with the correct character encoding. Otherwise, if we allocate the string on the Rust side, it's the host's responsibility to copy the contents out of the buffer to its own buffer. Maybe that pattern is okay, but I wanted to pose the question.

@zbraniecki
Copy link
Member

Unfortunately, the API I had in mind is nightly-only still.

We could try to check with the Rust team if this API is on the path to stabilization. If it is and we can hope to have it in Rust stable within 6 months, I wouldn't mind designing with that in mind.

@hsivonen
Copy link
Member Author

hsivonen commented Apr 28, 2020

Otherwise, if we allocate the string on the Rust side, it's the host's responsibility to copy the contents out of the buffer to its own buffer. Maybe that pattern is okay, but I wanted to pose the question.

That would still require e.g. a C++ caller to have a custom string class that can adopt a buffer allocated externally to the string class. (As opposed to the string class insisting on managing its buffer strictly on its own.)

If the Rust code is built with the same allocator as the host code, the host code can adopt a buffer that was allocated by ICU4X just as if it came directly from the usual host allocator. If the Rust code is built with a different allocator than the host code, it's still possible to have a custom host string class that adopts the buffer if it can use the ICU4X-provided freeing function and can pass not only the pointer but also the capacity to the freeing function. So I'm skeptical of the value of complicating the ICU4X API by making the allocator selectable on the call sites.

(Rust supports allocators that, unlike C allocators, aren't capable of looking up the allocation size given a pointer previously returned by the allocator and that instead trust the application to pass back the allocation size. Such an allocator actually exists for Wasm.)

We could try to check with the Rust team if this API is on the path to stabilization.

I checked on the issue, but my check was marked resolved as noise, so you need to expand it to see it. However, by checking, I did learn something that I had missed: while into_raw_parts is nightly-only, from_raw_parts is not and one can build a function equivalent to into_raw_parts from pieces already available on release. I'll update the proposal here accordingly.

(The stabilization of into_raw_parts appears to be blocked by two tiny decisions: 1) method or free function and 2) whether the returned pointer is annotated for non-nullness since such annotation is now possible or not annotated for consistency with from_raw_parts and Box::into_raw_parts.)

@hsivonen
Copy link
Member Author

I believe I have now addressed the concerns from last weeks call. Please let me know if I missed something that I should edit in.

sffc
sffc previously approved these changes Apr 30, 2020
docs/string-representation.md Outdated Show resolved Hide resolved
docs/string-representation.md Outdated Show resolved Hide resolved
@hsivonen hsivonen merged commit 7d0cf06 into unicode-org:master May 8, 2020
@hsivonen hsivonen deleted the stringrepr branch May 8, 2020 07:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

String encodings (UTF-8/UTF-16)
7 participants