First iteration of documentation for API boundary string representation #49

hsivonen · 2020-04-20T14:21:53Z

Closes #14.

CLAassistant · 2020-04-20T14:22:14Z

All committers have signed the CLA.

hsivonen · 2020-04-20T14:31:44Z

Hmm. Maybe I should replace the appender function last resort with an ICU4X-allocated repr(C) string object after all.

hsivonen · 2020-04-20T18:26:12Z

Hmm. Or maybe there isn't even a need for a repr(C) object but just the buffer taken from a Rust Vec/String and its length in an outparam. I'll update the doc accordingly.

hsivonen · 2020-04-21T11:37:19Z

Updated with guidance that aims to allocate once in the best case, with example of the case where there's no good worst-case estimation, and with a C++ wrapper sample.

Closes unicode-org#14.

sffc · 2020-04-24T16:56:01Z

Thanks for this! Looks great overall.

One thought to start: It's common that ICU and 402 functions either accept string input (collation, segmentation, parsing) or return string output (formatting), but not necessarily both. Although there are cases of functions that take strings as both input and output (you have case mapping as an example), ICU functions frequently map between strings and machine types (doubles, integers, etc). Therefore, I'm curious how this affects the memory allocation section where you discuss the need to make a guess at the worst-case output string size.

nciric · 2020-04-24T18:20:09Z

One that takes strings and returns them is message format API (but that one is the hard on many fronts, including Wasm re-entry). пет, 24. апр 2020. у 09:56 Shane F. Carr <notifications@github.com> је написао/ла:

…

Thanks for this! Looks great overall. One thought to start: It's common that ICU and 402 functions either accept string input (collation, segmentation, parsing) or return string output (formatting), but not necessarily both. Although there are cases of functions that take strings as both input and output (you have case mapping as an example), ICU functions frequently maps between strings and machine types (doubles, integers, etc). Therefore, I'm curious how this affects the memory allocation section where you discuss the need to make a guess at the worst-case output string size. — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#49 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AA7GEKRTK7Y76TG4ITELVVLROHADBANCNFSM4MMPOR7Q> .

sffc · 2020-04-25T01:01:41Z

I think it would be relevant to this doc to discuss a scenario such as the formatting APIs where the length of the output string is known after some logic is performed. For example, could ICU4X "ask" the host for a buffer of a certain size via a function pointer? Is that conventional?

Here's an example:

type FnAlloc<T> = dyn Fn(usize) -> Box<[T]>;

fn format(_input: f64, alloc: &FnAlloc<u8>) -> Box<[u8]> {
    // Dummy implementation
    let mut result = alloc(3);
    result[0] = 0x41;
    result[1] = 0x42;
    result[2] = 0x43;
    return result;
}

fn alloc_impl(len: usize) -> Box<[u8]> {
    vec![0u8; len].into_boxed_slice()
}

fn main() {
    let result_box = format(1.0, &alloc_impl);
    let result_string = String::from_utf8(result_box.to_vec()).unwrap();
    println!("{}", result_string);
}

https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=eb2248890ed1a746ba3155b7a31c62c9

macchiati · 2020-04-25T04:05:41Z

Often the cost of computing the buffer size ahead of time is a substantial proportion of the cost of filling the buffer, which makes the total cost not very efficient. However, it is often not too much work to compute a maximum size, and report the size actually used afterwards. What about that? Mark

…

On Fri, Apr 24, 2020 at 6:01 PM Shane F. Carr ***@***.***> wrote: I think it would be relevant to this doc to discuss a scenario such as the formatting APIs where the length of the output string is known after some logic is performed. For example, could ICU4X "ask" the host for a buffer of a certain size via a function pointer? Is that conventional? Here's an example: type FnAlloc<T> = dyn Fn(usize) -> Box<[T]>; fn format(_input: f64, alloc: &FnAlloc<u8>) -> Box<[u8]> { // Dummy implementation let mut result = alloc(3); result[0] = 0x41; result[1] = 0x42; result[2] = 0x43; return result; } fn alloc_impl(len: usize) -> Box<[u8]> { vec![0u8; len].into_boxed_slice() } fn main() { let result_box = format(1.0, &alloc_impl); let result_string = String::from_utf8(result_box.to_vec()).unwrap(); println!("{}", result_string); } https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=eb2248890ed1a746ba3155b7a31c62c9 — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#49 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ACJLEMEZL4DBVHYCH4ZHGT3ROIZAHANCNFSM4MMPOR7Q> .

hsivonen · 2020-04-27T07:51:59Z

Although there are cases of functions that take strings as both input and output (you have case mapping as an example), ICU functions frequently map between strings and machine types (doubles, integers, etc).

Per discussion on the call, I'll add clarifying notes that we'd end up using what the document now positions as the last resort case way more often than the document now implies.

For example, could ICU4X "ask" the host for a buffer of a certain size via a function pointer? Is that conventional?

So ICU4X would call the allocator, but what the allocator is would be caller-customatizable instead of being a matter of customizing the global allocator for Rust code at build time.

Is that use case something that would actually be useful compared to either

Using the same allocator for Rust code and the caller, in which case it's OK to use C free() to free buffers returned by ICU4X.
Building all Rust code in the app with the specific allocator that you want instead of passing its entry point to each ICU4X call.

That is, is there a use case worth supporting where you'd want to have a different allocator for buffers ICU4X and for other Rust code?

However, it is often not too much work to compute a maximum size, and report the size actually used afterwards. What about that?

This is what the slice-writing APIs would do. Additionally, over the weekend, I figured out how to use Rust String/Vec<u16> in a way that moves the shrink-to-fit policy to the caller. This would expose both the used length and the capacity to the FFI caller.

I'll add a new changeset to this PR.

hsivonen · 2020-04-27T10:39:50Z

Additionally, over the weekend, I figured out how to use Rust String/Vec<u16> in a way that moves the shrink-to-fit policy to the caller.

Unfortunately, the API I had in mind is nightly-only still.

sffc · 2020-04-27T19:36:03Z

So ICU4X would call the allocator, but what the allocator is would be caller-customatizable instead of being a matter of customizing the global allocator for Rust code at build time.

Is that use case something that would actually be useful compared to either

Using the same allocator for Rust code and the caller, in which case it's OK to use C free() to free buffers returned by ICU4X.

Building all Rust code in the app with the specific allocator that you want instead of passing its entry point to each ICU4X call.

That is, is there a use case worth supporting where you'd want to have a different allocator for buffers ICU4X and for other Rust code?

I was mainly thinking about the FFI use case. This would be a way for the FFI client to allocate a host-side buffer and have ICU4X fill it in with the correct character encoding. Otherwise, if we allocate the string on the Rust side, it's the host's responsibility to copy the contents out of the buffer to its own buffer. Maybe that pattern is okay, but I wanted to pose the question.

zbraniecki · 2020-04-27T21:54:53Z

Unfortunately, the API I had in mind is nightly-only still.

We could try to check with the Rust team if this API is on the path to stabilization. If it is and we can hope to have it in Rust stable within 6 months, I wouldn't mind designing with that in mind.

hsivonen · 2020-04-28T07:20:35Z

Otherwise, if we allocate the string on the Rust side, it's the host's responsibility to copy the contents out of the buffer to its own buffer. Maybe that pattern is okay, but I wanted to pose the question.

That would still require e.g. a C++ caller to have a custom string class that can adopt a buffer allocated externally to the string class. (As opposed to the string class insisting on managing its buffer strictly on its own.)

If the Rust code is built with the same allocator as the host code, the host code can adopt a buffer that was allocated by ICU4X just as if it came directly from the usual host allocator. If the Rust code is built with a different allocator than the host code, it's still possible to have a custom host string class that adopts the buffer if it can use the ICU4X-provided freeing function and can pass not only the pointer but also the capacity to the freeing function. So I'm skeptical of the value of complicating the ICU4X API by making the allocator selectable on the call sites.

(Rust supports allocators that, unlike C allocators, aren't capable of looking up the allocation size given a pointer previously returned by the allocator and that instead trust the application to pass back the allocation size. Such an allocator actually exists for Wasm.)

We could try to check with the Rust team if this API is on the path to stabilization.

I checked on the issue, but my check was marked resolved as noise, so you need to expand it to see it. However, by checking, I did learn something that I had missed: while into_raw_parts is nightly-only, from_raw_parts is not and one can build a function equivalent to into_raw_parts from pieces already available on release. I'll update the proposal here accordingly.

(The stabilization of into_raw_parts appears to be blocked by two tiny decisions: 1) method or free function and 2) whether the returned pointer is annotated for non-nullness since such annotation is now possible or not annotated for consistency with from_raw_parts and Box::into_raw_parts.)

… UTF-16 to "must"

hsivonen · 2020-04-28T07:56:08Z

I believe I have now addressed the concerns from last weeks call. Please let me know if I missed something that I should edit in.

docs/string-representation.md

hsivonen force-pushed the stringrepr branch 4 times, most recently from 3d3c5f7 to 0ea2195 Compare April 20, 2020 14:28

hsivonen force-pushed the stringrepr branch from 0ea2195 to 57b0664 Compare April 21, 2020 11:36

hsivonen force-pushed the stringrepr branch from 57b0664 to 05aa454 Compare April 21, 2020 11:44

First iteration of documentation for API boundary string representation

ad54efd

Closes unicode-org#14.

hsivonen force-pushed the stringrepr branch from 05aa454 to ad54efd Compare April 24, 2020 10:34

hsivonen added 3 commits April 28, 2020 10:44

Move string_to_fit policy to the caller

9dbd71b

Upgrade providing version of operation that takes potentially-invalid…

ac4d78f

… UTF-16 to "must"

Point out how common different cases are expected to be

f0833c5

emilio reviewed Apr 30, 2020

View reviewed changes

docs/string-representation.md Outdated Show resolved Hide resolved

emilio reviewed Apr 30, 2020

View reviewed changes

docs/string-representation.md Outdated Show resolved Hide resolved

sffc previously approved these changes Apr 30, 2020

View reviewed changes

emilio reviewed Apr 30, 2020

View reviewed changes

docs/string-representation.md Outdated Show resolved Hide resolved

docs/string-representation.md Outdated Show resolved Hide resolved

Address Emilio's review comments

0513847

hsivonen dismissed sffc’s stale review via 0513847 April 30, 2020 18:43

hsivonen requested a review from sffc May 7, 2020 18:28

sffc approved these changes May 8, 2020

View reviewed changes

hsivonen merged commit 7d0cf06 into unicode-org:master May 8, 2020

hsivonen deleted the stringrepr branch May 8, 2020 07:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

First iteration of documentation for API boundary string representation #49

First iteration of documentation for API boundary string representation #49

hsivonen commented Apr 20, 2020

CLAassistant commented Apr 20, 2020 •

edited

Loading

hsivonen commented Apr 20, 2020

hsivonen commented Apr 20, 2020

hsivonen commented Apr 21, 2020

sffc commented Apr 24, 2020 •

edited

Loading

nciric commented Apr 24, 2020 via email

sffc commented Apr 25, 2020

macchiati commented Apr 25, 2020 via email

hsivonen commented Apr 27, 2020

hsivonen commented Apr 27, 2020

sffc commented Apr 27, 2020

zbraniecki commented Apr 27, 2020

hsivonen commented Apr 28, 2020 •

edited

Loading

hsivonen commented Apr 28, 2020

First iteration of documentation for API boundary string representation #49

First iteration of documentation for API boundary string representation #49

Conversation

hsivonen commented Apr 20, 2020

CLAassistant commented Apr 20, 2020 • edited Loading

hsivonen commented Apr 20, 2020

hsivonen commented Apr 20, 2020

hsivonen commented Apr 21, 2020

sffc commented Apr 24, 2020 • edited Loading

nciric commented Apr 24, 2020 via email

sffc commented Apr 25, 2020

macchiati commented Apr 25, 2020 via email

hsivonen commented Apr 27, 2020

hsivonen commented Apr 27, 2020

sffc commented Apr 27, 2020

zbraniecki commented Apr 27, 2020

hsivonen commented Apr 28, 2020 • edited Loading

hsivonen commented Apr 28, 2020

CLAassistant commented Apr 20, 2020 •

edited

Loading

sffc commented Apr 24, 2020 •

edited

Loading

hsivonen commented Apr 28, 2020 •

edited

Loading