-
Notifications
You must be signed in to change notification settings - Fork 174
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
First iteration of documentation for API boundary string representation #49
Conversation
3d3c5f7
to
0ea2195
Compare
Hmm. Maybe I should replace the appender function last resort with an ICU4X-allocated |
Hmm. Or maybe there isn't even a need for a |
Updated with guidance that aims to allocate once in the best case, with example of the case where there's no good worst-case estimation, and with a C++ wrapper sample. |
Thanks for this! Looks great overall. One thought to start: It's common that ICU and 402 functions either accept string input (collation, segmentation, parsing) or return string output (formatting), but not necessarily both. Although there are cases of functions that take strings as both input and output (you have case mapping as an example), ICU functions frequently map between strings and machine types (doubles, integers, etc). Therefore, I'm curious how this affects the memory allocation section where you discuss the need to make a guess at the worst-case output string size. |
One that takes strings and returns them is message format API (but that one
is the hard on many fronts, including Wasm re-entry).
пет, 24. апр 2020. у 09:56 Shane F. Carr <notifications@github.com> је
написао/ла:
… Thanks for this! Looks great overall.
One thought to start: It's common that ICU and 402 functions either accept
string input (collation, segmentation, parsing) or return string output
(formatting), but not necessarily both. Although there are cases of
functions that take strings as both input and output (you have case mapping
as an example), ICU functions frequently maps between strings and machine
types (doubles, integers, etc). Therefore, I'm curious how this affects the
memory allocation section where you discuss the need to make a guess at the
worst-case output string size.
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#49 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AA7GEKRTK7Y76TG4ITELVVLROHADBANCNFSM4MMPOR7Q>
.
|
I think it would be relevant to this doc to discuss a scenario such as the formatting APIs where the length of the output string is known after some logic is performed. For example, could ICU4X "ask" the host for a buffer of a certain size via a function pointer? Is that conventional? Here's an example: type FnAlloc<T> = dyn Fn(usize) -> Box<[T]>;
fn format(_input: f64, alloc: &FnAlloc<u8>) -> Box<[u8]> {
// Dummy implementation
let mut result = alloc(3);
result[0] = 0x41;
result[1] = 0x42;
result[2] = 0x43;
return result;
}
fn alloc_impl(len: usize) -> Box<[u8]> {
vec![0u8; len].into_boxed_slice()
}
fn main() {
let result_box = format(1.0, &alloc_impl);
let result_string = String::from_utf8(result_box.to_vec()).unwrap();
println!("{}", result_string);
} |
Often the cost of computing the buffer size ahead of time is a substantial
proportion of the cost of filling the buffer, which makes the total cost
not very efficient. However, it is often not too much work to compute a
maximum size, and report the size actually used afterwards. What about that?
Mark
…On Fri, Apr 24, 2020 at 6:01 PM Shane F. Carr ***@***.***> wrote:
I think it would be relevant to this doc to discuss a scenario such as the
formatting APIs where the length of the output string is known after some
logic is performed. For example, could ICU4X "ask" the host for a buffer of
a certain size via a function pointer? Is that conventional?
Here's an example:
type FnAlloc<T> = dyn Fn(usize) -> Box<[T]>;
fn format(_input: f64, alloc: &FnAlloc<u8>) -> Box<[u8]> {
// Dummy implementation
let mut result = alloc(3);
result[0] = 0x41;
result[1] = 0x42;
result[2] = 0x43;
return result;
}
fn alloc_impl(len: usize) -> Box<[u8]> {
vec![0u8; len].into_boxed_slice()
}
fn main() {
let result_box = format(1.0, &alloc_impl);
let result_string = String::from_utf8(result_box.to_vec()).unwrap();
println!("{}", result_string);
}
https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=eb2248890ed1a746ba3155b7a31c62c9
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#49 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/ACJLEMEZL4DBVHYCH4ZHGT3ROIZAHANCNFSM4MMPOR7Q>
.
|
Per discussion on the call, I'll add clarifying notes that we'd end up using what the document now positions as the last resort case way more often than the document now implies.
So ICU4X would call the allocator, but what the allocator is would be caller-customatizable instead of being a matter of customizing the global allocator for Rust code at build time. Is that use case something that would actually be useful compared to either
That is, is there a use case worth supporting where you'd want to have a different allocator for buffers ICU4X and for other Rust code?
This is what the slice-writing APIs would do. Additionally, over the weekend, I figured out how to use Rust I'll add a new changeset to this PR. |
Unfortunately, the API I had in mind is nightly-only still. |
I was mainly thinking about the FFI use case. This would be a way for the FFI client to allocate a host-side buffer and have ICU4X fill it in with the correct character encoding. Otherwise, if we allocate the string on the Rust side, it's the host's responsibility to copy the contents out of the buffer to its own buffer. Maybe that pattern is okay, but I wanted to pose the question. |
We could try to check with the Rust team if this API is on the path to stabilization. If it is and we can hope to have it in Rust stable within 6 months, I wouldn't mind designing with that in mind. |
That would still require e.g. a C++ caller to have a custom string class that can adopt a buffer allocated externally to the string class. (As opposed to the string class insisting on managing its buffer strictly on its own.) If the Rust code is built with the same allocator as the host code, the host code can adopt a buffer that was allocated by ICU4X just as if it came directly from the usual host allocator. If the Rust code is built with a different allocator than the host code, it's still possible to have a custom host string class that adopts the buffer if it can use the ICU4X-provided freeing function and can pass not only the pointer but also the capacity to the freeing function. So I'm skeptical of the value of complicating the ICU4X API by making the allocator selectable on the call sites. (Rust supports allocators that, unlike C allocators, aren't capable of looking up the allocation size given a pointer previously returned by the allocator and that instead trust the application to pass back the allocation size. Such an allocator actually exists for Wasm.)
I checked on the issue, but my check was marked resolved as noise, so you need to expand it to see it. However, by checking, I did learn something that I had missed: while (The stabilization of |
I believe I have now addressed the concerns from last weeks call. Please let me know if I missed something that I should edit in. |
Closes #14.