Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add writeable::write_or_ref and use it in icu_locale_core #5738

Merged
merged 21 commits into from
Oct 29, 2024

Conversation

sffc
Copy link
Member

@sffc sffc commented Oct 25, 2024

Follow-up to #5727
Depends on #5737
Replaces #5733
See #2748

I am adding an additional utility function to the writeable crate with a similar architecture to writeable::cmp_bytes. This one, writeable::write_or_ref, makes certain normalization operations more efficient.

@sffc
Copy link
Member Author

sffc commented Oct 25, 2024

Please note, I wrote unsafe code in 68c2b97 and then re-wrote it in d6485c8.

This PR is small enough that reviewing it at once is probably more efficient.

@sffc
Copy link
Member Author

sffc commented Oct 25, 2024

/// Cow::Borrowed("Hello, Alice!")
/// ));
///
/// // Junk at the end is ignored:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// // Junk at the end is ignored:
/// // Borrowing can use a prefix:

/// Cow::Owned(s) if s == "this is uppercase"
/// ));
/// ```
pub fn write_or_ref<'a>(writeable: &impl Writeable, reference_bytes: &'a [u8]) -> Cow<'a, str> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not happy with this name. Usually methods called write write to one of the arguments, and don't do any allocation themselves, like this does. I think write_to_string_or_borrow would be a more accurate name, and would align more with write_to_string, which is meant to use it.

Comment on lines 58 to 75
let owned: &mut String = match self {
SliceOrString::Slice(slice) => {
if slice.try_push(other) {
return Ok(());
}
// We failed to match. Convert to owned, put it in the field,
// and get it out again.
let valid_str = slice.validated_as_str();
let owned = String::from(valid_str);
*self = SliceOrString::String(owned);
let SliceOrString::String(owned) = self else {
unreachable!()
};
owned
}
SliceOrString::String(owned) => owned,
};
owned.write_str(other)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is jumping through a lot of hoops to not duplicate line 75. I think the code would be a lot easier if you did that:

Suggested change
let owned: &mut String = match self {
SliceOrString::Slice(slice) => {
if slice.try_push(other) {
return Ok(());
}
// We failed to match. Convert to owned, put it in the field,
// and get it out again.
let valid_str = slice.validated_as_str();
let owned = String::from(valid_str);
*self = SliceOrString::String(owned);
let SliceOrString::String(owned) = self else {
unreachable!()
};
owned
}
SliceOrString::String(owned) => owned,
};
owned.write_str(other)
match self {
SliceOrString::Slice(slice) => {
if !slice.try_push(other) {
// We failed to match. Convert to owned.
let valid_str = slice.validated_as_str();
let owned = format!("{valid_str}{other});
*self = SliceOrString::String(owned);
Ok(())
}
SliceOrString::String(owned) => owned.write_str(other),
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's a lot of hoops: it is a standard "base case" coding style. That said, I think your code is more readable, except I don't like using format!. I'll push up something similar.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this Rust program:

#[inline(never)]
fn string_from_pieces_format(a: &str, b: &str) -> String {
    std::format!("{a}{b}")
}

#[inline(never)]
fn string_from_pieces_manual(a: &str, b: &str) -> String {
    let mut s = String::with_capacity(a.len() + b.len());
    s.push_str(a);
    s.push_str(b);
    s
}

fn main() {
    let s1 = string_from_pieces_format("hello", "world");
    let s2 = string_from_pieces_manual("hello", "world");
    println!("{s1} {s2}");
}

Here is the ASM (https://play.rust-lang.org/?version=stable&mode=release&edition=2021&gist=dff4d0ae823abc03d88fad4b84b56a7d):

std::rt::lang_start:
	pushq	%rax
	movl	%ecx, %r8d
	movq	%rdx, %rcx
	movq	%rsi, %rdx
	movq	%rdi, (%rsp)
	leaq	.L__unnamed_1(%rip), %rsi
	movq	%rsp, %rdi
	callq	*std::rt::lang_start_internal@GOTPCREL(%rip)
	popq	%rcx
	retq

std::rt::lang_start::{{closure}}:
	pushq	%rax
	movq	(%rdi), %rdi
	callq	std::sys::backtrace::__rust_begin_short_backtrace
	xorl	%eax, %eax
	popq	%rcx
	retq

std::sys::backtrace::__rust_begin_short_backtrace:
	pushq	%rax
	callq	*%rdi
	#APP
	#NO_APP
	popq	%rax
	retq

<&T as core::fmt::Display>::fmt:
	movq	%rsi, %rdx
	movq	(%rdi), %rax
	movq	8(%rdi), %rsi
	movq	%rax, %rdi
	jmpq	*<str as core::fmt::Display>::fmt@GOTPCREL(%rip)

core::ops::function::FnOnce::call_once{{vtable.shim}}:
	pushq	%rax
	movq	(%rdi), %rdi
	callq	std::sys::backtrace::__rust_begin_short_backtrace
	xorl	%eax, %eax
	popq	%rcx
	retq

<alloc::string::String as core::fmt::Display>::fmt:
	movq	%rsi, %rdx
	movq	8(%rdi), %rax
	movq	16(%rdi), %rsi
	movq	%rax, %rdi
	jmpq	*<str as core::fmt::Display>::fmt@GOTPCREL(%rip)

playground::string_from_pieces_format:
	subq	$120, %rsp
	leaq	.L__unnamed_2(%rip), %rax
	movq	%rax, 8(%rsp)
	movq	$5, 16(%rsp)
	leaq	.L__unnamed_3(%rip), %rax
	movq	%rax, 24(%rsp)
	movq	$5, 32(%rsp)
	leaq	8(%rsp), %rax
	movq	%rax, 40(%rsp)
	leaq	<&T as core::fmt::Display>::fmt(%rip), %rax
	movq	%rax, 48(%rsp)
	leaq	24(%rsp), %rcx
	movq	%rcx, 56(%rsp)
	movq	%rax, 64(%rsp)
	leaq	.L__unnamed_4(%rip), %rax
	movq	%rax, 72(%rsp)
	movq	$2, 80(%rsp)
	movq	$0, 104(%rsp)
	leaq	40(%rsp), %rax
	movq	%rax, 88(%rsp)
	movq	$2, 96(%rsp)
	leaq	72(%rsp), %rsi
	callq	*alloc::fmt::format::format_inner@GOTPCREL(%rip)
	addq	$120, %rsp
	retq

playground::string_from_pieces_manual:
	pushq	%rbx
	movq	%rdi, %rbx
	movq	__rust_no_alloc_shim_is_unstable@GOTPCREL(%rip), %rax
	movzbl	(%rax), %eax
	movl	$10, %edi
	movl	$1, %esi
	callq	*__rust_alloc@GOTPCREL(%rip)
	testq	%rax, %rax
	je	.LBB7_2
	movb	$111, 4(%rax)
	movl	$1819043176, (%rax)
	movl	$1819438967, 5(%rax)
	movb	$100, 9(%rax)
	movq	$10, (%rbx)
	movq	%rax, 8(%rbx)
	movq	$10, 16(%rbx)
	popq	%rbx
	retq

.LBB7_2:
	movl	$1, %edi
	movl	$10, %esi
	callq	*alloc::raw_vec::handle_error@GOTPCREL(%rip)

playground::main:
	pushq	%r14
	pushq	%rbx
	subq	$136, %rsp
	leaq	8(%rsp), %rbx
	movq	%rbx, %rdi
	callq	playground::string_from_pieces_format
	leaq	32(%rsp), %r14
	movq	%r14, %rdi
	callq	playground::string_from_pieces_manual
	movq	%rbx, 56(%rsp)
	leaq	<alloc::string::String as core::fmt::Display>::fmt(%rip), %rax
	movq	%rax, 64(%rsp)
	movq	%r14, 72(%rsp)
	movq	%rax, 80(%rsp)
	leaq	.L__unnamed_5(%rip), %rax
	movq	%rax, 88(%rsp)
	movq	$3, 96(%rsp)
	movq	$0, 120(%rsp)
	leaq	56(%rsp), %rax
	movq	%rax, 104(%rsp)
	movq	$2, 112(%rsp)
	leaq	88(%rsp), %rdi
	callq	*std::io::stdio::_print@GOTPCREL(%rip)
	movq	32(%rsp), %rsi
	testq	%rsi, %rsi
	je	.LBB8_4
	movq	40(%rsp), %rdi
	movl	$1, %edx
	callq	*__rust_dealloc@GOTPCREL(%rip)

.LBB8_4:
	movq	8(%rsp), %rsi
	testq	%rsi, %rsi
	je	.LBB8_6
	movq	16(%rsp), %rdi
	movl	$1, %edx
	callq	*__rust_dealloc@GOTPCREL(%rip)

.LBB8_6:
	addq	$136, %rsp
	popq	%rbx
	popq	%r14
	retq
	movq	%rax, %rbx
	movq	32(%rsp), %rsi
	testq	%rsi, %rsi
	je	.LBB8_8
	movq	40(%rsp), %rdi
	movl	$1, %edx
	callq	*__rust_dealloc@GOTPCREL(%rip)
	jmp	.LBB8_8
	movq	%rax, %rbx

.LBB8_8:
	movq	8(%rsp), %rsi
	testq	%rsi, %rsi
	je	.LBB8_10
	movq	16(%rsp), %rdi
	movl	$1, %edx
	callq	*__rust_dealloc@GOTPCREL(%rip)

.LBB8_10:
	movq	%rbx, %rdi
	callq	_Unwind_Resume@PLT

main:
	pushq	%rax
	movq	%rsi, %rcx
	movslq	%edi, %rdx
	leaq	playground::main(%rip), %rax
	movq	%rax, (%rsp)
	leaq	.L__unnamed_1(%rip), %rsi
	movq	%rsp, %rdi
	xorl	%r8d, %r8d
	callq	*std::rt::lang_start_internal@GOTPCREL(%rip)
	popq	%rcx
	retq

.L__unnamed_1:
	.asciz	"\000\000\000\000\000\000\000\000\b\000\000\000\000\000\000\000\b\000\000\000\000\000\000"
	.quad	core::ops::function::FnOnce::call_once{{vtable.shim}}
	.quad	std::rt::lang_start::{{closure}}
	.quad	std::rt::lang_start::{{closure}}

.L__unnamed_4:
	.quad	1
	.zero	8
	.quad	1
	.zero	8

.L__unnamed_2:
	.ascii	"hello"

.L__unnamed_3:
	.ascii	"world"

.L__unnamed_6:
	.byte	32

.L__unnamed_7:
	.byte	10

.L__unnamed_5:
	.quad	1
	.zero	8
	.quad	.L__unnamed_6
	.asciz	"\001\000\000\000\000\000\000"
	.quad	.L__unnamed_7
	.asciz	"\001\000\000\000\000\000\000"

So it appears avoiding format! is worthwhile.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cannot parse that atm. Just make sure to reserve capacity once.

Copy link
Member

@Manishearth Manishearth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall.

No strong opinion on naming, weakly agree that write_or_ref isn't ideal, but don't have antyhing better.

Needs more structured safety comments but I tried to help with that.

It's unfortunate this needs unsafe at all, the str version of this would be purely safe. But we do want Potentially versions.


impl<'a> PartiallyValidatedUtf8<'a> {
fn new(slice: &'a [u8]) -> Self {
Self { slice, offset: 0 }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: safety comment:

/// Safety: Field invariants maintained here trivially:
///   1. The offset 0 is ≤ all possible lengths of slice
///   2. The slice contains nothing up to the offset zero

We typically want similar safety comments for whenever one of the fields is mutated.

// Safety Invariant: `valid_str` is valid UTF-8, and we are
// incrementing the offset to cover bytes equal to `valid_str`
self.offset += valid_str.len();
true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say "Safety: Field invariants maintained here by..." and use a list like above

/// Return the validated portion as `&str`.
fn validated_as_str(&self) -> &'a str {
debug_assert!(self.offset <= self.slice.len());
// Safety: self.offset is a valid end index in a range
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I'd suffix both of these comments with (from field invariant), or somethign similar, basically indicating where the invariant comes from

}

/// Writes the contents of a `Writeable` to a string, returning a reference
/// to a slice if it matches, and allocating a string otherwise.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// to a slice if it matches, and allocating a string otherwise.
/// to a slice if it matches the provided reference bytes, and allocating a string otherwise.

@sffc
Copy link
Member Author

sffc commented Oct 28, 2024

I think I addressed all comments other than the function name. Happy to entertain suggestions.

Bikeshed:

  • writeable::write_or_ref (current)
  • writeable::write_to_string_or_borrow (@robertbastian suggestion)
  • writeable::to_string_or_borrow
  • writeable::to_string_or_ref
  • writeable::write_or_borrow
  • writeable::or_borrow
  • writeable::or_ref

A problem with write_to_string_or_borrow is that write_to_string is already defined to borrow when able. I've never been super happy with the name of that trait function because of this.

@sffc
Copy link
Member Author

sffc commented Oct 28, 2024

It's unfortunate this needs unsafe at all, the str version of this would be purely safe. But we do want Potentially versions.

I would argue that the &[u8] version is superior to an all-&str version not only because of PotentialUtf8 but also because (1) it could be used when you have some other blob data you can reference, like postcard or something, and (2) the &str functions are a little annoying because they all return Options due to code point boundaries, so I think we'd need to have an unwrap or get_unchecked anyway.

@robertbastian
Copy link
Member

A problem with write_to_string_or_borrow is that write_to_string is already defined to borrow when able. I've never been super happy with the name of that trait function because of this.

Yeah, write_to_string is defined to borrow when able, and writeable::write_to_string_or_borrow is how that can be implemented.

Copy link

dpulls bot commented Oct 28, 2024

🎉 All dependencies have been resolved !

@sffc sffc requested a review from Manishearth October 28, 2024 21:47
Manishearth
Manishearth previously approved these changes Oct 29, 2024
@sffc
Copy link
Member Author

sffc commented Oct 29, 2024

I'll wait to merge this until we discuss the name.

@sffc sffc added the discuss-priority Discuss at the next ICU4X meeting label Oct 29, 2024
@sffc
Copy link
Member Author

sffc commented Oct 29, 2024

Naming choice: writeable::to_string_or_borrow

LGTM: @Manishearth @robertbastian (@sffc)

@sffc sffc merged commit 8344161 into unicode-org:main Oct 29, 2024
28 checks passed
@sffc sffc deleted the write-or-ref branch October 29, 2024 22:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discuss-priority Discuss at the next ICU4X meeting
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants