Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migrate from &str to &[u8] in ixdtf #4918

Merged
merged 17 commits into from
Jun 4, 2024
Merged

Conversation

nekevss
Copy link
Contributor

@nekevss nekevss commented May 20, 2024

This is related to tc39/proposal-temporal/issues/2843.

It switches ixdtf's parsing from &str and char to &[u8].

This current version does remove direct parsing for &str, but as core::str supports as_bytes(), there shouldn't be too many incompatibility issues with &str (unless it would be an issue with the FFI layer).

The core changes are to parser/mod.rs's peek_n and slice method along with the grammar changes. There were also some tests added for bad byte sequences, but if there is another case that may need to be tested to ensure we prevent any invalid UTF-8 sequences, feedback would be welcomed.

@nekevss nekevss requested a review from sffc May 20, 2024 14:00
@sffc
Copy link
Member

sffc commented May 20, 2024

Before I look too much into your UTF-8 math, did you consider using utf8_iter which we already have as a utlity dependency (and authored by contributor @hsivonen)?

https://docs.rs/utf8_iter/latest/utf8_iter/

My expectation is that you could largely keep the code the same, char-based, and basically just swap str::chars with a UTF-8 iterator.

@nekevss
Copy link
Contributor Author

nekevss commented May 20, 2024

I did not actually. I'll take a glance at the crate. It may be a better option to use.

That being said, assuming the logic here is correct for all cases, this implementation MAY be more performant (depending on compiler optimizations). As it is essentially using the grammar tests to enforce UTF-8 validation at the same time.

Granted, I almost want more invalid byte sequence tests to ensure the logic is correct.

@sffc
Copy link
Member

sffc commented May 21, 2024

In general I see a lot of replacing things like ch == '[' with ch == [b'['] in your code, which may or may not be an improvement because comparing a 4-byte register with another 4-byte register is extremely inexpensive whereas comparing slices requires checking pointer, length, and contents, which the compiler might be able to optimize if it can figure out that the slice is short, but it still seems better to stick with scalar values where possible.

@nekevss
Copy link
Contributor Author

nekevss commented May 24, 2024

I think this is about ready for review if you're fine with not using utf8_iter.

I was taking a look at utf8_iter, and I'm a little split about using it as far as I'm not sure it's is super compatible with the current cursor, and may require a bit more of a redesign of Cursor as a whole, which was partially why I hadn't used an Iterator prior.

That being said, utf8_iter has fuzzing setup, which is much more robust than any testing on this implementation, so if you'd prefer using ut8_iter, I can definitely try implementing it on a different branch and see how well it integrates 😄

@nekevss nekevss marked this pull request as ready for review May 24, 2024 18:22
@nekevss nekevss requested a review from a team as a code owner May 24, 2024 18:22
Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would still like to see this use utf8_iter for the following reasons

  1. The crate basically does exactly what you're trying to do in decode_utf8_bytes, and we should try to leverage existing code
  2. As you've noted, utf8_iter is much more well tested
  3. Since this is unsafe code, the bar for thorough testing is much higher, and currently I'm not convinced we have the level of test coverage for edge cases in this function that I would like to see for unsafe code review
  4. Currently ixdtf is completely safe. Any unsafe code, even one line, changes the crate from being safe to requiring an unsafe review, which can get in the way of crate adoption in resource-limited enterprises who require safe code.

utils/ixdtf/src/parsers/mod.rs Outdated Show resolved Hide resolved
utils/ixdtf/src/parsers/mod.rs Outdated Show resolved Hide resolved
@sffc
Copy link
Member

sffc commented May 24, 2024

Basically I think that adding utf8_iter is just a matter of changing your peek_with_info function to invoke the char iterator in that crate. You don't need integration any deeper than that.

@nekevss
Copy link
Contributor Author

nekevss commented May 24, 2024

Updated! One downside is the loss the UTF8Encoding error. But if that feature is wanted, there might be an argument to add a ErrorReportingUtf8CharsIndices to utf8_iter.

@nekevss nekevss requested a review from sffc May 24, 2024 23:14
/// Creates a new `IXDTFParser` from a provided `&str`.
pub fn new(value: &'a str) -> Self {
/// Creates a new `IxdtfParser` from a provided `&str`.
pub fn new(value: &'a [u8]) -> Self {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the talk during the call today. Was there any preference here around using new vs. from_bytes?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed; from_utf8 might be the correct name here.

You can also add from_str (concrete associated function, not FromStr impl) and use it in the docs so you don't need to call .as_bytes() everywhere.

/// Peeks the value at `n` and returns the char with its byte length.
fn peek_with_info(&self, n: usize) -> Option<(usize, char)> {
let mut chars =
Utf8CharIndices::new(self.source.get(self.pos..self.source.len()).unwrap_or(&[]));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mentioned this on a previous comment, but there's no erroring version of Utf8CharIndices currently. Is there any preference towards erroring on invalid utf8?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the idea is that ill-formed UTF-8, even if it resolves to the replacement character, won't be valid IXDTF syntax since the replacement character is not valid in any part of the grammar.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep! That's the current idea. The replacement character will fail the grammar test currently, but that's a little different than flagging the character as invalid.

For instance, say there is an invalid character in a year. The error thrown currently would be ParserError::DateYear vs. potentially throwing a ParserError::Utf8CharError. Not sure if there'd be a preference there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, hmm.

Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the slow review!

pub(crate) const fn is_hyphen(ch: char) -> bool {
pub(crate) fn is_hyphen(ch: char) -> bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Observation: you can make these const again (but there's not really a need to)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I noticed those could be flipped back to const, but I figured there wasn't much of a gain either way (might do it though on an upcoming commit to lower the diff)

/// Creates a new `IXDTFParser` from a provided `&str`.
pub fn new(value: &'a str) -> Self {
/// Creates a new `IxdtfParser` from a provided `&str`.
pub fn new(value: &'a [u8]) -> Self {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed; from_utf8 might be the correct name here.

You can also add from_str (concrete associated function, not FromStr impl) and use it in the docs so you don't need to call .as_bytes() everywhere.

return Ok(None);
};
Ok(Some(digit as u8))
// Safety: Char digit with a radix of ten must be in the range of a u8
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: It's not a safety comment since there is no unsafe

Suggested change
// Safety: Char digit with a radix of ten must be in the range of a u8
// Note: Char digit with a radix of ten must be in the range of a u8

@sffc sffc merged commit 3a64209 into unicode-org:main Jun 4, 2024
28 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants