Migrate from `&str` to `&[u8]` in `ixdtf` #4918

nekevss · 2024-05-20T12:19:04Z

This is related to tc39/proposal-temporal/issues/2843.

It switches ixdtf's parsing from &str and char to &[u8].

This version does remove direct parsing of &str, but since str provides as_bytes(), there shouldn't be too many incompatibility issues for &str callers (unless it would be an issue with the FFI layer).
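
As a minimal sketch of why existing &str callers should be largely unaffected: `str::as_bytes` borrows the string's underlying bytes without copying, so a byte-slice entry point is reachable from a string slice with one call. The function names below are hypothetical stand-ins for illustration, not part of ixdtf.

```rust
/// Hypothetical stand-in for a byte-oriented entry point. It only counts
/// leading ASCII digits so that the example stays self-contained.
fn leading_digits(source: &[u8]) -> usize {
    source.iter().take_while(|byte| byte.is_ascii_digit()).count()
}

/// A `&str` caller adapts by borrowing the string's bytes; no copy is made.
fn leading_digits_str(source: &str) -> usize {
    leading_digits(source.as_bytes())
}
```

Calling `leading_digits_str("2024-05-20")` and `leading_digits(b"2024-05-20")` goes through the same code path; only the byte-slice version can also be handed input that is not valid UTF-8.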

The core changes are to parser/mod.rs's peek_n and slice methods, along with the grammar changes. There were also some tests added for bad byte sequences, but if there is another case that may need to be tested to ensure we prevent any invalid UTF-8 sequences, feedback would be welcomed.

sffc · 2024-05-20T16:39:31Z

Before I look too much into your UTF-8 math, did you consider using utf8_iter, which we already have as a utility dependency (and which was authored by contributor @hsivonen)?

https://docs.rs/utf8_iter/latest/utf8_iter/

My expectation is that you could largely keep the code the same, char-based, and basically just swap str::chars with a UTF-8 iterator.
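
To illustrate the idea of keeping the grammar char-based and only swapping the source of chars, here is a rough sketch. It deliberately uses the standard library's `String::from_utf8_lossy` as a stand-in for a UTF-8 byte iterator, which is an assumption made to keep the example dependency-free; it may allocate on ill-formed input, which a dedicated iterator would not need to do. All function names are hypothetical.

```rust
/// Char-based grammar check; unchanged regardless of where the chars come from.
fn is_annotation_open(ch: char) -> bool {
    ch == '['
}

/// Counts annotation-open characters produced by any char iterator.
fn count_open<I: Iterator<Item = char>>(chars: I) -> usize {
    chars.filter(|ch| is_annotation_open(*ch)).count()
}

/// `&str` input: uses the standard `str::chars` iterator.
fn count_open_str(source: &str) -> usize {
    count_open(source.chars())
}

/// `&[u8]` input: std-only stand-in for a UTF-8 byte iterator.
/// Ill-formed sequences decode to U+FFFD instead of failing.
fn count_open_bytes(source: &[u8]) -> usize {
    count_open(String::from_utf8_lossy(source).chars())
}
```

The grammar function `is_annotation_open` is shared by both entry points, which is the property the comment above is pointing at.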

nekevss · 2024-05-20T17:09:32Z

I did not actually. I'll take a glance at the crate. It may be a better option to use.

That being said, assuming the logic here is correct for all cases, this implementation MAY be more performant (depending on compiler optimizations), as it essentially uses the grammar tests to enforce UTF-8 validation at the same time.

Granted, I almost want more invalid byte sequence tests to ensure the logic is correct.

sffc · 2024-05-21T00:50:59Z

In general I see a lot of replacing things like ch == '[' with ch == [b'['] in your code, which may or may not be an improvement. Comparing a 4-byte register with another 4-byte register is extremely inexpensive, whereas comparing slices requires checking pointer, length, and contents. The compiler might be able to optimize that if it can figure out that the slice is short, but it still seems better to stick with scalar values where possible.
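
For reference, the three comparison shapes under discussion look roughly like this. The function names are hypothetical, and which form the compiler optimizes best would need to be measured rather than assumed.

```rust
/// Scalar comparison on a decoded `char`.
fn is_open_char(ch: char) -> bool {
    ch == '['
}

/// Scalar comparison on a single byte; sufficient for ASCII-only symbols.
fn is_open_byte(byte: u8) -> bool {
    byte == b'['
}

/// Slice comparison: the lengths must match before the contents are compared.
fn is_open_slice(bytes: &[u8]) -> bool {
    bytes == [b'[']
}
```

Note that the slice form also has to answer for inputs of the wrong length, such as an empty slice or `b"[["`, which the scalar forms never see.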

nekevss · 2024-05-24T18:19:04Z

I think this is about ready for review if you're fine with not using utf8_iter.

I was taking a look at utf8_iter, and I'm a little split about using it: I'm not sure it's very compatible with the current cursor, and it may require more of a redesign of Cursor as a whole, which is partially why I hadn't used an Iterator before.

That being said, utf8_iter has fuzzing set up, which is much more robust than any testing on this implementation, so if you'd prefer using utf8_iter, I can definitely try implementing it on a different branch and see how well it integrates 😄

sffc

I would still like to see this use utf8_iter, for the following reasons:

- The crate does essentially what you're trying to do in decode_utf8_bytes, and we should try to leverage existing code.
- As you've noted, utf8_iter is much better tested.
- Since this is unsafe code, the bar for thorough testing is much higher, and I'm not currently convinced we have the level of edge-case test coverage in this function that I would like to see for unsafe code review.
- Currently ixdtf is completely safe. Any unsafe code, even one line, changes the crate from being safe to requiring an unsafe review, which can get in the way of crate adoption in resource-limited enterprises that require safe code.

utils/ixdtf/src/parsers/mod.rs

sffc · 2024-05-24T18:45:27Z

Basically I think that adding utf8_iter is just a matter of changing your peek_with_info function to invoke the char iterator in that crate. You don't need integration any deeper than that.

nekevss · 2024-05-24T23:13:59Z

Updated! One downside is the loss of the UTF8Encoding error. But if that feature is wanted, there might be an argument for adding an ErrorReportingUtf8CharsIndices to utf8_iter.

nekevss · 2024-05-30T20:38:47Z

utils/ixdtf/src/parsers/mod.rs

-    /// Creates a new `IXDTFParser` from a provided `&str`.
-    pub fn new(value: &'a str) -> Self {
+    /// Creates a new `IxdtfParser` from a provided `&str`.
+    pub fn new(value: &'a [u8]) -> Self {


Given the talk during the call today, was there any preference here between new and from_bytes?

Indeed; from_utf8 might be the correct name here.

You can also add from_str (concrete associated function, not FromStr impl) and use it in the docs so you don't need to call .as_bytes() everywhere.
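
A minimal sketch of the constructor pair being suggested, using a hypothetical `SketchParser` type rather than the crate's real parser: `from_utf8` takes bytes directly, and `from_str` is an inherent associated function that forwards to it.

```rust
/// Hypothetical parser type used only for this sketch.
struct SketchParser<'a> {
    source: &'a [u8],
}

impl<'a> SketchParser<'a> {
    /// Primary constructor: takes UTF-8 bytes directly.
    fn from_utf8(source: &'a [u8]) -> Self {
        Self { source }
    }

    /// Convenience constructor so callers and docs can pass `&str`
    /// without writing `.as_bytes()`. This is an inherent associated
    /// function, not an implementation of the `FromStr` trait.
    fn from_str(source: &'a str) -> Self {
        Self::from_utf8(source.as_bytes())
    }

    /// Returns the borrowed input, to show both constructors agree.
    fn source(&self) -> &'a [u8] {
        self.source
    }
}
```

One reason an inherent function fits here is that the `FromStr` trait's `from_str` cannot tie the returned value's lifetime to the input string, whereas a borrowing parser needs exactly that.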

nekevss · 2024-05-30T20:42:04Z

utils/ixdtf/src/parsers/mod.rs

+    /// Peeks the value at `n` and returns the char with its byte length.
+    fn peek_with_info(&self, n: usize) -> Option<(usize, char)> {
+        let mut chars =
+            Utf8CharIndices::new(self.source.get(self.pos..self.source.len()).unwrap_or(&[]));


I mentioned this in a previous comment, but there's currently no erroring version of Utf8CharIndices. Is there any preference towards erroring on invalid UTF-8?

I think the idea is that ill-formed UTF-8, even if it resolves to the replacement character, won't be valid IXDTF syntax since the replacement character is not valid in any part of the grammar.

Yep! That's the current idea. The replacement character will fail the grammar test currently, but that's a little different than flagging the character as invalid.

For instance, say there is an invalid character in a year. The error thrown currently would be ParserError::DateYear vs. potentially throwing a ParserError::Utf8CharError. Not sure if there'd be a preference there.
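
The difference between the two error-reporting styles can be sketched as follows. Everything here is hypothetical: `SketchError` and its variants only mirror the names mentioned above, the lossy path uses `String::from_utf8_lossy` as a std-only stand-in for a replacement-character iterator, and the strict path validates the whole input up front, which a real parser might not want to do.

```rust
#[derive(Debug, PartialEq)]
enum SketchError {
    /// Grammar failure while reading the year.
    DateYear,
    /// Hypothetical dedicated variant for ill-formed UTF-8.
    Utf8,
}

/// Reads four ASCII digits as a year from already-decoded chars.
fn year_from_chars(mut chars: impl Iterator<Item = char>) -> Result<u32, SketchError> {
    let mut year = 0;
    for _ in 0..4 {
        let digit = chars
            .next()
            .and_then(|ch| ch.to_digit(10))
            .ok_or(SketchError::DateYear)?;
        year = year * 10 + digit;
    }
    Ok(year)
}

/// Lossy: ill-formed bytes become U+FFFD, which then fails the grammar check.
fn year_lossy(source: &[u8]) -> Result<u32, SketchError> {
    year_from_chars(String::from_utf8_lossy(source).chars())
}

/// Strict: validates first, so encoding problems get their own error.
fn year_strict(source: &[u8]) -> Result<u32, SketchError> {
    let text = std::str::from_utf8(source).map_err(|_| SketchError::Utf8)?;
    year_from_chars(text.chars())
}
```

With the input `[b'2', 0xFF, b'2', b'4']`, the lossy version reports `DateYear` while the strict version reports `Utf8`; both reject the input, and only the diagnosis differs.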

I see, hmm.

sffc

Sorry for the slow review!

sffc · 2024-06-03T20:47:30Z

utils/ixdtf/src/parsers/grammar.rs

-pub(crate) const fn is_hyphen(ch: char) -> bool {
+pub(crate) fn is_hyphen(ch: char) -> bool {


Observation: you can make these const again (but there's not really a need to)

Yeah, I noticed those could be flipped back to const, but I figured there wasn't much of a gain either way (might do it though on an upcoming commit to lower the diff)
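
As a small illustration of the observation: comparing two `char` values is permitted inside a `const fn`, so a char-based predicate can keep the qualifier. The function body below is an assumption, since the diff only shows the signature.

```rust
/// `char` equality is allowed in `const fn`, so the predicate can stay const.
/// The body is assumed for illustration; only the signature appears in the diff.
pub(crate) const fn is_hyphen(ch: char) -> bool {
    ch == '-'
}

/// Because the function is const, it can also be evaluated at compile time.
const HYPHEN_IS_HYPHEN: bool = is_hyphen('-');
```

The practical gain is small unless something actually calls the predicate in a const context, which matches the "not really a need" framing above.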

sffc · 2024-06-03T20:48:49Z

utils/ixdtf/src/parsers/mod.rs

-    /// Creates a new `IXDTFParser` from a provided `&str`.
-    pub fn new(value: &'a str) -> Self {
+    /// Creates a new `IxdtfParser` from a provided `&str`.
+    pub fn new(value: &'a [u8]) -> Self {


Indeed; from_utf8 might be the correct name here.

You can also add from_str (concrete associated function, not FromStr impl) and use it in the docs so you don't need to call .as_bytes() everywhere.

sffc · 2024-06-03T20:51:42Z

utils/ixdtf/src/parsers/mod.rs

-            return Ok(None);
-        };
-        Ok(Some(digit as u8))
+        // Safety: Char digit with a radix of ten must be in the range of a u8


Nit: It's not a safety comment since there is no unsafe

Suggested change

-        // Safety: Char digit with a radix of ten must be in the range of a u8
+        // Note: Char digit with a radix of ten must be in the range of a u8
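
For context on what the comment is asserting, here is a sketch of the conversion it annotates. `char::to_digit(10)` returns a `u32` in `0..=9` for ASCII decimal digits, so narrowing it to `u8` cannot truncate. The function names are hypothetical, and the checked variant is offered only as an alternative that avoids the cast.

```rust
use std::convert::TryFrom;

/// Converts an ASCII decimal digit to its numeric value.
fn digit_value(ch: char) -> Option<u8> {
    // Note: `to_digit(10)` yields 0..=9, so the value always fits in a u8.
    ch.to_digit(10).map(|digit| digit as u8)
}

/// Alternative without a cast: the range check is done by `try_from`.
fn digit_value_checked(ch: char) -> Option<u8> {
    ch.to_digit(10).and_then(|digit| u8::try_from(digit).ok())
}
```

Since neither version involves `unsafe`, a plain `Note:` prefix describes the comment's role more accurately than `Safety:`.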

nekevss added 2 commits May 19, 2024 15:52

Migrate from &str to &[u8] for parser

ed228d0

Update new method from &str to &[u8]

e9b9f25

nekevss requested a review from sffc May 20, 2024 14:00

nekevss added 8 commits May 21, 2024 00:48

Switch to char and &[u8] to char conversion

64b3a7c

Fix rustfmt

4f9ade4

Merge branch 'main' into byte-slice-support

9ec746b

Fix utf8 handling decoding

fac4a70

Remove comment on no_std

9fff86b

Fix bounds check on 3-byte and rename byte -> leading_byte

6b7f433

Update get_utf8_offset and decode_utf8_bytes

03cbe3d

Remove unfinished comment

e4256eb

Merge branch 'main' into byte-slice-support

a36e4c4

nekevss marked this pull request as ready for review May 24, 2024 18:22

nekevss requested a review from a team as a code owner May 24, 2024 18:22

sffc reviewed May 24, 2024

nekevss added 2 commits May 24, 2024 15:58

Add and implement utf8_iter and remove unsafe code

5a20a37

Remove unneeded Result from peek and related adjustments

2b8dc29

nekevss requested a review from sffc May 24, 2024 23:14

nekevss added 2 commits May 24, 2024 19:37

Merge branch 'main' into byte-slice-support

064d03f

Merge branch 'main' into byte-slice-support

e9cdd46

nekevss mentioned this pull request May 29, 2024

Migrate to the ixdtf crate for datetime parsing boa-dev/temporal#39

Closed

nekevss commented May 30, 2024

sffc reviewed Jun 3, 2024

nekevss added 2 commits June 3, 2024 20:42

Merge branch 'main' into byte-slice-support

dd68526

Apply review feedback: add from_str and from_utf8

f79eac7

sffc approved these changes Jun 4, 2024

sffc merged commit 3a64209 into unicode-org:main Jun 4, 2024
28 checks passed
