-
Notifications
You must be signed in to change notification settings - Fork 184
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Migrate from &str
to &[u8]
in ixdtf
#4918
Conversation
Before I look too much into your UTF-8 math, did you consider using https://docs.rs/utf8_iter/latest/utf8_iter/ My expectation is that you could largely keep the code the same, |
I did not actually. I'll take a glance at the crate. It may be a better option to use. That being said, assuming the logic here is correct for all cases, this implementation MAY be more performant (depending on compiler optimizations). As it is essentially using the grammar tests to enforce UTF-8 validation at the same time. Granted, I almost want more invalid byte sequence tests to ensure the logic is correct. |
In general I see a lot of replacing things like |
I think this is about ready for review if you're fine with not using I was taking a look at That being said, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would still like to see this use utf8_iter
for the following reasons
- The crate basically does exactly what you're trying to do in
decode_utf8_bytes
, and we should try to leverage existing code - As you've noted,
utf8_iter
is much more well tested - Since this is
unsafe
code, the bar for thorough testing is much higher, and currently I'm not convinced we have the level of test coverage for edge cases in this function that I would like to see forunsafe
code review - Currently
ixdtf
is completely safe. Anyunsafe
code, even one line, changes the crate from being safe to requiring an unsafe review, which can get in the way of crate adoption in resource-limited enterprises who require safe code.
Basically I think that adding |
Updated! One downside is the loss the |
utils/ixdtf/src/parsers/mod.rs
Outdated
/// Creates a new `IXDTFParser` from a provided `&str`. | ||
pub fn new(value: &'a str) -> Self { | ||
/// Creates a new `IxdtfParser` from a provided `&str`. | ||
pub fn new(value: &'a [u8]) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the talk during the call today. Was there any preference here around using new
vs. from_bytes
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed; from_utf8
might be the correct name here.
You can also add from_str
(concrete associated function, not FromStr impl) and use it in the docs so you don't need to call .as_bytes()
everywhere.
/// Peeks the value at `n` and returns the char with its byte length. | ||
fn peek_with_info(&self, n: usize) -> Option<(usize, char)> { | ||
let mut chars = | ||
Utf8CharIndices::new(self.source.get(self.pos..self.source.len()).unwrap_or(&[])); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mentioned this on a previous comment, but there's no erroring version of Utf8CharIndices
currently. Is there any preference towards erroring on invalid utf8?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the idea is that ill-formed UTF-8, even if it resolves to the replacement character, won't be valid IXDTF syntax since the replacement character is not valid in any part of the grammar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep! That's the current idea. The replacement character will fail the grammar test currently, but that's a little different than flagging the character as invalid.
For instance, say there is an invalid character in a year. The error thrown currently would be ParserError::DateYear
vs. potentially throwing a ParserError::Utf8CharError
. Not sure if there'd be a preference there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, hmm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the slow review!
utils/ixdtf/src/parsers/grammar.rs
Outdated
pub(crate) const fn is_hyphen(ch: char) -> bool { | ||
pub(crate) fn is_hyphen(ch: char) -> bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Observation: you can make these const
again (but there's not really a need to)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I noticed those could be flipped back to const
, but I figured there wasn't much of a gain either way (might do it though on an upcoming commit to lower the diff)
utils/ixdtf/src/parsers/mod.rs
Outdated
/// Creates a new `IXDTFParser` from a provided `&str`. | ||
pub fn new(value: &'a str) -> Self { | ||
/// Creates a new `IxdtfParser` from a provided `&str`. | ||
pub fn new(value: &'a [u8]) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed; from_utf8
might be the correct name here.
You can also add from_str
(concrete associated function, not FromStr impl) and use it in the docs so you don't need to call .as_bytes()
everywhere.
utils/ixdtf/src/parsers/mod.rs
Outdated
return Ok(None); | ||
}; | ||
Ok(Some(digit as u8)) | ||
// Safety: Char digit with a radix of ten must be in the range of a u8 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: It's not a safety comment since there is no unsafe
// Safety: Char digit with a radix of ten must be in the range of a u8 | |
// Note: Char digit with a radix of ten must be in the range of a u8 |
This is related to tc39/proposal-temporal/issues/2843.
It switches
ixdtf
's parsing from&str
andchar
to&[u8]
.This current version does remove direct parsing for
&str
, but ascore::str
supportsas_bytes()
, there shouldn't be too many incompatibility issues with&str
(unless it would be an issue with the FFI layer).The core changes are to
parser/mod.rs
'speek_n
andslice
method along with the grammar changes. There were also some tests added for bad byte sequences, but if there is another case that may need to be tested to ensure we prevent any invalid UTF-8 sequences, feedback would be welcomed.