Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

text_format parsing does not correctly handle non-ascii chars? #729

Open
cmyr opened this issue Jun 14, 2024 · 1 comment · May be fixed by #730
Open

text_format parsing does not correctly handle non-ascii chars? #729

cmyr opened this issue Jun 14, 2024 · 1 comment · May be fixed by #730

Comments

@cmyr
Copy link

cmyr commented Jun 14, 2024

It's possible I'm misreading this, but I'm running into issues trying to read text-format protobuf files that contain string literals with non-ascii characters. Reading through the source, the following really does not look correct:

pub fn next_byte_value(&mut self) -> LexerResult<u8> {
match self.next_char()? {
'\\' => {
match self.next_char()? {
'\'' => Ok(b'\''),
'"' => Ok(b'"'),
'\\' => Ok(b'\\'),
'a' => Ok(b'\x07'),
'b' => Ok(b'\x08'),
'f' => Ok(b'\x0c'),
'n' => Ok(b'\n'),
'r' => Ok(b'\r'),
't' => Ok(b'\t'),
'v' => Ok(b'\x0b'),
'x' => {
let d1 = self.next_hex_digit()? as u8;
let d2 = self.next_hex_digit()? as u8;
Ok(((d1 << 4) | d2) as u8)
}
d if d >= '0' && d <= '7' => {
let mut r = d as u8 - b'0';
for _ in 0..2 {
match self.next_octal_digit() {
Err(_) => break,
Ok(d) => r = (r << 3) + d as u8,
}
}
Ok(r)
}
// https://github.com/google/protobuf/issues/4562
// TODO: overflow
c => Ok(c as u8),
}
}
'\n' | '\0' => Err(LexerError::IncorrectInput),
// TODO: check overflow
c => Ok(c as u8),

basically this is consuming chars but only ever returning bytes, and it is converting a char (which represents a unicode scalar value) directly into a u8 which will then be interpreted as utf-8; but outside of ascii the integer value of a char does not correspond to the utf-8 encoding of a char. (for instance the char À has a unicode scalar value of 192, but is encoded as 0xC3, 0x80 in utf-8).

I don't think this is hard to fix; you just need to stay in chars the whole time, and avoid converting to bytes. Given that the text_format input is always valid utf-8 (since you parse from &str, which is always valid utf-8) it should not be possible for a string literal to ever not be valid utf-8.

I'm going to go ahead and write a patch for this and PR it preemptively, since I think it should be relatively trivial; will figure out a test case as well.

@cmyr cmyr linked a pull request Jun 17, 2024 that will close this issue
@simoncozens
Copy link

Hitting the same problem, and Colin's parse-unicode-strings branch works for me. Please merge!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants