text_format parsing does not correctly handle non-ascii chars? #729

cmyr · 2024-06-14T20:14:21Z

It's possible I'm misreading this, but I'm running into issues trying to read text-format protobuf files that contain string literals with non-ascii characters. Reading through the source, the following really does not look correct:

rust-protobuf/protobuf-support/src/lexer/lexer_impl.rs

Lines 443 to 479 in 16c9dc5

    
    pub fn next_byte_value(&mut self) -> LexerResult<u8> {
        match self.next_char()? {
            '\\' => {
                match self.next_char()? {
                    '\'' => Ok(b'\''),
                    '"' => Ok(b'"'),
                    '\\' => Ok(b'\\'),
                    'a' => Ok(b'\x07'),
                    'b' => Ok(b'\x08'),
                    'f' => Ok(b'\x0c'),
                    'n' => Ok(b'\n'),
                    'r' => Ok(b'\r'),
                    't' => Ok(b'\t'),
                    'v' => Ok(b'\x0b'),
                    'x' => {
                        let d1 = self.next_hex_digit()? as u8;
                        let d2 = self.next_hex_digit()? as u8;
                        Ok(((d1 << 4) | d2) as u8)
                    }
                    d if d >= '0' && d <= '7' => {
                        let mut r = d as u8 - b'0';
                        for _ in 0..2 {
                            match self.next_octal_digit() {
                                Err(_) => break,
                                Ok(d) => r = (r << 3) + d as u8,
                            }
                        }
                        Ok(r)
                    }
                    // https://github.com/google/protobuf/issues/4562
                    // TODO: overflow
                    c => Ok(c as u8),
                }
            }
            '\n' | '\0' => Err(LexerError::IncorrectInput),
            // TODO: check overflow
            c => Ok(c as u8),

Basically, this consumes chars but only ever returns bytes: it converts a char (which represents a Unicode scalar value) directly into a u8, which is then interpreted as UTF-8. Outside of ASCII, however, the integer value of a char does not correspond to its UTF-8 encoding. For instance, the char À has a Unicode scalar value of 192 (U+00C0), but is encoded as 0xC3, 0x80 in UTF-8.
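To make the mismatch concrete, here is a minimal standalone sketch. The helper names `truncating_cast` and `utf8_bytes` are hypothetical, introduced only for illustration; they are not part of rust-protobuf. The first mirrors the `c as u8` cast in the snippet above, the second uses the standard library's `char::encode_utf8`.

```rust
/// Mirrors the `c as u8` cast in the lexer: keeps only the low 8 bits
/// of the Unicode scalar value. (Hypothetical helper for illustration.)
fn truncating_cast(c: char) -> u8 {
    c as u8
}

/// The real UTF-8 encoding of a char, which is 1 to 4 bytes long.
/// (Hypothetical helper for illustration.)
fn utf8_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}
```

For 'À', `truncating_cast` gives the single byte 0xC0, while `utf8_bytes` gives 0xC3, 0x80. The byte 0xC0 on its own is not valid UTF-8, so `std::str::from_utf8(&[0xC0])` returns an error. For chars above U+00FF the cast also silently drops the high bits.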

I don't think this is hard to fix: you just need to stay in chars the whole time and avoid converting to bytes. Given that text_format input is always valid UTF-8 (you parse from &str, which is always valid UTF-8), it should not be possible for a string literal to ever be invalid UTF-8.
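A rough sketch of what "staying in chars" could look like, assuming a hypothetical free function `unescape_to_string` that takes the body of a string literal (without the surrounding quotes). This is not the patch from the linked PR and not rust-protobuf's API. It handles only a few simple escapes and deliberately rejects `\x` and octal escapes, which yield raw bytes and would need separate treatment.

```rust
/// Hypothetical helper, not part of rust-protobuf: decode the body of a
/// string literal while staying in `char`s the whole time.
/// Returns `None` for escapes this sketch does not handle.
fn unescape_to_string(body: &str) -> Option<String> {
    let mut out = String::with_capacity(body.len());
    let mut chars = body.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            // Non-escape characters, ASCII or not, are pushed unchanged,
            // so the output stays valid UTF-8 by construction.
            out.push(c);
            continue;
        }
        // A trailing backslash makes `next()` return None, which `?` propagates.
        let unescaped = match chars.next()? {
            'n' => '\n',
            'r' => '\r',
            't' => '\t',
            '\\' => '\\',
            '"' => '"',
            '\'' => '\'',
            // `\x..`, octal, and other escapes are out of scope for this sketch.
            _ => return None,
        };
        out.push(unescaped);
    }
    Some(out)
}
```

With this approach, `unescape_to_string("À la carte")` returns the input unchanged, because each char is appended to a `String` rather than being narrowed to a single byte.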

I'm going to go ahead and write a patch for this and PR it preemptively, since I think it should be relatively trivial; will figure out a test case as well.


simoncozens · 2024-09-03T07:44:08Z

Hitting the same problem, and Colin's parse-unicode-strings branch works for me. Please merge!

cmyr linked a pull request Jun 17, 2024 that will close this issue

[text-format] Fix parsing of string literals #730

Open
