How can I build a regex pattern with sequences of non utf-8 bytes? #1253

johomo · 2025-02-13T12:12:13Z

First of all, thank you all for maintaining such a great crate 🫶

Now, straight to the point.

The signature of regex::bytes::Regex::new takes the regular expression as a &str, and a str must be valid UTF-8. This effectively means that Regex cannot be used to find a sequence of bytes that is not valid UTF-8 in a haystack of bytes.

As an example, the following snippet attempts to find a needle b"\x9c\xe5" (which is not valid utf8) in a haystack of bytes.

use regex::bytes::RegexBuilder;

fn main() {
    let hay: &[u8] = b"The following bytes are not UTF8 valid: \x9c\xe5";
    
    let pattern_as_bytes: &[u8] = b"\x9c\xe5";
    
    let re = RegexBuilder::new(pattern_as_bytes).unicode(false).build().expect("Invalid regex pattern");
    //       ----------------- ^^^^^^^^^^^^^^^^^ expected `&str`, found `&[u8]`
    //       |
    //       arguments to this function are incorrect

    // Find all occurrences of the pattern in the text
    for mat in re.find_iter(hay) {
        println!("Found match at: {:?}", mat);
    }
}

For the sake of completeness, this exact example can be accomplished with memchr::memmem::find_iter.
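For readers without memchr at hand, a plain literal search like the one above can be sketched with the standard library alone. This is a naive stand-in and not memchr's algorithm; `find_all` is a hypothetical helper name, and unlike a real substring searcher it makes no attempt to be fast.

```rust
/// Returns the start offset of every occurrence of `needle` in `hay`.
/// Naive scan over all windows; `find_all` is a hypothetical helper,
/// not part of memchr or regex.
fn find_all(hay: &[u8], needle: &[u8]) -> Vec<usize> {
    // `windows(0)` panics, so an empty needle is treated as "no matches".
    if needle.is_empty() {
        return Vec::new();
    }
    hay.windows(needle.len())
        .enumerate()
        .filter(|(_, window)| *window == needle)
        .map(|(start, _)| start)
        .collect()
}

fn main() {
    let hay: &[u8] = b"The following bytes are not UTF8 valid: \x9c\xe5";
    let needle: &[u8] = b"\x9c\xe5";

    // Both the haystack and the needle are plain byte slices, so invalid
    // UTF-8 is not a problem here.
    for start in find_all(hay, needle) {
        println!("Found match at: {}..{}", start, start + needle.len());
    }
}
```

This only covers fixed needles; anything that needs alternation or repetition still calls for a regex.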

My question is: What is the motivation to only allow patterns as instances of &str?

To be honest, I'm not sure if it makes any sense to allow patterns as bytes. For instance, how would you write a pattern such as "match either bytes \x9c\xe5 or bytes b"bar""? Would it be br"(\x9c\xe5|bar)"? Does this even make sense?

I feel I am missing something big. That's why I'm really interested in your point of view as experts in this topic.

Thanks a lot for your time.

BurntSushi (Maintainer) · 2025-02-13T13:21:33Z

> This effectively means that Regex cannot be used to find a sequence of utf8-invalid bytes in a haystack of bytes.

That's incorrect. Here's a counter-example:

use regex::bytes::Regex;

fn main() {
    let haystack = b"foo bar\xFF baz";
    let re = Regex::new(r"(?-u)\xFF").unwrap();
    assert_eq!(re.find(haystack).map(|r| r.range()), Some(7..8));
}

> I feel I am missing something big.

You might have missed this section in the docs: https://docs.rs/regex/latest/regex/bytes/index.html#syntax

The relevant sections there are:

The u flag can be disabled even when disabling it might cause the regex to match invalid UTF-8. When the u flag is disabled, the regex is said to be in “ASCII compatible” mode.

and

Hexadecimal notation can be used to specify arbitrary bytes instead of Unicode codepoints. For example, in ASCII compatible mode, \xFF matches the literal byte \xFF, while in Unicode mode, \xFF is the Unicode codepoint U+00FF that matches its UTF-8 encoding of \xC3\xBF. Similarly for octal notation when enabled.

While the pattern itself has to be valid UTF-8, you can match arbitrary byte sequences using hex escapes when Unicode mode is disabled.
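To make the distinction concrete, here is a small std-only sketch (it does not use the regex crate). It checks that a lone \xFF byte is not valid UTF-8, that U+00FF is stored as \xC3\xBF, and that the escape text \xFF itself is plain ASCII. The helper names `is_valid_utf8` and `utf8_bytes` are made up for illustration.

```rust
/// True if `bytes` could be stored in a `&str`. Hypothetical helper.
fn is_valid_utf8(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}

/// The UTF-8 encoding of a single code point. Hypothetical helper.
fn utf8_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}

fn main() {
    // A raw 0xFF byte, or the needle from the question, cannot live in a &str.
    assert!(!is_valid_utf8(&[0xFF]));
    assert!(!is_valid_utf8(b"\x9c\xe5"));

    // The code point U+00FF is fine in a &str, but it is two bytes long.
    assert_eq!(utf8_bytes('\u{FF}'), vec![0xC3u8, 0xBF]);

    // The pattern only contains the four ASCII characters `\`, `x`, `F`, `F`
    // (plus the flag group), so it is valid UTF-8 even though it describes
    // a byte that is not.
    let pattern = r"(?-u)\xFF";
    assert!(pattern.is_ascii());
    println!("pattern {:?} is {} bytes of ASCII", pattern, pattern.len());
}
```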

But yeah this is a good question! This design was absolutely intentional and it is very much an important point that arbitrary bytes can be matched even though a regex pattern itself has to be valid UTF-8.

In theory, the implementation could be changed to support patterns that aren't valid UTF-8. Or even changed to parse straight from a &[u8]. But in practice this is usually not advantageous and it's better for comprehensibility reasons to just require that the pattern be valid UTF-8.

2 replies

johomo Feb 13, 2025
Author

Thanks for your quick reply, that was really helpful.

So, at the end of the day I was trying to build a regex pattern with sequences of bytes that may not be utf8 valid.
The point I was missing is that literal bytes in a pattern (not necessarily valid UTF-8) must be written in hexadecimal notation, for example inside a raw string literal r"", with Unicode mode disabled.

So, if you have a sequence of bytes like this: let bytes: Vec<u8> = vec![0x9c, 0xe5];, they must be converted into r"\x9c\xe5". For example, you can convert from a Vec<u8> to a valid pattern String using let bytes_as_hex = bytes.iter().map(|byte| format!(r"\x{:02x}", byte)).collect::<String>();.

For example:

let bytes: Vec<u8> = vec![0x9c, 0xe5];
let bytes_as_string: String = bytes.iter().map(|byte| format!(r"\x{:02x}", byte)).collect();
// Matches either the bytes `\x9c\xe5` (not valid UTF-8) or the bytes `b"bar"`
let pattern = format!(r"(?-u)({}|bar)", bytes_as_string);

Shall I update the title of the thread to another one more suitable for people facing the same problem? Perhaps something like: How can I build a regex pattern with sequences of non utf-8 bytes?

Again, thanks a lot for your help!

BurntSushi Feb 13, 2025
Maintainer

Yeah that title sounds good! I've updated it. Thanks. :-)
