How can I build a regex pattern with sequences of non utf-8 bytes? #1253
First of all, thank you all for maintaining such a great crate 🫶 Now, straight to the point.

The signature of regex::bytes::Regex::new takes a regular expression as a &str. As an example, the following snippet attempts to find a needle of non-UTF-8 bytes in a haystack:

use regex::bytes::RegexBuilder;
fn main() {
    let hay: &[u8] = b"The following bytes are not UTF8 valid: \x9c\xe5";
    let pattern_as_bytes: &[u8] = b"\x9c\xe5";
    let re = RegexBuilder::new(pattern_as_bytes).unicode(false).build().expect("Invalid regex pattern");
    //       ----------------- ^^^^^^^^^^^^^^^^ expected `&str`, found `&[u8]`
    //       |
    //       arguments to this function are incorrect

    // Find all occurrences of the pattern in the text
    for mat in re.find_iter(hay) {
        println!("Found match at: {:?}", mat);
    }
}

For the sake of completeness, this exact example can be accomplished with

My question is: what is the motivation to only allow patterns as instances of &str?

To be honest, I'm not sure if it makes any sense to allow patterns as bytes. For instance, how would you write a pattern such as "match either bytes

I feel I am missing something big. That's why I'm really interested in your point of view as experts in this topic. Thanks a lot for your time.
Replies: 1 comment 2 replies
That's incorrect. Here's a counter-example:

use regex::bytes::Regex;
fn main() {
    let haystack = b"foo bar\xFF baz";
    let re = Regex::new(r"(?-u)\xFF").unwrap();
    assert_eq!(re.find(haystack).map(|r| r.range()), Some(7..8));
}
You might have missed this section in the docs: https://docs.rs/regex/latest/regex/bytes/index.html#syntax The relevant sections there are:
and
While the pattern itself has to be valid UTF-8, you can match arbitrary byte sequences using hex escapes when Unicode mode is disabled.

But yeah, this is a good question! This design was absolutely intentional, and it is very much an important point that arbitrary bytes can be matched even though a regex pattern itself has to be valid UTF-8.

In theory, the implementation could be changed to support patterns that aren't valid UTF-8. Or even changed to parse straight from a
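
To tie this back to the original question, here is a minimal sketch of one way to go from raw bytes to a pattern that is itself valid UTF-8. The helper name bytes_to_pattern is hypothetical and not part of the regex crate; it only uses the standard library and relies on the (?-u) flag and \xNN hex escapes described above.

```rust
use std::fmt::Write;

/// Hypothetical helper (not part of the regex crate): turns raw bytes into a
/// pattern string that is valid UTF-8 but describes exactly those bytes.
fn bytes_to_pattern(bytes: &[u8]) -> String {
    // `(?-u)` disables Unicode mode, so each `\xNN` stands for a single byte.
    let mut pattern = String::from("(?-u)");
    for b in bytes {
        // Hex-escaping every byte also neutralizes metacharacters such as `.` or `*`.
        write!(pattern, r"\x{:02X}", b).expect("writing to a String cannot fail");
    }
    pattern
}

fn main() {
    // The needle from the question, expressed as a UTF-8 pattern string.
    let pattern = bytes_to_pattern(b"\x9c\xe5");
    println!("{}", pattern); // prints (?-u)\x9C\xE5
}
```

The resulting string can then be passed to regex::bytes::Regex::new(&pattern) and searched against a &[u8] haystack like the one in the question. Because the output is an ordinary pattern string, it can also be embedded in a larger pattern, for example as one branch of an alternation.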