In multi-line mode, ^$ unconditionally matches last line in text #355

Closed
BatmanAoD opened this issue Apr 7, 2017 · 6 comments

@BatmanAoD
Member

I've discovered that ^$, which should match empty lines, seems to unconditionally match the last line in text:

    use regex::Regex;

    fn main() {
        // Multi-line mode: `^` and `$` match at line boundaries.
        let re = Regex::new("(?m)^$").unwrap();
        if let Some(m) = re.find("foo\n") {
            println!("{:?}", m);
        }
    }

This should not match, since there are no empty lines in foo\n; however, it prints:

Match { text: "foo\n", start: 4, end: 4 }

I believe this is a bug.

(First discussed here: BurntSushi/ripgrep#416.)

@BatmanAoD
Member Author

From the linked ripgrep issue comment:

I believe it can probably be fixed by tweaking the end-of-file flag-setting logic. (An off-the-cuff suggestion: perhaps the start-of-line flag needs to be prohibited at end-of-file, regardless of what the last byte is. I'm not sure what the value is in having ^ match when there's nothing left to read in the file.)

@BurntSushi
Member

This isn't a bug. The docs could be clearer, but they say:

^     the beginning of text (or start-of-line with multi-line mode)
$     the end of text (or end-of-line with multi-line mode)

In this case, the start of a line is the position immediately following a \n. In the string foo\n, there is exactly one such position: offset 4, which is also the end of the haystack. Notably, the position immediately following the last byte in a haystack is a valid match position, so both ^ and $ match there.

This behavior also matches the semantics of Go, RE2 and Python.
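
To make the rule above concrete, here is a minimal sketch that models where `(?m)^$` can match under the semantics described in this comment. It is a hand-written model using only the standard library, not the regex crate's implementation, and the function name `multiline_empty_matches` is hypothetical.

```rust
/// Offsets at which `(?m)^$` matches under the semantics described above.
/// Hand-written model for illustration; not the regex crate's implementation.
fn multiline_empty_matches(haystack: &str) -> Vec<usize> {
    let bytes = haystack.as_bytes();
    // Every offset from 0 through len (inclusive) is a candidate position.
    (0..=bytes.len())
        .filter(|&p| {
            // `^`: start of text, or immediately after a `\n`.
            let at_line_start = p == 0 || bytes[p - 1] == b'\n';
            // `$`: end of text, or immediately before a `\n`.
            let at_line_end = p == bytes.len() || bytes[p] == b'\n';
            at_line_start && at_line_end
        })
        .collect()
}
```

For `"foo\n"` this yields the single offset 4, because that position both follows a `\n` and sits at the end of the text. For `"foo"` it yields nothing.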

@BurntSushi
Member

BurntSushi commented Apr 7, 2017

From Russ Cox's article, it seems this doesn't match PCRE's behavior:

Similarly, in multi-line mode, if the input text ends with a newline character, Perl and PCRE do not allow ^, which normally matches following a newline, to match at the very end of the text. RE2 does.

I've truthfully never thought deeply about these particular semantics, but I think there would need to be a very compelling argument (probably insatiable) to change them at this point.

@BurntSushi
Member

And at least some of the tests here depend on the current semantics, although none of the tests precisely match your example: https://github.com/rust-lang/regex/blob/master/tests/multiline.rs

@BatmanAoD
Member Author

BatmanAoD commented Apr 8, 2017

I'm very surprised (again) by the semantics of the Python regex engine, which I thought I knew fairly well. I was definitely under the impression that it was designed to match PCRE.

I did test that this doesn't match Perl's behavior. And for a grep tool like ripgrep, it is simply wrong, because finding empty lines is sometimes useful, and a false positive at the end of every file is not helpful. That doesn't mean that the behavior of this crate needs to change, since ripgrep could simply work around the issue, but you did direct me this way when I mentioned it on the ripgrep issue tracker... 😕 Either way, I am strongly of the opinion that ripgrep's behavior should match grep's for this case.
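
As a rough illustration of the grep-style behavior argued for here, the sketch below treats a trailing `\n` as a line terminator rather than as the start of a new line, so no empty line is reported at the end of the text. It is a standard-library sketch of the idea only, not how ripgrep or grep is implemented, and the function name `empty_line_numbers` is hypothetical.

```rust
/// 1-based line numbers of empty lines, treating a trailing `\n` as a
/// terminator of the last line rather than the start of a new one.
/// Illustrative only; this is not ripgrep's implementation.
fn empty_line_numbers(text: &str) -> Vec<usize> {
    // `str::lines` does not yield an extra empty line after a final `\n`.
    text.lines()
        .enumerate()
        .filter(|(_, line)| line.is_empty())
        .map(|(i, _)| i + 1)
        .collect()
}
```

With this view, `"foo\n"` has no empty lines, while `"foo\n\n"` has one, at line 2. Note that `str::lines` also strips a trailing `\r`, so an empty CRLF-terminated line counts as empty here.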

@BurntSushi
Member

OK, I've posted a follow-up on the ripgrep tracker: BurntSushi/ripgrep#416

I'll close this since I don't think this is a bug: the behavior is consistent with at least several other popular engines, hasn't caused anyone any grief (that I know of), and changing it to match PCRE exactly would be a breaking change.
