Ending newline character of a line seems to be exposed to the regex engine with -P option #1401

Closed
learnbyexample opened this issue Oct 10, 2019 · 2 comments
Labels
bug (A bug.), rollup (A PR that has been merged with many others in a rollup.)

Comments

@learnbyexample

What version of ripgrep are you using?

$ rg --version
ripgrep 11.0.2 (rev 3de31f7527)
-SIMD -AVX (compiled)
+SIMD -AVX (runtime)

How did you install ripgrep?

Using ripgrep_11.0.2_amd64.deb

What operating system are you using ripgrep on?

Ubuntu LTS 16.04

Describe your question, feature request, or bug.

When the -P option is used, \s seems to match the newline character at the end of each input line.

If this is a bug, what are the steps to reproduce the behavior?

Consider this sample input file:

$ printf 'foo 42\nxoyz\ncat\tdog\n' > ip.txt
$ cat ip.txt
foo 42
xoyz
cat	dog

Here's a minimal example showing the issue (I needed lookarounds, hence the need for -P switch)

$ # extract till last 'o' in the line, if there are no more whitespaces after 'o'
$ # the issue is that characters after 'o' are also displayed despite the -o switch
$ rg -NoP '.*o(?!.*\s)' ip.txt
xoyz
cat	dog

$ # works if I replace \s with space/tab character class or use GNU grep
$ # using \s(?!$) instead of \s is another workaround that works
$ rg -NoP '.*o(?!.*[ \t])' ip.txt
xo
cat	do
$ grep -oP '.*o(?!.*\s)' ip.txt
xo
cat	do

Another example shown below:

$ rg -No '.*\s' ip.txt
foo 
cat	

$ rg -NoP '.*\s' ip.txt
foo 42
cat	dog
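
The effect can be reproduced outside ripgrep. The sketch below uses Python's `re` module as a stand-in for PCRE2 (an assumption: the two engines agree on `\s`, `.`, `$` and lookahead for these particular patterns) and a hypothetical helper, `only_match`, to compare each line with and without its trailing newline.

```python
import re

def only_match(pattern, line):
    """Return the matched text for one line (what -o would show), or None."""
    m = re.search(pattern, line)
    return m.group(0) if m else None

lines = ["foo 42", "xoyz", "cat\tdog"]
patterns = [
    r".*o(?!.*\s)",       # pattern from the report
    r".*o(?!.*[ \t])",    # workaround: explicit space/tab class
    r".*o(?!.*\s(?!$))",  # workaround: ignore whitespace at end of line
    r".*\s",              # second example from the report
]

for pat in patterns:
    for line in lines:
        bare = only_match(pat, line)
        with_nl = only_match(pat, line + "\n")
        print(f"{pat!r:24} {line!r:12} bare={bare!r} with_newline={with_nl!r}")
```

For `.*o(?!.*\s)`, the bare line `xoyz` yields `xo`, while `xoyz\n` yields no match at all because `\s` inside the lookahead matches the newline. Both workarounds give the same result whether or not the newline is present.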
@BurntSushi
Owner

Interesting, yeah, I think I might see why this happens. With the default regex engine, all character classes containing \n (including \s) are rewritten to not contain \n. But doing this isn't possible with PCRE2. Of course, this is why ripgrep then uses PCRE2 on a line-by-line basis instead of searching through many lines at once. My guess is that it is including the \n at the end of each line, and it probably shouldn't be. Although, I haven't thought through the full consequences of that. I will attempt a fix for this at some point, but it is possible that this will end up being a wontfix bug.

Thanks for filing this!
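
The class rewriting mentioned in this comment can be illustrated with a rough sketch. It uses Python's `re` rather than ripgrep's default regex engine, and spells the newline-free whitespace class by hand as `[^\S\n]`. How ripgrep performs the rewrite internally is not shown in this thread, so treat this purely as an illustration of the idea.

```python
import re

line = "foo 42\n"  # a line with its terminator still attached

# \s includes \n, so the match runs through the end of the line.
plain = re.search(r".*\s", line).group(0)

# [^\S\n] means "whitespace except \n": the terminator can no longer match.
no_newline = re.search(r".*[^\S\n]", line).group(0)

print(repr(plain))
print(repr(no_newline))
```

With the newline excluded from the class, the match stops at the last space, which is what the default engine printed for `rg -No '.*\s'`.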

BurntSushi added the bug label on Oct 10, 2019
@maage

maage commented Aug 19, 2020

Some extended testing.
Tests were run on Fedora 32 x86_64.

$ rg --version
ripgrep 12.1.1
-SIMD -AVX (compiled)
+SIMD -AVX (runtime)
$ pcre2grep --version
pcre2grep version 10.35 2020-05-09
$ perl --version

This is perl 5, version 30, subversion 3 (v5.30.3) built for x86_64-linux-thread-multi
(with 75 registered patches, see perl -V for more detail)
...
$ grep --version
grep (GNU grep) 3.3
...

Test data:

$ printf 'foo 42\nxoyz\ncat\tdog\nfoo' > ip.txt
$ cat ip.txt; echo "<EOF>" 
foo 42
xoyz
cat	dog
foo<EOF>

Just to make comparison complete:

$ pcre2grep -o '.*o(?!.*\s)' ip.txt 
xo
cat	do
foo
$ perl -ne 'chomp;print "$&\n" if /.*o(?!.*\s)/' ip.txt 
xo
cat	do
foo
$ grep -oP '.*o(?!.*\s)' ip.txt
xo
cat	do
foo
$ perl -ne 'print "$&\n" if /.*o(?!.*\s)/' ip.txt 
foo

In perl, each line still contains the \n at the end when you iterate line by line, unless you chomp it. Judging by the output above, pcre2grep matches as if the newline had been removed. Grep colors all matches correctly. The others do not do coloring.
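
The two perl runs differ only in `chomp`. A rough Python equivalent (an illustration only, with `re` standing in for Perl's regex engine and `matches` a hypothetical helper) makes the difference explicit on the same test data:

```python
import re

DATA = "foo 42\nxoyz\ncat\tdog\nfoo"  # no newline after the final "foo"
PATTERN = re.compile(r".*o(?!.*\s)")

def matches(data, strip_terminator):
    """Collect the first match per line, optionally stripping the trailing newline."""
    found = []
    for line in data.splitlines(keepends=True):
        if strip_terminator:
            line = line.rstrip("\n")  # plays the role of perl's chomp
        m = PATTERN.search(line)
        if m:
            found.append(m.group(0))
    return found

print(matches(DATA, strip_terminator=True))
print(matches(DATA, strip_terminator=False))
```

With the terminator stripped, three lines match; with it kept, only the final `foo` matches, since it is the only line that has no newline for `\s` to see.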

With rg:

$ rg -NoP '.*o(?!.*\s)' ip.txt
xoyz
cat	dog
foo

Only "foo" is colored as a match. Somehow the first two lines end up being printed in full even though they are not marked as matches.

BurntSushi added the rollup label on May 31, 2021
BurntSushi added a commit that referenced this issue Jun 1, 2021
This fixes a bug where PCRE2 look-around could change the result of a
match if it observed a line terminator in the printer. And in
particular, this is precisely how the searcher operates: the line is
considered unto itself *without* the line terminator.

Fixes #1401
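
To make the searcher/printer mismatch concrete, here is a toy model in Python. Everything in it is an assumption made for illustration: the function names, the use of `re` in place of PCRE2, and the fallback of printing the whole line when the printer finds no match (chosen only because it reproduces the output reported in this thread). It is not ripgrep's code.

```python
import re

PATTERN = re.compile(r".*o(?!.*\s)")

def searcher_matches(line):
    # The searcher considers the line without its terminator.
    return PATTERN.search(line.rstrip("\n")) is not None

def printer_only_matching(line, trim_terminator):
    # The printer re-runs the regex to decide which text to print.
    haystack = line.rstrip("\n") if trim_terminator else line
    m = PATTERN.search(haystack)
    # Assumed fallback: if no match is found here, print the whole line.
    return m.group(0) if m else line.rstrip("\n")

def run(data, trim_terminator):
    return [
        printer_only_matching(line, trim_terminator)
        for line in data.splitlines(keepends=True)
        if searcher_matches(line)
    ]

data = "foo 42\nxoyz\ncat\tdog\nfoo"
print(run(data, trim_terminator=False))  # printer sees the terminator
print(run(data, trim_terminator=True))   # printer trims the terminator first
```

In this model, trimming the terminator in the printer makes it agree with the searcher, so only the matched portion of each line is reported.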
BurntSushi added a commit that referenced this issue Jun 1, 2021
This is basically the same bug as #1401, but applied to replacements
instead of --only-matching.

Fixes #1739