Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

-w does not interact well with metacharacters #445

Closed
epa opened this issue Apr 9, 2014 · 14 comments
Closed

-w does not interact well with metacharacters #445

epa opened this issue Apr 9, 2014 · 14 comments

Comments

@epa
Copy link

epa commented Apr 9, 2014

-w is documented as matching whole words only, and that it works by wrapping the pattern in \b metacharacters. Given this, you would expect two equivalent regexps to give the same behaviour, e.g.

abc
(?:abc)

These are exactly equivalent perl5 regular expressions, but they behave differently with -w:

% mkdir n; cd n
% echo abcde >foo
% ack -w 'abc'
% ack -w '(?:abc)'
foo
1:abcde

The fix is quite simple: at present the code doesn't add the \b anchor at the start if the regexp doesn't start with a word character, ditto end. That doesn't match the documentation, which says that the regexp is wrapped with \b unconditionally. Patch to make behaviour match docs:

diff --git a/ack b/ack
index 38ba811..978d893 100644
--- a/ack
+++ b/ack
@@ -307,11 +307,9 @@ sub build_regex {

     $str = quotemeta( $str ) if $opt->{Q};
     if ( $opt->{w} ) {
-        my $pristine_str = $str;
-
         $str = "(?:$str)";
-        $str = "\\b$str" if $pristine_str =~ /^\w/;
-        $str = "$str\\b" if $pristine_str =~ /\w$/;
+        $str = "\\b$str";
+        $str = "$str\\b";
     }

     my $regex_is_lc = $str eq lc $str;
@hoelzro
Copy link
Collaborator

hoelzro commented Apr 9, 2014

See also #14

@hoelzro
Copy link
Collaborator

hoelzro commented Apr 9, 2014

The reason this is difficult is that \b matches word boundaries (\w vs \W) instead of whitespace boundaries (\S vs \s). In issue #14, we refer to a failing test that attempts ack -w mu., which would produce unexpected behavior on something like mu(, because you'd expect it to match, but it wouldn't. I'm not sure what the right solution is here, but I don't know if removing the check is the right thing to do.

@epa
Copy link
Author

epa commented Apr 9, 2014

Right, but currently the documentation explicitly says that \b is used, which implies the usual \b semantics of matching word boundaries rather than whitespace boundaries. So I would say that the test with mu. is incorrect, in that it is expecting something different to what the documentation says.

Let's fix the documentation and the test suite to be consistent with each other, and then we have some hope of making the code consistent with both...

@hoelzro
Copy link
Collaborator

hoelzro commented Apr 9, 2014

Sounds good to me!

@hoelzro hoelzro added this to the 2.14 milestone Sep 3, 2014
@epa
Copy link
Author

epa commented Jan 13, 2015

Hi, any update on this? I believe it is still outstanding in 2.14 despite this bug being added to the 2.14 milestone.

@petdance
Copy link
Collaborator

Someone needs to fix the differences between the docs and the tests. If you'd like to do that and submit a pull request, I'm interested.

@epa
Copy link
Author

epa commented Jan 13, 2015

Here's a fix which I think makes the behaviour consistent with the test suite, the documentation consistent with the behaviour, and (IMHO) matches the most common user expectations. Sorry, I am still fiddling with my github account so I wasn't able to make a pull request, but it is a patch to one file:

diff --git a/ack b/ack
index 3dee0b0..89d6e74 100644
--- a/ack
+++ b/ack
@@ -316,11 +316,9 @@ sub build_regex {

     $str = quotemeta( $str ) if $opt->{Q};
     if ( $opt->{w} ) {
-        my $pristine_str = $str;
-
         $str = "(?:$str)";
-        $str = "\\b$str" if $pristine_str =~ /^\w/;
-        $str = "$str\\b" if $pristine_str =~ /\w$/;
+        $str = "(?:\\b|(?!\\w))$str";
+        $str = "$str(?:\\b|(?<!\\w))";
     }

     my $regex_is_lc = $str eq lc $str;
@@ -1524,8 +1522,10 @@ Display version and copyright information.

 =item B<-w>, B<--word-regexp>

-Force PATTERN to match only whole words.  The PATTERN is wrapped with
-C<\b> metacharacters.
+Match a whole word only.  In more detail: if the match begins with a
+word character, then there must not be a word character immediately
+before the match.  If the match ends with a word character, there must
+not be a word character immediately after the match.

 =item B<-x>

@epa
Copy link
Author

epa commented Jan 28, 2015

Hi, have you had a chance to look at the patch?

@hoelzro
Copy link
Collaborator

hoelzro commented Jan 28, 2015

@epa The patch seems solid to me; I know @petdance wants a sort of code freeze for 2.16, though.

@epa
Copy link
Author

epa commented Jan 29, 2015

Thanks for reviewing, @hoelzro. I hope the patch can still be applied for 2.16 since it is self-contained and only affects the -w flag.

@epa
Copy link
Author

epa commented Oct 1, 2015

@hoelzro, thanks for reviewing the patch back in January. I have an updated version of the patch at #558 - do you think you could have a look at that one? Let me know if anything further is needed.

@hoelzro
Copy link
Collaborator

hoelzro commented Jan 6, 2016

@epa Sorry this has taken so long to review. I'm going to pull @petdance in in #558, since he has the final say when it comes to changing behavior that users may be relying on.

epa pushed a commit to epa/ack2 that referenced this issue Feb 27, 2017
@petdance
Copy link
Collaborator

This will have to be fixed in ack 3.

@petdance petdance removed this from the 2.16 milestone Mar 13, 2017
@petdance
Copy link
Collaborator

This has been fixed in ack3. This behavior won't change in ack2.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants