Skip to content

Commit

Permalink
grep: fix bugs in handling multi-line look-around
Browse files Browse the repository at this point in the history
This commit hacks in a bug fix for handling look-around across multiple
lines. The main problem is that by the time the matching lines are sent
to the printer, the surrounding context---which some look-behind or
look-ahead might have matched---could have been dropped if it wasn't
part of the set of matching lines. Therefore, when the printer re-runs
the regex engine in some cases (to do replacements, color matches, etc
etc), it won't be guaranteed to see the same matches that the searcher
found.

Overall, this is a giant clusterfuck and suggests that the way I divided
the abstraction boundary between the printer and the searcher is just
wrong. It's likely that the searcher needs to handle more of the work of
matching and pass that info on to the printer. The tricky part is that
this additional work isn't always needed. Ultimately, this means a
serious re-design of the interface between searching and printing. Sigh.

The way this fix works is to smuggle the underlying buffer used by the
searcher through into the printer. Since these bugs only impact
multi-line search (otherwise, searches are only limited to matches
across a single line), and since multi-line search always requires
having the entire file contents in a single contiguous slice (memory
mapped or on the heap), it follows that the buffer we pass through when
we need it is, in fact, the entire haystack. So this commit refactors
the printer's regex searching to use that buffer instead of the intended
bundle of bytes containing just the relevant matching portions of that
same buffer.

There is one last little hiccup: PCRE2 doesn't seem to have a way to
specify an ending position for a search. So when we re-run the search to
find matches, we can't say, "but don't search past here." Since the
buffer is likely to contain the entire file, we really cannot do
anything here other than specify a fixed upper bound on the number of
bytes to search. So if look-ahead goes more than N bytes beyond the
match, this code will break by simply being unable to find the match. In
practice, this is probably pretty rare. I believe that if we did a
better fix for this bug by fixing the interfaces, then we'd probably try
to have PCRE2 find the pertinent matches up front so that it never needs
to re-discover them.

Fixes #1412
  • Loading branch information
BurntSushi committed May 31, 2021
1 parent 721e24e commit e183c02
Show file tree
Hide file tree
Showing 13 changed files with 449 additions and 73 deletions.
9 changes: 9 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,13 @@ discussion on this. Previously, every line in a match was duplicated, even
when it spanned multiple lines. There are no changes to vimgrep output when
multi-line mode is disabled.

**In multi-line mode, --count is now equivalent to --count-matches.**

This appears to match how `pcre2grep` implements `--count`. Previously, ripgrep
would produce outright incorrect counts. Another alternative would be to simply
count the number of lines---even if it's more than the number of matches---but
that seems highly unintuitive.

Security fixes:

* [CVE-2021-3013](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-3013):
Expand Down Expand Up @@ -64,6 +71,8 @@ Bug fixes:
Document cygwin path translation behavior in the FAQ.
* [BUG #1311](https://github.com/BurntSushi/ripgrep/issues/1311):
Fix multi-line bug where a search & replace for `\n` didn't work as expected.
* [BUG #1412](https://github.com/BurntSushi/ripgrep/issues/1412):
Fix multi-line bug with searches using look-around past matching lines.
* [BUG #1642](https://github.com/BurntSushi/ripgrep/issues/1642):
Fixes a bug where using `-m` and `-A` printed more matches than the limit.
* [BUG #1703](https://github.com/BurntSushi/ripgrep/issues/1703):
Expand Down
6 changes: 4 additions & 2 deletions crates/core/app.rs
Original file line number Diff line number Diff line change
Expand Up @@ -1057,11 +1057,13 @@ fn flag_count(args: &mut Vec<RGArg>) {
This flag suppresses normal output and shows the number of lines that match
the given patterns for each file searched. Each file containing a match has its
path and count printed on each line. Note that this reports the number of lines
that match and not the total number of matches.
that match and not the total number of matches, unless -U/--multiline is
enabled. In multiline mode, --count is equivalent to --count-matches.
If only one file is given to ripgrep, then only the count is printed if there
is a match. The --with-filename flag can be used to force printing the file
path in this case.
path in this case. If you need a count to be printed regardless of whether
there is a match, then use --include-zero.
This overrides the --count-matches flag. Note that when --count is combined
with --only-matching, then ripgrep behaves as if --count-matches was given.
Expand Down
185 changes: 179 additions & 6 deletions crates/matcher/src/lib.rs
Original file line number Diff line number Diff line change
Expand Up @@ -618,12 +618,31 @@ pub trait Matcher {
fn find_iter<F>(
&self,
haystack: &[u8],
matched: F,
) -> Result<(), Self::Error>
where
F: FnMut(Match) -> bool,
{
self.find_iter_at(haystack, 0, matched)
}

/// Executes the given function over successive non-overlapping matches
/// in `haystack`. If no match exists, then the given function is never
/// called. If the function returns `false`, then iteration stops.
///
/// The significance of the starting point is that it takes the surrounding
/// context into consideration. For example, the `\A` anchor can only
/// match when `at == 0`.
fn find_iter_at<F>(
&self,
haystack: &[u8],
at: usize,
mut matched: F,
) -> Result<(), Self::Error>
where
F: FnMut(Match) -> bool,
{
self.try_find_iter(haystack, |m| Ok(matched(m)))
self.try_find_iter_at(haystack, at, |m| Ok(matched(m)))
.map(|r: Result<(), ()>| r.unwrap())
}

Expand All @@ -637,12 +656,35 @@ pub trait Matcher {
fn try_find_iter<F, E>(
&self,
haystack: &[u8],
matched: F,
) -> Result<Result<(), E>, Self::Error>
where
F: FnMut(Match) -> Result<bool, E>,
{
self.try_find_iter_at(haystack, 0, matched)
}

/// Executes the given function over successive non-overlapping matches
/// in `haystack`. If no match exists, then the given function is never
/// called. If the function returns `false`, then iteration stops.
/// Similarly, if the function returns an error then iteration stops and
/// the error is yielded. If an error occurs while executing the search,
/// then it is converted to
/// `E`.
///
/// The significance of the starting point is that it takes the surrounding
/// context into consideration. For example, the `\A` anchor can only
/// match when `at == 0`.
fn try_find_iter_at<F, E>(
&self,
haystack: &[u8],
at: usize,
mut matched: F,
) -> Result<Result<(), E>, Self::Error>
where
F: FnMut(Match) -> Result<bool, E>,
{
let mut last_end = 0;
let mut last_end = at;
let mut last_match = None;

loop {
Expand Down Expand Up @@ -696,12 +738,33 @@ pub trait Matcher {
&self,
haystack: &[u8],
caps: &mut Self::Captures,
matched: F,
) -> Result<(), Self::Error>
where
F: FnMut(&Self::Captures) -> bool,
{
self.captures_iter_at(haystack, 0, caps, matched)
}

/// Executes the given function over successive non-overlapping matches
/// in `haystack` with capture groups extracted from each match. If no
/// match exists, then the given function is never called. If the function
/// returns `false`, then iteration stops.
///
/// The significance of the starting point is that it takes the surrounding
/// context into consideration. For example, the `\A` anchor can only
/// match when `at == 0`.
fn captures_iter_at<F>(
&self,
haystack: &[u8],
at: usize,
caps: &mut Self::Captures,
mut matched: F,
) -> Result<(), Self::Error>
where
F: FnMut(&Self::Captures) -> bool,
{
self.try_captures_iter(haystack, caps, |caps| Ok(matched(caps)))
self.try_captures_iter_at(haystack, at, caps, |caps| Ok(matched(caps)))
.map(|r: Result<(), ()>| r.unwrap())
}

Expand All @@ -716,12 +779,36 @@ pub trait Matcher {
&self,
haystack: &[u8],
caps: &mut Self::Captures,
matched: F,
) -> Result<Result<(), E>, Self::Error>
where
F: FnMut(&Self::Captures) -> Result<bool, E>,
{
self.try_captures_iter_at(haystack, 0, caps, matched)
}

/// Executes the given function over successive non-overlapping matches
/// in `haystack` with capture groups extracted from each match. If no
/// match exists, then the given function is never called. If the function
/// returns `false`, then iteration stops. Similarly, if the function
/// returns an error then iteration stops and the error is yielded. If
/// an error occurs while executing the search, then it is converted to
/// `E`.
///
/// The significance of the starting point is that it takes the surrounding
/// context into consideration. For example, the `\A` anchor can only
/// match when `at == 0`.
fn try_captures_iter_at<F, E>(
&self,
haystack: &[u8],
at: usize,
caps: &mut Self::Captures,
mut matched: F,
) -> Result<Result<(), E>, Self::Error>
where
F: FnMut(&Self::Captures) -> Result<bool, E>,
{
let mut last_end = 0;
let mut last_end = at;
let mut last_match = None;

loop {
Expand Down Expand Up @@ -819,13 +906,35 @@ pub trait Matcher {
haystack: &[u8],
caps: &mut Self::Captures,
dst: &mut Vec<u8>,
append: F,
) -> Result<(), Self::Error>
where
F: FnMut(&Self::Captures, &mut Vec<u8>) -> bool,
{
self.replace_with_captures_at(haystack, 0, caps, dst, append)
}

/// Replaces every match in the given haystack with the result of calling
/// `append` with the matching capture groups.
///
/// If the given `append` function returns `false`, then replacement stops.
///
/// The significance of the starting point is that it takes the surrounding
/// context into consideration. For example, the `\A` anchor can only
/// match when `at == 0`.
fn replace_with_captures_at<F>(
&self,
haystack: &[u8],
at: usize,
caps: &mut Self::Captures,
dst: &mut Vec<u8>,
mut append: F,
) -> Result<(), Self::Error>
where
F: FnMut(&Self::Captures, &mut Vec<u8>) -> bool,
{
let mut last_match = 0;
self.captures_iter(haystack, caps, |caps| {
let mut last_match = at;
self.captures_iter_at(haystack, at, caps, |caps| {
let m = caps.get(0).unwrap();
dst.extend(&haystack[last_match..m.start]);
last_match = m.end;
Expand Down Expand Up @@ -1039,6 +1148,18 @@ impl<'a, M: Matcher> Matcher for &'a M {
(*self).find_iter(haystack, matched)
}

fn find_iter_at<F>(
&self,
haystack: &[u8],
at: usize,
matched: F,
) -> Result<(), Self::Error>
where
F: FnMut(Match) -> bool,
{
(*self).find_iter_at(haystack, at, matched)
}

fn try_find_iter<F, E>(
&self,
haystack: &[u8],
Expand All @@ -1050,6 +1171,18 @@ impl<'a, M: Matcher> Matcher for &'a M {
(*self).try_find_iter(haystack, matched)
}

fn try_find_iter_at<F, E>(
&self,
haystack: &[u8],
at: usize,
matched: F,
) -> Result<Result<(), E>, Self::Error>
where
F: FnMut(Match) -> Result<bool, E>,
{
(*self).try_find_iter_at(haystack, at, matched)
}

fn captures(
&self,
haystack: &[u8],
Expand All @@ -1070,6 +1203,19 @@ impl<'a, M: Matcher> Matcher for &'a M {
(*self).captures_iter(haystack, caps, matched)
}

fn captures_iter_at<F>(
&self,
haystack: &[u8],
at: usize,
caps: &mut Self::Captures,
matched: F,
) -> Result<(), Self::Error>
where
F: FnMut(&Self::Captures) -> bool,
{
(*self).captures_iter_at(haystack, at, caps, matched)
}

fn try_captures_iter<F, E>(
&self,
haystack: &[u8],
Expand All @@ -1082,6 +1228,19 @@ impl<'a, M: Matcher> Matcher for &'a M {
(*self).try_captures_iter(haystack, caps, matched)
}

fn try_captures_iter_at<F, E>(
&self,
haystack: &[u8],
at: usize,
caps: &mut Self::Captures,
matched: F,
) -> Result<Result<(), E>, Self::Error>
where
F: FnMut(&Self::Captures) -> Result<bool, E>,
{
(*self).try_captures_iter_at(haystack, at, caps, matched)
}

fn replace<F>(
&self,
haystack: &[u8],
Expand All @@ -1107,6 +1266,20 @@ impl<'a, M: Matcher> Matcher for &'a M {
(*self).replace_with_captures(haystack, caps, dst, append)
}

fn replace_with_captures_at<F>(
&self,
haystack: &[u8],
at: usize,
caps: &mut Self::Captures,
dst: &mut Vec<u8>,
append: F,
) -> Result<(), Self::Error>
where
F: FnMut(&Self::Captures, &mut Vec<u8>) -> bool,
{
(*self).replace_with_captures_at(haystack, at, caps, dst, append)
}

fn is_match(&self, haystack: &[u8]) -> Result<bool, Self::Error> {
(*self).is_match(haystack)
}
Expand Down
34 changes: 24 additions & 10 deletions crates/printer/src/json.rs
Original file line number Diff line number Diff line change
Expand Up @@ -4,14 +4,14 @@ use std::time::Instant;

use grep_matcher::{Match, Matcher};
use grep_searcher::{
Searcher, Sink, SinkContext, SinkContextKind, SinkError, SinkFinish,
SinkMatch,
Searcher, Sink, SinkContext, SinkContextKind, SinkFinish, SinkMatch,
};
use serde_json as json;

use counter::CounterWriter;
use jsont;
use stats::Stats;
use util::find_iter_at_in_context;

/// The configuration for the JSON printer.
///
Expand Down Expand Up @@ -603,7 +603,12 @@ impl<'p, 's, M: Matcher, W: io::Write> JSONSink<'p, 's, M, W> {

/// Execute the matcher over the given bytes and record the match
/// locations if the current configuration demands match granularity.
fn record_matches(&mut self, bytes: &[u8]) -> io::Result<()> {
fn record_matches(
&mut self,
searcher: &Searcher,
bytes: &[u8],
range: std::ops::Range<usize>,
) -> io::Result<()> {
self.json.matches.clear();
// If printing requires knowing the location of each individual match,
// then compute and stored those right now for use later. While this
Expand All @@ -612,12 +617,17 @@ impl<'p, 's, M: Matcher, W: io::Write> JSONSink<'p, 's, M, W> {
// the extent that it's easy to ensure that we never do more than
// one search to find the matches.
let matches = &mut self.json.matches;
self.matcher
.find_iter(bytes, |m| {
matches.push(m);
find_iter_at_in_context(
searcher,
&self.matcher,
bytes,
range.clone(),
|m| {
let (s, e) = (m.start() - range.start, m.end() - range.start);
matches.push(Match::new(s, e));
true
})
.map_err(io::Error::error_message)?;
},
)?;
// Don't report empty matches appearing at the end of the bytes.
if !matches.is_empty()
&& matches.last().unwrap().is_empty()
Expand Down Expand Up @@ -691,7 +701,11 @@ impl<'p, 's, M: Matcher, W: io::Write> Sink for JSONSink<'p, 's, M, W> {
self.after_context_remaining = searcher.after_context() as u64;
}

self.record_matches(mat.bytes())?;
self.record_matches(
searcher,
mat.buffer(),
mat.bytes_range_in_buffer(),
)?;
self.stats.add_matches(self.json.matches.len() as u64);
self.stats.add_matched_lines(mat.lines().count() as u64);

Expand Down Expand Up @@ -720,7 +734,7 @@ impl<'p, 's, M: Matcher, W: io::Write> Sink for JSONSink<'p, 's, M, W> {
self.after_context_remaining.saturating_sub(1);
}
let submatches = if searcher.invert_match() {
self.record_matches(ctx.bytes())?;
self.record_matches(searcher, ctx.bytes(), 0..ctx.bytes().len())?;
SubMatches::new(ctx.bytes(), &self.json.matches)
} else {
SubMatches::empty()
Expand Down
Loading

0 comments on commit e183c02

Please sign in to comment.