Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multiline lookbehind, empty submatches array #1412

Closed
roblourens opened this issue Oct 28, 2019 · 1 comment
Closed

Multiline lookbehind, empty submatches array #1412

roblourens opened this issue Oct 28, 2019 · 1 comment
Labels
bug A bug. rollup A PR that has been merged with many others in a rollup.

Comments

@roblourens
Copy link
Contributor

If I search with a lookbehind pattern that matches a newline, and use the --json flag, I get the match info but it has an empty submatches array. Example:

$ echo -e 'foo\nbar' > test.txt
$ rg --auto-hybrid-regex --multiline --json '(?<=foo\n)bar' test.txt
{"type":"begin","data":{"path":{"text":"test.txt"}}}
{"type":"match","data":{"path":{"text":"test.txt"},"lines":{"text":"bar\n"},"line_number":2,"absolute_offset":4,"submatches":[]}}
{"type":"end","data":{"path":{"text":"test.txt"},"binary_offset":null,"stats":{"elapsed":{"secs":0,"nanos":40700,"human":"0.000041s"},"searches":1,"searches_with_match":1,"bytes_searched":8,"bytes_printed":183,"matched_lines":1,"matches":0}}}
{"data":{"elapsed_total":{"human":"0.002328s","nanos":2328499,"secs":0},"stats":{"bytes_printed":183,"bytes_searched":8,"elapsed":{"human":"0.000041s","nanos":40700,"secs":0},"matched_lines":1,"matches":0,"searches":1,"searches_with_match":1}},"type":"summary"}

Compare to what I get when matching with lookbehind on one line:

$ rg --auto-hybrid-regex --multiline --json '(?<=b)ar' test.txt 
...
{"type":"match","data":{"path":{"text":"test.txt"},"lines":{"text":"bar\n"},"line_number":2,"absolute_offset":4,"submatches":[{"match":{"text":"ar"},"start":1,"end":3}]}}

Or just matching that line

$ rg --auto-hybrid-regex --multiline --json 'bar' test.txt
...
{"type":"match","data":{"path":{"text":"test.txt"},"lines":{"text":"bar\n"},"line_number":2,"absolute_offset":4,"submatches":[{"match":{"text":"bar"},"start":0,"end":3}]}}

Same with lookahead that matches a newline.

@BurntSushi BurntSushi added the bug A bug. label Mar 15, 2020
@alessandroasm
Copy link
Contributor

Hi, I looked into what is happening in this case.

/// Execute the matcher over the given bytes and record the match
/// locations if the current configuration demands match granularity.
fn record_matches(&mut self, bytes: &[u8]) -> io::Result<()> {
self.json.matches.clear();
// If printing requires knowing the location of each individual match,
// then compute and stored those right now for use later. While this
// adds an extra copy for storing the matches, we do amortize the
// allocation for it and this greatly simplifies the printing logic to
// the extent that it's easy to ensure that we never do more than
// one search to find the matches.
let matches = &mut self.json.matches;

The issue is that grep_searcher::SinkMatch doesn't store captured groups. When JSONSink needs details about capture groups, it reruns the matcher against the bytes in SinkMatch, but those bytes only contain the current line, which in turn results in a non-match scenario.

I think the way to go here will be including submatch data in grep_searcher::SinkMatch. That way we won't need to rerun the matcher to get this information.

@BurntSushi I can implement that if you agree to this changes

@BurntSushi BurntSushi added the rollup A PR that has been merged with many others in a rollup. label May 31, 2021
BurntSushi added a commit that referenced this issue Jun 1, 2021
This commit hacks in a bug fix for handling look-around across multiple
lines. The main problem is that by the time the matching lines are sent
to the printer, the surrounding context---which some look-behind or
look-ahead might have matched---could have been dropped if it wasn't
part of the set of matching lines. Therefore, when the printer re-runs
the regex engine in some cases (to do replacements, color matches, etc
etc), it won't be guaranteed to see the same matches that the searcher
found.

Overall, this is a giant clusterfuck and suggests that the way I divided
the abstraction boundary between the printer and the searcher is just
wrong. It's likely that the searcher needs to handle more of the work of
matching and pass that info on to the printer. The tricky part is that
this additional work isn't always needed. Ultimately, this means a
serious re-design of the interface between searching and printing. Sigh.

The way this fix works is to smuggle the underlying buffer used by the
searcher through into the printer. Since these bugs only impact
multi-line search (otherwise, searches are only limited to matches
across a single line), and since multi-line search always requires
having the entire file contents in a single contiguous slice (memory
mapped or on the heap), it follows that the buffer we pass through when
we need it is, in fact, the entire haystack. So this commit refactors
the printer's regex searching to use that buffer instead of the intended
bundle of bytes containing just the relevant matching portions of that
same buffer.

There is one last little hiccup: PCRE2 doesn't seem to have a way to
specify an ending position for a search. So when we re-run the search to
find matches, we can't say, "but don't search past here." Since the
buffer is likely to contain the entire file, we really cannot do
anything here other than specify a fixed upper bound on the number of
bytes to search. So if look-ahead goes more than N bytes beyond the
match, this code will break by simply being unable to find the match. In
practice, this is probably pretty rare. I believe that if we did a
better fix for this bug by fixing the interfaces, then we'd probably try
to have PCRE2 find the pertinent matches up front so that it never needs
to re-discover them.

Fixes #1412
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug A bug. rollup A PR that has been merged with many others in a rollup.
Projects
None yet
Development

No branches or pull requests

3 participants