implement backwards regex matching #96

BurntSushi · 2015-06-25T13:06:56Z

With @WillEngler working on the DFA, I'm itching to do more. One idea I got from grep (and RE2) is the capability of running automata backwards over the input.

For example, today, we are very good at literal prefix matching because the prefixes get compiled down to a DFA. But what if the regex has a literal suffix? We don't do anything clever. What if we had the ability to run the engines backwards? We could identify suffixes just like prefixes, and then run the engine backwards from the suffix.

The possibilities get even better. What if the regex has neither a literal prefix nor a literal suffix, but has a string somewhere in the middle that must be matched? e.g., \w+foo\w+. We could use a fast substring search to find foo and then, in theory, run the engine both backwards and forwards to find the start and ending locations.

Finally, we'll want machine reversal to make the most out of a DFA.

The text was updated successfully, but these errors were encountered:

BurntSushi · 2015-07-01T16:43:13Z

I've been noodling on this for the past week. Supporting suffixes seems somewhat straight-forward assuming we get backwards search working. However, supporting the string-in-the-middle optimization is harder. In particular, we want to discover the set of all literal substrings such that at least one of them must appear in the search text in order for the regex to match. rsc gives an algebraic formulation here. But we need to go further.

Say we have our set of literal substrings and we build an automaton to match them (via AC). So we run it on the search text and find a match. Now what? In grep, it does line-based matching, so it just needs to extend in both directions until it finds a new line character or beginning/end of input. But in the general case, this can't be done. However, if we knew where the required literal started in the regex instruction sequence, then we could start the machine at the substring match and run it both forwards and backwards (ignoring capture groups). This would give us the start & end locations, after which, we could run the full NFA on just that substring.

rsc's algebraic formulation doesn't give us this information because it is algebraic. We instead need to perform the same operation on the regex instruction sequence, which forms a graph. Doing it this way seems a little simpler because we just need to find all paths from the start of the graph to the match state, then for each path, concatenate the literals. Unioning all of the concatenations should yield the set of required literals. (This process will of course need to be bounded, particularly around character classes!)

WillEngler · 2015-07-01T20:12:40Z

@BurntSushi Sounds promising! I haven't read the codesearch code or rsc's Trigram Index post deeply enough to follow why the algebraic formulation can't give start and end locations, but it does seem like the solution you're outlining is simpler.

One question about "... then for each path, concatenate the literals, ..." By literals, do you mean all Char and Ranges instructions? I'm trying to visualize how this would work with byte matching.

BurntSushi · 2015-07-01T20:28:44Z

The algebraic formulation is on the abstract syntax of a regex. The abstract syntax is, well, abstract. It's divorced from the automaton, which means extracting literals from it and the instruction at which that literal starts is impossible (or very hard).

One question about "... then for each path, concatenate the literals, ..." By literals, do you mean all Char and Ranges instructions? I'm trying to visualize how this would work with byte matching.

Yeah, that's right. The key is that the Ranges instruction is superfluous in the current instruction set. It could technically be implemented in terms of Split instructions, but it is much more efficient to define a separate instruction for binary search. So in my formulation here, I have to treat ranges as splits in the graph. e.g., a[cd]z should produce the same literals as a(c|d)z: acz and adz.

The part I still have to iron out is how to bound it. e.g., a.z could also be considered as an a followed by many splits corresponding to every character (sans \n). The idea I have handles this just fine, but obviously it needs to be bounded! I think this is where byte ranges might need to be handle specially, but it shouldn't be too bad. (i.e., You probably shouldn't cut the literal in the middle of an encoded codepoint!)

BurntSushi · 2015-07-01T20:29:12Z

(When I understand my approach better, I'll write out some simpler rules. But I think I have to play with code first.)

WillEngler · 2015-07-01T20:42:49Z

The algebraic formulation is on the abstract syntax of a regex. The abstract syntax is, well, abstract. It's divorced from the automaton, which means extracting literals from it and the instruction at which that literal starts is impossible (or very hard).

Thanks, that clears things up.

The part I still have to iron out is how to bound it. e.g., a.z could also be considered as an a followed by many splits corresponding to every character (sans \n).

Ah, interesting! I'm curious to see how you'll handle it.

The principle change in this commit is a complete rewrite of how literals are detected from a regular expression. In particular, we now traverse the abstract syntax to discover literals instead of the compiled byte code. This permits more tuneable control over which and how many literals are extracted, and is now exposed in the `regex-syntax` crate so that others can benefit from it. Other changes in this commit: * The Boyer-Moore algorithm was rewritten to use my own concoction based on frequency analysis. We end up regressing on a couple benchmarks slightly because of this, but gain in some others and in general should be faster in a broader number of cases. (Principally because we try to run `memchr` on the rarest byte in a literal.) This should also greatly improve handling of non-Western text. * A "reverse suffix" literal optimization was added. That is, if suffix literals exist but no prefix literals exist, then we can quickly scan for suffix matches and then run the DFA in reverse to find matches. (I'm not aware of any other regex engine that does this.) * The mutex-based pool has been replaced with a spinlock-based pool (from the new `mempool` crate). This reduces some amount of constant overhead and improves several benchmarks that either search short haystacks or find many matches in long haystacks. * Search parameters have been refactored. * RegexSet can now contain 0 or more regular expressions (previously, it could only contain 2 or more). The InvalidSet error variant is now deprecated. * A bug in computing start states was fixed. Namely, the DFA assumed the start states was always the first instruction, which is trivially wrong for an expression like `^☃$`. This bug persisted because it typically occurred when a literal optimization would otherwise run. * A new CLI tool, regex-debug, has been added as a non-published sub-crate. The CLI tool can answer various facts about regular expressions, such as printing its AST, its compiled byte code or its detected literals. Closes #96, #188, #189

BurntSushi · 2016-03-28T20:44:48Z

This was done in 31a317e. It isn't quite ambitious as what I had in mind, but it does indeed scan for suffix literals, and if one is found, runs the DFA in reverse to find the start location of the match.

BurntSushi self-assigned this Jun 25, 2015

BurntSushi added the enhancement label Jun 25, 2015

BurntSushi removed their assignment Sep 21, 2015

BurntSushi mentioned this issue Mar 27, 2016

Major literal optimization refactoring. #190

Merged

BurntSushi closed this as completed in #190 Mar 28, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

implement backwards regex matching #96

implement backwards regex matching #96

BurntSushi commented Jun 25, 2015

BurntSushi commented Jul 1, 2015

WillEngler commented Jul 1, 2015

BurntSushi commented Jul 1, 2015

BurntSushi commented Jul 1, 2015

WillEngler commented Jul 1, 2015

BurntSushi commented Mar 28, 2016

implement backwards regex matching #96

implement backwards regex matching #96

Comments

BurntSushi commented Jun 25, 2015

BurntSushi commented Jul 1, 2015

WillEngler commented Jul 1, 2015

BurntSushi commented Jul 1, 2015

BurntSushi commented Jul 1, 2015

WillEngler commented Jul 1, 2015

BurntSushi commented Mar 28, 2016