Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implement backwards regex matching #96

Closed
BurntSushi opened this issue Jun 25, 2015 · 6 comments · Fixed by #190
Closed

implement backwards regex matching #96

BurntSushi opened this issue Jun 25, 2015 · 6 comments · Fixed by #190

Comments

@BurntSushi
Copy link
Member

With @WillEngler working on the DFA, I'm itching to do more. One idea I got from grep (and RE2) is the capability of running automata backwards over the input.

For example, today, we are very good at literal prefix matching because the prefixes get compiled down to a DFA. But what if the regex has a literal suffix? We don't do anything clever. What if we had the ability to run the engines backwards? We could identify suffixes just like prefixes, and then run the engine backwards from the suffix.

The possibilities get even better. What if the regex has neither a literal prefix nor a literal suffix, but has a string somewhere in the middle that must be matched? e.g., \w+foo\w+. We could use a fast substring search to find foo and then, in theory, run the engine both backwards and forwards to find the start and ending locations.

Finally, we'll want machine reversal to make the most out of a DFA.

@BurntSushi BurntSushi self-assigned this Jun 25, 2015
@BurntSushi
Copy link
Member Author

I've been noodling on this for the past week. Supporting suffixes seems somewhat straight-forward assuming we get backwards search working. However, supporting the string-in-the-middle optimization is harder. In particular, we want to discover the set of all literal substrings such that at least one of them must appear in the search text in order for the regex to match. rsc gives an algebraic formulation here. But we need to go further.

Say we have our set of literal substrings and we build an automaton to match them (via AC). So we run it on the search text and find a match. Now what? In grep, it does line-based matching, so it just needs to extend in both directions until it finds a new line character or beginning/end of input. But in the general case, this can't be done. However, if we knew where the required literal started in the regex instruction sequence, then we could start the machine at the substring match and run it both forwards and backwards (ignoring capture groups). This would give us the start & end locations, after which, we could run the full NFA on just that substring.

rsc's algebraic formulation doesn't give us this information because it is algebraic. We instead need to perform the same operation on the regex instruction sequence, which forms a graph. Doing it this way seems a little simpler because we just need to find all paths from the start of the graph to the match state, then for each path, concatenate the literals. Unioning all of the concatenations should yield the set of required literals. (This process will of course need to be bounded, particularly around character classes!)

@WillEngler
Copy link

@BurntSushi Sounds promising! I haven't read the codesearch code or rsc's Trigram Index post deeply enough to follow why the algebraic formulation can't give start and end locations, but it does seem like the solution you're outlining is simpler.

One question about "... then for each path, concatenate the literals, ..." By literals, do you mean all Char and Ranges instructions? I'm trying to visualize how this would work with byte matching.

@BurntSushi
Copy link
Member Author

The algebraic formulation is on the abstract syntax of a regex. The abstract syntax is, well, abstract. It's divorced from the automaton, which means extracting literals from it and the instruction at which that literal starts is impossible (or very hard).

One question about "... then for each path, concatenate the literals, ..." By literals, do you mean all Char and Ranges instructions? I'm trying to visualize how this would work with byte matching.

Yeah, that's right. The key is that the Ranges instruction is superfluous in the current instruction set. It could technically be implemented in terms of Split instructions, but it is much more efficient to define a separate instruction for binary search. So in my formulation here, I have to treat ranges as splits in the graph. e.g., a[cd]z should produce the same literals as a(c|d)z: acz and adz.

The part I still have to iron out is how to bound it. e.g., a.z could also be considered as an a followed by many splits corresponding to every character (sans \n). The idea I have handles this just fine, but obviously it needs to be bounded! I think this is where byte ranges might need to be handle specially, but it shouldn't be too bad. (i.e., You probably shouldn't cut the literal in the middle of an encoded codepoint!)

@BurntSushi
Copy link
Member Author

(When I understand my approach better, I'll write out some simpler rules. But I think I have to play with code first.)

@WillEngler
Copy link

The algebraic formulation is on the abstract syntax of a regex. The abstract syntax is, well, abstract. It's divorced from the automaton, which means extracting literals from it and the instruction at which that literal starts is impossible (or very hard).

Thanks, that clears things up.

The part I still have to iron out is how to bound it. e.g., a.z could also be considered as an a followed by many splits corresponding to every character (sans \n).

Ah, interesting! I'm curious to see how you'll handle it.

@BurntSushi BurntSushi removed their assignment Sep 21, 2015
BurntSushi added a commit that referenced this issue Mar 27, 2016
The principle change in this commit is a complete rewrite of how
literals are detected from a regular expression. In particular, we now
traverse the abstract syntax to discover literals instead of the
compiled byte code. This permits more tuneable control over which and
how many literals are extracted, and is now exposed in the
`regex-syntax` crate so that others can benefit from it.

Other changes in this commit:

* The Boyer-Moore algorithm was rewritten to use my own concoction based
  on frequency analysis. We end up regressing on a couple benchmarks
  slightly because of this, but gain in some others and in general should
  be faster in a broader number of cases. (Principally because we try to
  run `memchr` on the rarest byte in a literal.) This should also greatly
  improve handling of non-Western text.
* A "reverse suffix" literal optimization was added. That is, if suffix
  literals exist but no prefix literals exist, then we can quickly scan
  for suffix matches and then run the DFA in reverse to find matches.
  (I'm not aware of any other regex engine that does this.)
* The mutex-based pool has been replaced with a spinlock-based pool
  (from the new `mempool` crate). This reduces some amount of constant
  overhead and improves several benchmarks that either search short
  haystacks or find many matches in long haystacks.
* Search parameters have been refactored.
* RegexSet can now contain 0 or more regular expressions (previously, it
  could only contain 2 or more). The InvalidSet error variant is now
  deprecated.
* A bug in computing start states was fixed. Namely, the DFA assumed the
  start states was always the first instruction, which is trivially
  wrong for an expression like `^☃$`. This bug persisted because it
  typically occurred when a literal optimization would otherwise run.
* A new CLI tool, regex-debug, has been added as a non-published
  sub-crate. The CLI tool can answer various facts about regular
  expressions, such as printing its AST, its compiled byte code or its
  detected literals.

Closes #96, #188, #189
BurntSushi added a commit that referenced this issue Mar 27, 2016
The principle change in this commit is a complete rewrite of how
literals are detected from a regular expression. In particular, we now
traverse the abstract syntax to discover literals instead of the
compiled byte code. This permits more tuneable control over which and
how many literals are extracted, and is now exposed in the
`regex-syntax` crate so that others can benefit from it.

Other changes in this commit:

* The Boyer-Moore algorithm was rewritten to use my own concoction based
  on frequency analysis. We end up regressing on a couple benchmarks
  slightly because of this, but gain in some others and in general should
  be faster in a broader number of cases. (Principally because we try to
  run `memchr` on the rarest byte in a literal.) This should also greatly
  improve handling of non-Western text.
* A "reverse suffix" literal optimization was added. That is, if suffix
  literals exist but no prefix literals exist, then we can quickly scan
  for suffix matches and then run the DFA in reverse to find matches.
  (I'm not aware of any other regex engine that does this.)
* The mutex-based pool has been replaced with a spinlock-based pool
  (from the new `mempool` crate). This reduces some amount of constant
  overhead and improves several benchmarks that either search short
  haystacks or find many matches in long haystacks.
* Search parameters have been refactored.
* RegexSet can now contain 0 or more regular expressions (previously, it
  could only contain 2 or more). The InvalidSet error variant is now
  deprecated.
* A bug in computing start states was fixed. Namely, the DFA assumed the
  start states was always the first instruction, which is trivially
  wrong for an expression like `^☃$`. This bug persisted because it
  typically occurred when a literal optimization would otherwise run.
* A new CLI tool, regex-debug, has been added as a non-published
  sub-crate. The CLI tool can answer various facts about regular
  expressions, such as printing its AST, its compiled byte code or its
  detected literals.

Closes #96, #188, #189
BurntSushi added a commit that referenced this issue Mar 28, 2016
The principle change in this commit is a complete rewrite of how
literals are detected from a regular expression. In particular, we now
traverse the abstract syntax to discover literals instead of the
compiled byte code. This permits more tuneable control over which and
how many literals are extracted, and is now exposed in the
`regex-syntax` crate so that others can benefit from it.

Other changes in this commit:

* The Boyer-Moore algorithm was rewritten to use my own concoction based
  on frequency analysis. We end up regressing on a couple benchmarks
  slightly because of this, but gain in some others and in general should
  be faster in a broader number of cases. (Principally because we try to
  run `memchr` on the rarest byte in a literal.) This should also greatly
  improve handling of non-Western text.
* A "reverse suffix" literal optimization was added. That is, if suffix
  literals exist but no prefix literals exist, then we can quickly scan
  for suffix matches and then run the DFA in reverse to find matches.
  (I'm not aware of any other regex engine that does this.)
* The mutex-based pool has been replaced with a spinlock-based pool
  (from the new `mempool` crate). This reduces some amount of constant
  overhead and improves several benchmarks that either search short
  haystacks or find many matches in long haystacks.
* Search parameters have been refactored.
* RegexSet can now contain 0 or more regular expressions (previously, it
  could only contain 2 or more). The InvalidSet error variant is now
  deprecated.
* A bug in computing start states was fixed. Namely, the DFA assumed the
  start states was always the first instruction, which is trivially
  wrong for an expression like `^☃$`. This bug persisted because it
  typically occurred when a literal optimization would otherwise run.
* A new CLI tool, regex-debug, has been added as a non-published
  sub-crate. The CLI tool can answer various facts about regular
  expressions, such as printing its AST, its compiled byte code or its
  detected literals.

Closes #96, #188, #189
@BurntSushi
Copy link
Member Author

This was done in 31a317e. It isn't quite ambitious as what I had in mind, but it does indeed scan for suffix literals, and if one is found, runs the DFA in reverse to find the start location of the match.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants