implement backwards regex matching #96
I've been noodling on this for the past week. Supporting suffixes seems fairly straightforward assuming we get backwards search working. However, supporting the string-in-the-middle optimization is harder. In particular, we want to discover the set of all literal substrings such that at least one of them must appear in the search text in order for the regex to match. rsc gives an algebraic formulation here. But we need to go further. Say we have our set of literal substrings and we build an automaton to match them (via Aho-Corasick). So we run it on the search text and find a match. Now what? rsc's algebraic formulation doesn't give us this information precisely because it is algebraic. We instead need to perform the same operation on the regex instruction sequence, which forms a graph. Doing it this way seems a little simpler because we just need to find all paths from the start of the graph to the match state, then for each path, concatenate the literals. Unioning all of the concatenations should yield the set of required literals. (This process will of course need to be bounded, particularly around character classes!)
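To make the path idea a bit more concrete, here is a minimal sketch under stated assumptions: the `Node` type and `required_literals` function are hypothetical and are not part of the crate, and the toy graph is assumed to be acyclic with each node carrying an optional literal fragment. It enumerates every path from the start node to the match node, concatenates the fragments along each path, and unions the results, bailing out once the set exceeds a bound.

```rust
use std::collections::BTreeSet;

/// A toy instruction-graph node: an optional literal fragment plus the
/// indices of its successors. (Hypothetical; the real compiled program is
/// much richer and may contain loops, which would need to be cut or bounded.)
struct Node {
    literal: Option<&'static str>,
    succs: Vec<usize>,
}

/// Union of the literal concatenations along every path from `start` to
/// `target`, assuming the graph is acyclic. Returns `None` (i.e., give up
/// on the optimization) if the set grows past `limit`.
fn required_literals(
    graph: &[Node],
    start: usize,
    target: usize,
    limit: usize,
) -> Option<BTreeSet<String>> {
    let mut out = BTreeSet::new();
    let mut stack = vec![(start, String::new())];
    while let Some((node, mut acc)) = stack.pop() {
        if let Some(lit) = graph[node].literal {
            acc.push_str(lit);
        }
        if node == target {
            out.insert(acc);
            if out.len() > limit {
                return None;
            }
            continue;
        }
        for &succ in &graph[node].succs {
            stack.push((succ, acc.clone()));
        }
    }
    Some(out)
}

fn main() {
    // foo(bar|baz)qux as a fragment graph: 0 -> {1, 2} -> 3.
    let graph = vec![
        Node { literal: Some("foo"), succs: vec![1, 2] },
        Node { literal: Some("bar"), succs: vec![3] },
        Node { literal: Some("baz"), succs: vec![3] },
        Node { literal: Some("qux"), succs: vec![] },
    ];
    // Yields {"foobarqux", "foobazqux"}: at least one must appear for a match.
    println!("{:?}", required_literals(&graph, 0, 3, 16));
}
```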
@BurntSushi Sounds promising! I haven't read the codesearch code or rsc's Trigram Index post deeply enough to follow why the algebraic formulation can't give start and end locations, but the solution you're outlining does seem simpler. One question about "... then for each path, concatenate the literals ...": by literals, do you mean all of the literals in the instruction sequence?
The algebraic formulation is on the abstract syntax of a regex. The abstract syntax is, well, abstract. It's divorced from the automaton, which means extracting a literal together with the instruction at which that literal starts is impossible (or at least very hard).
Yeah, that's right. The part I still have to iron out is how to bound it, e.g., around character classes.
(When I understand my approach better, I'll write out some simpler rules. But I think I have to play with code first.)
Thanks, that clears things up.
Ah, interesting! I'm curious to see how you'll handle it.
The principal change in this commit is a complete rewrite of how literals are detected from a regular expression. In particular, we now traverse the abstract syntax to discover literals instead of the compiled byte code. This permits more tunable control over which and how many literals are extracted, and is now exposed in the `regex-syntax` crate so that others can benefit from it.

Other changes in this commit:

* The Boyer-Moore algorithm was rewritten to use my own concoction based on frequency analysis. We end up regressing slightly on a couple of benchmarks because of this, but gain on some others and in general should be faster in a broader number of cases. (Principally because we try to run `memchr` on the rarest byte in a literal.) This should also greatly improve handling of non-Western text.
* A "reverse suffix" literal optimization was added. That is, if suffix literals exist but no prefix literals exist, then we can quickly scan for suffix matches and then run the DFA in reverse to find matches. (I'm not aware of any other regex engine that does this.)
* The mutex-based pool has been replaced with a spinlock-based pool (from the new `mempool` crate). This reduces some amount of constant overhead and improves several benchmarks that either search short haystacks or find many matches in long haystacks.
* Search parameters have been refactored.
* RegexSet can now contain 0 or more regular expressions (previously, it could only contain 2 or more). The InvalidSet error variant is now deprecated.
* A bug in computing start states was fixed. Namely, the DFA assumed the start state was always the first instruction, which is trivially wrong for an expression like `^☃$`. This bug persisted because it typically occurred when a literal optimization would otherwise run.
* A new CLI tool, regex-debug, has been added as a non-published sub-crate. The CLI tool can answer various facts about regular expressions, such as printing its AST, its compiled byte code, or its detected literals.

Closes #96, #188, #189
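As an aside on the frequency-analysis point above, the rough idea of running `memchr` on the rarest byte of a literal can be sketched like this. This is a minimal illustration rather than the crate's actual implementation: the `rank` table below is made up for the example, and the real heuristic uses a far more detailed byte-frequency table. It uses `memchr::memchr` from the memchr crate.

```rust
use memchr::memchr;

/// A very rough byte-frequency ranking: lower means rarer. This table is
/// invented for the sketch; the real heuristic is much more detailed.
fn rank(b: u8) -> u8 {
    match b {
        b' ' | b'e' | b't' | b'a' | b'o' | b'i' | b'n' => 255,
        b if b.is_ascii_lowercase() => 200,
        b if b.is_ascii_uppercase() || b.is_ascii_digit() => 100,
        _ => 10,
    }
}

/// Find `needle` in `haystack` by running memchr on the needle's rarest
/// byte and verifying the whole literal around each candidate position.
fn find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    // Index and value of the rarest byte within the needle.
    let (rare_i, &rare_b) = needle.iter().enumerate().min_by_key(|&(_, &b)| rank(b))?;
    let mut at = 0;
    while let Some(i) = memchr(rare_b, &haystack[at..]) {
        // Back up from the rare byte to where the full needle would start.
        if let Some(start) = (at + i).checked_sub(rare_i) {
            if haystack[start..].starts_with(needle) {
                return Some(start);
            }
        }
        at += i + 1;
    }
    None
}

fn main() {
    let hay = b"the quick brown fox jumps over the lazy dog";
    // The rarest byte of "lazy" (per the toy table) is 'l', which memchr
    // finds directly at the start of the match.
    assert_eq!(find(hay, b"lazy"), Some(35));
    println!("ok");
}
```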
This was done in 31a317e. It isn't quite as ambitious as what I had in mind, but it does indeed scan for suffix literals, and if one is found, runs the DFA in reverse to find the start location of the match.
With @WillEngler working on the DFA, I'm itching to do more. One idea I got from `grep` (and RE2) is the capability of running automata backwards over the input.

For example, today we are very good at literal prefix matching because the prefixes get compiled down to a DFA. But what if the regex has a literal suffix? We don't do anything clever. What if we had the ability to run the engines backwards? We could identify suffixes just like prefixes, and then run the engine backwards from the suffix.

The possibilities get even better. What if the regex has neither a literal prefix nor a literal suffix, but has a string somewhere in the middle that must be matched? e.g., `\w+foo\w+`. We could use a fast substring search to find `foo` and then, in theory, run the engine both backwards and forwards to find the start and ending locations.

Finally, we'll want machine reversal to make the most out of a DFA.
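To illustrate the string-in-the-middle idea, here is a hand-rolled sketch for the `\w+foo\w+` example. The function name and the ad hoc scanning loops are purely illustrative and ASCII-only: a substring search locates the required literal `foo`, and then we walk backwards and forwards from it to recover the match boundaries. In the real engine, those walks would be reverse and forward automaton runs rather than these loops.

```rust
/// Find something like \w+foo\w+ by locating the inner literal first and
/// then expanding outwards. ASCII-only \w and no leftmost-first semantics;
/// this is just a sketch of the search strategy.
fn find_word_foo_word(haystack: &str) -> Option<(usize, usize)> {
    let bytes = haystack.as_bytes();
    let is_word = |b: u8| b.is_ascii_alphanumeric() || b == b'_';
    let mut from = 0;
    while let Some(i) = haystack[from..].find("foo") {
        let lit_start = from + i;
        let lit_end = lit_start + 3;
        // Walk backwards over \w+ (must consume at least one byte).
        let mut start = lit_start;
        while start > 0 && is_word(bytes[start - 1]) {
            start -= 1;
        }
        // Walk forwards over \w+ (must consume at least one byte).
        let mut end = lit_end;
        while end < bytes.len() && is_word(bytes[end]) {
            end += 1;
        }
        if start < lit_start && end > lit_end {
            return Some((start, end));
        }
        from = lit_end;
    }
    None
}

fn main() {
    // Matches "xxfooyy" inside the haystack: byte offsets 4..11.
    assert_eq!(find_word_foo_word("::: xxfooyy :::"), Some((4, 11)));
    println!("ok");
}
```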