
Regex optimization opportunities #1349

Closed
33 of 41 tasks
stephentoub opened this issue Jan 7, 2020 · 5 comments
Labels: area-System.Text.RegularExpressions · help wanted (good issue for external contributors) · needs-further-triage · tenet-performance (performance related issue)
Milestone: 5.0
stephentoub commented Jan 7, 2020

We’ve made some very measurable improvements to Regex post-.NET Core 3.1 (e.g. #271, #449, #542, #1348), but there’s still a lot of opportunity for incremental improvement (we can also consider a more foundational re-implementation if desired). Having spent some time over the holidays looking at the implementation, here’s a brain dump of some ideas that can be explored:

Vectorization / faster searching:

  • Vectorize and/or remove Boyer-Moore. We should explore whether a simple String/Span.IndexOf(char), which is now vectorized, will end up yielding greater benefits than the Boyer-Moore implementation currently employed. Alternatively, we should at least look at augmenting the implementation with a String/Span.IndexOf(char) to take advantage of its vectorization when searching for the next starting point. We can then also potentially add additional support for searching for multiple substrings in a vectorized manner. Rust’s regex engine does this with the Teddy algorithm.
  • Improve Boyer-Moore's prefix. If we keep Boyer-Moore, it can be significantly improved. In particular, to find the string that it looks for, the RegexFCD.Prefix method is called on the RegexNode tree to find the prefix string to search for, and as the comment on the method says “It's quite trivial and gives up easily.” For example, it’s common to see a regex like \bsomeword\b, but the moment RegexFCD.Prefix sees that \b boundary, it gives up.
  • Use Boyer-Moore for internal literals. We currently only look for strings starting the regex with which to use Boyer-Moore. But if the regex doesn't begin with a literal but has one internally that's always a fixed position from its start, e.g. [abc]def, we could still use Boyer-Moore with "def" and then back up by the offset for the actual matching engine.
  • Use Aho–Corasick for FindFirstChar. We currently use Boyer-Moore when there's a single prefix to be searched for, e.g. with a regex like "abcd(efg|hij)". But if we instead have a regex like "(efg|hij)abcd", we fall back to doing a search for "[eh]". With Aho-Corasick, we could instead do a search for both "efg" and "hij" concurrently, or even "efgabcd" and "hijabcd". It can also be vectorized to some extent.
  • Improve what we can vectorize in FindFirstChar. Given a regex like (hi|hello) there, we’ll do a vectorized search for h, but given [\w]+, we’ll iterate through the characters one-by-one doing a bit vector lookup on each character to see whether it’s a word character. We might be able to take advantage of some of the newer SSE/AVX intrinsics to do this faster.
  • Add notoneloop{Atomic} search to backtracking compiler. In the non-backtracking compiler, we use span.IndexOf to search for the character. We can do the same with string.IndexOf in the full-featured compiler.
  • Add setloop IndexOfAny search to non-backtracking compiler. When we encounter a notoneloopatomic in the non-backtracking compiler, we emit an IndexOf to search for the target character. We could do the same with IndexOfAny(char, char) and IndexOfAny(char, char, char) when we have a setloopatomic around a set with only two or three negated chars.
  • Unroll more or less. We currently unroll the comparison for a string, e.g. the text hello is matched with a series of generated comparisons, one for each character. In contrast, for repeaters, we code gen an actual loop. We should explore in what situations it might be beneficial to unroll (partially) a repeater as well, or conversely, whether there are cases where we’re better off generating a loop for strings rather than unrolling.
  • Use string/span.IndexOf{Any} methods in interpreter. As long as we can quickly determine that they should be used, it should in general be a win to use these methods from the interpreter in some of the same places we use them in the compiled implementation(s).
  • Compare more than one char at a time. In some cases, such as when validating a string, we compare one character at a time (even if it's via an unrolled loop). We should be able to read a larger unit and compare it against a larger unit, e.g. rather than reading a char and comparing it against 'n' and then another and comparing it against 'o', we could read an Int32 and compare it against something like 'n' << 16 | 'o'. We can do this with Int64 or even SIMD vectors.
  • IndexNotOf. Consider adding a vectorized IndexNotOf method (https://github.com/dotnet/corefx/issues/35594#issuecomment-572834317), and then use that to implement oneloop{atomic} as well as setloop{atomic} for a set containing just a few known characters. If we wanted to do this but weren't yet convinced about public API surface area, IndexNotOf could be added as an internal method in System.Text.RegularExpressions.dll, and then used from the in-memory DynamicMethod compiled regexes (but not from those produced by CompileToAssembly).
  • Improve vectorization of existing helpers. We use existing .NET searching methods now whenever possible to implement the various performed searches, e.g. IndexOf, IndexOfAny, etc. Most of these have some level of vectorization now, but considering how much time we can end up spending in the routines while executing regexes, it's worth spending more time optimizing them, e.g. Improve SpanHelpers.IndexOfAny throughput / vectorization #25023
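The "skip to the next viable starting point with a fast substring search" idea that runs through several of these bullets can be sketched in Python. This is an illustration of the concept only, not the .NET implementation; the function names are invented, and str.find stands in for a vectorized String/Span.IndexOf:

```python
import re

def find_candidates(text, prefix):
    """Yield candidate match-start positions by hopping between
    occurrences of a known literal prefix with str.find (a fast
    memchr-style search), instead of probing every single index."""
    i = text.find(prefix)
    while i != -1:
        yield i
        i = text.find(prefix, i + 1)

def match_with_prefix_skip(pattern, prefix, text):
    """Run the full pattern only at positions where its required
    literal prefix occurs -- the same role FindFirstChar plays
    before handing off to the matching engine."""
    compiled = re.compile(pattern)
    for i in find_candidates(text, prefix):
        m = compiled.match(text, i)
        if m:
            return m
    return None
```

The payoff grows with the rarity of the prefix: the expensive engine runs only at a handful of candidate offsets rather than at every position in the input.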

Tree rewriting:

  • Factor out substrings from alternation. Consider a regex like (?:this|that) is great. This could be automatically rewritten as th(?:is|at) is great, which has several potential benefits: a) it exposes a prefix that the Boyer-Moore matching will see and be able to use in FindFirstChar, b) it reduces the amount of backtracking that’ll be necessary (backtracking to after rather than to before the "th"), and c) the string might be able to partake in other optimizations, such as being used to convert earlier loops into atomic ones.
  • Improve auto-atomicification. I added basic support to find backtracking constructs where backtracking would never help and to automatically make them atomic. That has three benefits: it makes failures to match much faster, it results in smaller code gen in general, and it enables our non-backtracking implementation to have a greater chance of kicking in. However, while the implementation covers a decent number of patterns, it leaves more on the table. For example, consider the regex '[^']*' which finds a pair of single quotes and everything between them. We’ll detect that the [^']* can’t possibly benefit from backtracking, because it’ll never match the character that’s required after this loop. However, if we had that exact same loop as part of an alternation, e.g. ('[^']*|something)', we won’t detect that. Currently the implementation only does two checks: it looks at successor nodes in a concatenation, and it looks at the last node in the regex that has no successors. Obviously there are plenty of other cases where analysis can quickly prove upgrading to an atomic grouping is correct. (Note that other implementations call this auto-possessification, named after a possessive quantifier. .NET doesn’t have a possessive quantifier, but a possessive quantifier is just shorthand for an atomic group, which .NET does have.)
  • Avoid unnecessary alternation backtracking. Consider a regex like (Hello|Hi), .*\. How are you\? If we feed this a string like Hello, steve. How are ya?, it’ll match the Hello, , but fail to match the rest… at that point, it’ll then start over and try with Hi, but there’s no point in doing so, as if Hello matched, Hi can’t possibly. We can detect this statically from the regex pattern, and then if every branch N has a required prefix that’s not a substring of every < N’s prefix, make the alternation atomic, e.g. wrap it in (?> … ).
  • Eliminate unnecessary alternation branches. We can statically determine if it's impossible for an alternation branch N + 1 to match after we've gotten to a certain place in matching branch N, e.g. given the alternation "abc|def" we know that once we match the 'a', there's no point in even considering the second branch. We should be able to utilize that knowledge to avoid unnecessarily checking branches when a previous one fails to match.
  • Make alternations atomic when possible. If we see that if any branch of the alternation matches it's impossible for another branch to match (e.g. if they all begin with different letters), then we can make the alternation atomic, as no amount of backtracking into the alternation after it's matched will help.
  • Optimize regexes starting with .*. If the RegexNode.FinalOptimize step sees that the Regex begins with a star loop (or .+), as long as it’s not part of an explicit capture or alternation, it can prepend a ^ anchor. That can avoid a ton of work as FindFirstChar will then return false on a subsequent call rather than bumping by 1 and going again. It might be easier to do this in the parser.
  • Merge more nodes in concatenation reduction. We already do a pass over concatenations to reduce the nodes and combine them where we can, e.g. combining five consecutive "one" nodes into a "multi". But we can do more, e.g. combining an expression like "[abcd]+[abcd]" into "[abcd]{2,}". This has multiple benefits. For one, it allows us more freedom in how we generate optimized code / handling of the tree. Second, it exposes additional optimization opportunity, e.g. if the expression begins with ..* and we combine that to be .+, we can then more easily see that the expression begins with a star loop that lets us automatically anchor it.
  • Better node searching for atomic overlap comparison check. We now attempt to convert non-atomic loops into atomic ones whenever possible. We do this by finding a loop and then looking at what comes after it and seeing if there's any overlap, e.g. with an expression [ab]*[cd], we'll see that there's no overlap between what can be matched by the loop and what comes next, and we'll make the loop atomic. We would similarly do that if the expression were [ab]*[cd]+, because we see that what comes after the first loop must occur at least once. However, if the expression were [ab]*[cd]*, we won't make the first loop atomic, because we see that the second loop has a min repeat count of 0 and thus isn't guaranteed to appear. We could be more aggressive here, and look at all possible subsequent nodes. For example, here we would see that there's nothing after the second loop in the expression, and so if the second pattern does appear, we won't overlap with it, and if the second pattern doesn't appear, we're at the end, so either way we can make it atomic. Similarly, if the expression were [ab]*[cd]*[ef], we would make both loops atomic; for the first, we'd see that it could be followed by either [cd] or [ef], and it doesn't overlap with either.
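The "factor out substrings from alternation" rewrite above can be shown concretely. A hedged sketch, assuming all branches are plain literals (real tree rewriting operates on RegexNode children, not pattern strings):

```python
import os
import re

def factor_prefix(branches):
    """Hoist the longest common prefix out of literal alternation
    branches: ["this", "that"] -> "th(?:is|at)". The hoisted prefix
    becomes visible to prefix-based first-char searching, and failed
    branches backtrack to after it rather than before it."""
    prefix = os.path.commonprefix(branches)
    if not prefix:
        return "(?:" + "|".join(branches) + ")"
    tails = [b[len(prefix):] for b in branches]
    return prefix + "(?:" + "|".join(tails) + ")"
```

The rewritten pattern matches exactly the same strings, which is what makes it a safe automatic transformation.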

Better handling of specific subexpressions:

  • Optimize .*somestring. It’s pretty common to see a star loop followed by a string. Today’s implementation is very naïve: it dutifully searches for the \n, tries to compare the string there, then backs off by 1, tries to compare the string there, then backs off 1, tries to compare the string there, and so on. We can instead use an algorithm like Aho-Corasick to search back through the star loop, or even just a vectorized LastIndexOf for the string.
  • Add Capture support to the non-backtracking code gen implementation. This is one of the main things that keeps a non-trivial number of regexes from switching to using this faster implementation, primarily from what I’ve seen because developers don’t think about the fact that () is actually a capture group, and so something like foo(bar|zoo) which uses the parens to separate the alternation is actually also adding a capture (instead of the dev having written foo(?:bar|zoo)). Adding capture support to the non-backtracking codegen shouldn’t be hard, and it’ll allow more cases to switch over to that implementation.
  • Enlighten FindFirstChar about Bol anchor. The beginning of line anchor currently isn't used by FindFirstChar, but we should be able to use it to do a fast IndexOf search for a newline in order to position the search at the next viable location. This would make the ^ anchor much faster when RegexOptions.Multiline is used (though it's not clear how common it is to use it).
  • Reordering checks in codegen. We currently always evaluate the match in order, e.g. for a pattern "a\sb", we'll check the 'a', then the whitespace, and then the 'b'. But that's not actually required. For fixed-size regions like this, we can choose to reorder the checks and do the cheapest ones first, and/or do first the ones most likely to not match. For example, in this expression, we might want to do the 'b' check before the whitespace check, because the single comparison for the 'b' is going to be faster than the char.IsWhitespace for the \s. As another example, if the expression were "a\sz", all else being equal, we're probably less likely to see 'z' in text being matched than we are 'a', so if we had the choice to do the 'z' check first, it might be wise to do so.
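The .*somestring case above has a neat closed form: because the loop is greedy, the engine's backtracking ultimately lands on the last occurrence of the literal, which a single backwards search finds directly. A Python sketch of that equivalence, assuming single-line input (so '.' can match everything before the literal):

```python
import re

def search_dotstar_literal(text, literal):
    """Simulate matching the greedy pattern '.*literal' with one
    backwards scan (a LastIndexOf) instead of comparing the literal
    at the end, backing off one char, comparing again, and so on."""
    i = text.rfind(literal)
    if i == -1:
        return None
    return (0, i + len(literal))  # span the regex engine would report
```

One rfind replaces O(n) failed literal comparisons, and rfind itself can be vectorized.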

Character class matching:

  • Improve handling of Unicode characters with negative sets. Consider a set like [^"]. This matches every character except a quote. Today we generate an ASCII bitmap for each set, but then for inputs >= 128, if it’s possible a Unicode character could match, we codegen a fallback that calls RegexCharClass.CharInClass. However, with a negation like this one, we know that not only might some Unicode characters match, we know that every Unicode character is going to match. So, instead of generating a fallback that calls RegexCharClass.CharInClass, we can just generate a fallback that returns true. This will not only significantly speed up the handling of Unicode characters, it’ll also result in smaller code, which should help overall performance. For example, today we end up generating code along the lines of if ((uint)char < 128) result = lookup[char]; else result = CharInClass(char); for a set like this, we can instead generate something like result = (uint)char >= 128 || lookup[char];.
  • Improve handling of sets with Unicode categories. Consider a set with a single Unicode category, like \d. Today we generate a lookup table for ASCII, and then fall back to calling CharInClass (in actuality, `\d` has special handling). We can instead just generate a call to char.GetUnicodeCategory and compare its category to the desired constant. We might not even need to generate the fast-path lookup table for this case, as GetUnicodeCategory already has a fast-path lookup table for ASCII, but we’d need to measure. Then consider a set like \w. It maps to multiple Unicode categories. Instead of having the fast-path lookup table and then a fallback that calls CharInClass, we could have the fallback be a call to UnicodeCategory.GetUnicodeCategory, and codegen the comparison against the known Unicode categories we care about.
  • Explore using larger lookup tables than 128. For example, would doubling the size to all characters < 256 help in enough additional cases to be worthwhile?
  • Built-in helper inlining. For known sets like \s, we now call out to a helper method, e.g. char.IsWhitespace. We should validate those calls are getting inlined.
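The "ASCII bitmap plus cheap non-ASCII fallback" shape discussed above can be sketched in Python. A hedged illustration, assuming the set itself contains only ASCII characters (so for a negated class like [^"] the fallback collapses to a constant rather than a CharInClass-style call):

```python
def build_matcher(ascii_chars, negated):
    """Char-class matcher: a 128-entry lookup table for ASCII code
    points plus a constant-result fallback for everything else.
    Assumes ascii_chars holds only characters below 128."""
    table = [((chr(c) in ascii_chars) != negated) for c in range(128)]

    def matches(ch):
        cp = ord(ch)
        # Non-ASCII input: a negated ASCII-only set matches every such
        # char; a non-negated one matches none. No per-char call needed.
        return table[cp] if cp < 128 else negated

    return matches
```

Shrinking the fallback to a constant is what removes the expensive out-of-line call for the common negated-set case.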

General overhead:

  • Avoid two virtual calls plus two delegate calls per character. Each call to FindFirstChar is a virtual call, and it invokes the DynamicMethod via a delegate. If it matches, we then make a virtual call to Go, and it again calls the generated code via a delegate. That means if we frequently find a starting character and it doesn’t match the regex, we end up paying four virtual/delegate invocations for the character on top of the actual matching logic. We could change the codegen for FindFirstChar to also do the work of Go (even if it just calls Go, though better would be actually integrated); at that point Go can be a nop and would only be invoked by the base loop when there was actually a match.
  • Avoid duplicate work between FindFirstChar and Go. Similar / overlapping with the previous case, we currently duplicate work between FindFirstChar and Go in some cases. For example, a simple regex like hello will first validate the match in FindFirstChar and then redo the match in Go, even though we know it already matched. In such a case, we should be able to effectively make Go a nop.
  • Avoid unnecessary RegexWriter usage. We currently do the non-backtracking check after we’re already in the process of reflection emitting code. If we were to do it earlier, we could avoid using RegexWriter at all, since its result is just ignored if we take the non-backtracking path.
  • Reduce allocations specific to backtracking. RegexRunner initializes several int[]s for backtracking-related state, but if we end up employing the non-backtracking code generator, these are wasted. We can avoid creating them at all if they won't be used. While the runner gets reused for serial use, if a Regex instance is used concurrently (which happens implicitly when using the static Regex methods), threads that don't get the shared runner will end up creating their own temporary instance, and each of those will end up incurring these unnecessary allocations.
  • Better bounds-check elimination. We don't currently generate code in a way that will optimally eliminate bounds checks, e.g. we're not eliminating the bounds checks on the character class strings we output for fast lookups. Since we're doing code generation and spitting out IL directly, we could choose to just use IL-based opcodes to manipulate refs directly and avoid bounds checking altogether. (I tried eliminating all bounds checking and it made very little difference.)
  • Add a minimum length optimization. We already walk the RegexNode tree as part of the check to see whether we can use the non-backtracking implementation; as part of that (or separate from it) we can easily compute a minimum bound on the length of any matching string, e.g. foo[abc]{2,4} has a minimum length of 5. FindFirstChar can check upfront to see whether the input string is at least that long, and if it’s not, immediately fail.
  • Replace when length == replacement length. We currently do an analysis to determine the minimum length a regex can match. It would not be hard to also do an analysis for the maximum length a regex can match. If those values are the same, then every match would be a specific length, and if that length is the same length as a string used in Regex.Replace as the replacement, we can be more efficient about how we do the replacement. We currently build up a list of segments (text/start/count) and then create a new string and copy everything into it. But if we know the length of the output string will be identical to the length of the input string, regardless of how many replacements happen (because replacements won't change the length), then we can create the string and do the replace directly into it, rather than building up the intermediate list of segments.
  • Replace falling back to string.Replace. We could analyze a regex to determine that it contains simple text (e.g. a single character or a "multi"), and in such cases for some of the overloads fall back to just calling string.Replace.
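The minimum-length optimization above is a simple recursive fold over the pattern tree. A sketch on a toy AST (the node shapes here are invented for illustration; the real computation would walk RegexNode):

```python
def min_length(node):
    """Lower bound on match length for a tiny regex AST with nodes
    ('lit', s), ('set',), ('rep', child, lo, hi), ('cat', [children]),
    and ('alt', [children])."""
    kind = node[0]
    if kind == 'lit':          # literal string: its exact length
        return len(node[1])
    if kind == 'set':          # character class: exactly one char
        return 1
    if kind == 'rep':          # repeater: min count * child's minimum
        return node[2] * min_length(node[1])
    if kind == 'cat':          # concatenation: sum of minimums
        return sum(min_length(c) for c in node[1])
    if kind == 'alt':          # alternation: cheapest branch wins
        return min(min_length(c) for c in node[1])
    raise ValueError(kind)
```

With this bound, a FindFirstChar-style prelude can reject any input shorter than the minimum before doing any scanning at all.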

Fundamental engine changes / replacements:

  • Employ a form of tiered compilation. If someone doesn't specify RegexOptions.Compiled, we could consider tracking usage of the Regex instance, and when it hits a certain threshold, asynchronously upgrade it to being compiled. We'd queue the compilation of the expression, and then backpatch the instance with the upgraded runner, such that all future invocations would use the compiled version instead. The upside to this is faster execution when RegexOptions.Compiled should have been specified but wasn't. There are a few potential downsides. We'd need to always carry around the compiler in a trimmed app, even if the compiler wouldn't otherwise be used. We'd potentially waste time/space compiling a regex that wasn't actually important. And we'd risk subtle discrepancies that show up over time if we ever have a bug that results in different execution semantics between non-compiled and compiled.
  • Try to unify the multiple go method code generators. I added the non-backtracking implementation because lots of regexes don’t actually need the backtracking, and it adds a non-trivial amount of overhead. We might, however, be able to be smarter about how we implement things, and make it pay-for-play, with a single implementation that only adds in the backtracking code if a construct requires it. We partially have that with the two implementations today, where we check to see if we can use one and then fallback to the other, so the goal here would be to see if we can a) minimize some of the duplication, and b) maintain the non-backtracking codegen benefits in more cases.
  • PCRE light-up. Just as we analyze the RegexNode tree to see whether we can use our non-backtracking implementation, we could analyze it to see whether we could use PCRE, and do so if it’s dynamically found on the machine. The .NET APIs support features PCRE doesn’t, so it would only work if a subset of functionality was required.
  • Consider using memoization to reduce backtracking costs. This can help avoid the huge costs of backtracking, and the expense of significant memory consumption. Some implementations balance this by having a max cache size and evicting from the cache when it's full and new states are added. Essentially for each backtracking construct in the expression (e.g. a loop), we track all positions in the input string for which no match was found, and then before backtracking we check to see if we've tried that position before. Worst case, this is O(expression_length * input_length) space, but optimizations are possible on top of that, e.g. storing ranges instead of all individual items, only storing X results that are most impactful, etc.
  • Consider using a DFA implementation for some regexes. Other implementations do this, with multiple engines employed based on the target expression.
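The memoization idea above is easiest to see on a toy engine. A hedged sketch supporting only literal chars, '.', and 'x*' loops, with a cache keyed on (pattern position, input position) so each backtracking state is explored at most once, bounding the work at O(len(pattern) * len(text)) instead of exponential:

```python
from functools import lru_cache

def match_with_memo(pattern, text):
    """Toy full-match backtracking matcher with memoized states.
    Without the cache, patterns like 'a*a*a*b' can revisit the same
    (pattern_pos, text_pos) pairs exponentially many times."""
    @lru_cache(maxsize=None)
    def m(pi, ti):
        if pi == len(pattern):
            return ti == len(text)
        first = ti < len(text) and pattern[pi] in (text[ti], '.')
        if pi + 1 < len(pattern) and pattern[pi + 1] == '*':
            # Either skip the loop entirely, or consume one char and stay.
            return m(pi + 2, ti) or (first and m(pi, ti + 1))
        return first and m(pi + 1, ti + 1)

    return m(0, 0)
```

The space cost is the cache itself, which is what motivates bounding or evicting it in a production engine.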

cc: @danmosemsft, @eerhardt, @ViktorHofer, @davidwrighton

@stephentoub stephentoub added this to the 5.0 milestone Jan 7, 2020
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Jan 7, 2020
@stephentoub

cc: @pgovind

@stephentoub stephentoub removed the untriaged New issue has not been triaged by the area owner label Jan 7, 2020
@danmoseley danmoseley added the help wanted [up-for-grabs] Good issue for external contributors label Jan 7, 2020
@danmoseley

We use Boyer-Moore or whatever to find the prefix, then start the engine from that point. If the regex had a literal suffix, which was not in an alternation or anything of that sort, could we (a) verify that literal suffix exists and/or (b) avoid greedily consuming any of the suffix? It appears that Rust searches for literal suffixes in some circumstances.

@stephentoub

could we (a) verify that literal suffix exists

Yes, though it doesn't need to just be a suffix; it could be any character or characters guaranteed to be in any matching string. However, it's not clear to me that's always going to be a net win. The benefit is that we could avoid checking more complicated parts of the pattern if that check failed, but such a check also isn't free. It's easier to justify if there are no variable length regions before the substring being searched for, as then you can directly index to look for it at the right offset, but at that point you're also validating the pattern, and you might be better off just reorganizing how the go method implementation works to check for that first, and also do the work item(s) I've mentioned of decreasing the overhead of switching back and forth between FindFirstChar and Go.
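The pre-check being discussed, verifying up front that some literal required by any match is present at all, can be sketched simply. A hedged illustration (the function name is invented; re.search stands in for running the full engine):

```python
import re

def search_with_literal_precheck(pattern, required_literal, text):
    """Bail out before running the engine if a literal that must
    appear in every match is absent from the input. The pre-check is
    a single fast substring scan; the win depends on how often it
    short-circuits versus its added cost on matching inputs."""
    if required_literal not in text:
        return None
    return re.search(pattern, text)
```

This mirrors the trade-off described above: the check is nearly free when it fails fast, but pure overhead when the literal is usually present.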

b) avoid greedily consuming any of the suffix

I'm not sure what you mean by this.

danmoseley commented Feb 6, 2020

Here are 3 regexes we have seen be high cost in our own services, with your notes @stephentoub :

  •  ^(?![;\\s])[^\\p{Cc}]+(?<!\\s)$
    
    • The [^\p{Cc}] set will currently compile down to a lookup table for ASCII and a > 127 fallback call to CharInClass, e.g. bool result = !(ch < 128 ? (lookup[ch >> 4] & (1 << (ch & 0xF))) != 0 : RegexRunner.CharInClass(“…”, ch));
    • We would need to measure, but we would probably be better off seeing that the set has only a single Unicode category, and generating a direct call to handle it, e.g. bool result = CharUnicodeInfo.GetUnicodeCategory(ch) != UnicodeCategory.Control; If it turns out that’s not actually faster for the fast path, it would probably at least be smaller code than the CharInClass call for the fallback path. This would just be modifying the EmitMatchCharacterClass function in RegexCompiler to special-case when a char class has a single Unicode category.
  •   LinePrefix:\s*([A-Za-z0-9#,]*) 
    
    • We’ll currently make that second loop into an atomic one, as it’s at the end, but we won’t convert the first one to be atomic, which means there could be unnecessary backtracking. We won’t convert it today, even though we can see that there’s no overlap between the \s and the [A-Za-z0-9#,] set, because the latter set has a minimum loop count of 0, and we don’t look past it.
    • We could (should?) teach our logic to check all possible subsequent nodes, in which case it would see that this subsequent node has a min count of 0 and that there was nothing after it, and thus knowing that these two sets didn’t overlap, it would happily make the first loop atomic. This is actually probably a worthwhile thing to do, but it’s not super easy. I’d guesstimate it’s probably a day's work.
  •   (\b(([A-Z]([a-z/]){0,50})|([0-9])*)[\.]?s?([\+]|([\.])|\b))
    
    • We won’t do a good job today generating FindFirstChar for this expression. That |\b at the end will confound the analyzer that looks for all starting characters, and it causes it to fail and return nothing, which means FindFirstChar ends up basically doing nothing and just always returning true. We should be able to factor that logic into the analyzer and come up with something more useful that allows FindFirstChar to do a better job at skipping to a good location at which to run the engine.
    • I added another code generator that generates much nicer code than the existing generator, but I stopped short of adding backtracking support to it, and the alternations in this expression lead to backtracking and thus we won’t generate the nicer code. It’d be nice (but probably non-trivial work) to add in the backtracking support. At that point, the main thing the new generator wouldn’t support would be right-to-left, which also includes things built on top of right-to-left, namely lookbehinds.
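The overlap test behind the second example (\s* followed by [A-Za-z0-9#,]*) reduces to a set-intersection check. A hedged Python sketch of just that predicate, over explicit character sets rather than real char-class nodes:

```python
import string

def can_make_atomic(loop_chars, following_chars):
    """A greedy loop can be made atomic when no character it consumes
    could also begin whatever follows: giving characters back via
    backtracking can then never produce a match that greedy matching
    missed."""
    return not (set(loop_chars) & set(following_chars))
```

For the LinePrefix pattern, whitespace and [A-Za-z0-9#,] are disjoint, so once the trailing min-count-0 node is looked past, the \s* loop qualifies.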

@stephentoub

We've implemented most of this, some of these aren't worth pursuing, we have existing issues for most of the rest, and I've just opened a few remaining issues. As such, I'm going to close this.

MichalStrehovsky added a commit to MichalStrehovsky/runtime that referenced this issue Dec 9, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Jan 5, 2022