
Regex optimization opportunities #1349

Closed
33 of 41 tasks
stephentoub opened this issue Jan 7, 2020 · 5 comments
Labels: area-System.Text.RegularExpressions · help wanted (good issue for external contributors) · needs-further-triage · tenet-performance (performance related issue)
Milestone: 5.0
stephentoub commented Jan 7, 2020

We’ve made some very measurable improvements to Regex post-.NET Core 3.1 (e.g. #271, #449, #542, #1348), but there’s still a lot of opportunity for incremental improvement (we can also consider a more foundational re-implementation if desired). Having spent some time over the holidays looking at the implementation, here’s a brain dump of some ideas that can be explored:

Vectorization / faster searching:

  • Vectorize and/or remove Boyer-Moore. We should explore whether a simple String/Span.IndexOf(char), which is now vectorized, will end up yielding greater benefits than the Boyer-Moore implementation currently employed. Alternatively, we should at least look at augmenting the implementation with a String/Span.IndexOf(char) to take advantage of its vectorization when searching for the next starting point. We can then also potentially add additional support for searching for multiple substrings in a vectorized manner. Rust’s regex engine does this with the Teddy algorithm.
  • Improve Boyer-Moore's prefix. If we keep Boyer-Moore, it can be significantly improved. In particular, to find the string that it looks for, the RegexFCD.Prefix method is called on the RegexNode tree to find the prefix string to search for, and as the comment on the method says “It's quite trivial and gives up easily.” For example, it’s common to see a regex like \bsomeword\b, but the moment RegexFCD.Prefix sees that \b boundary, it gives up.
  • Use Boyer-Moore for internal literals. We currently only look for strings starting the regex with which to use Boyer-Moore. But if the regex doesn't begin with a literal but has one internally that's always a fixed position from its start, e.g. [abc]def, we could still use Boyer-Moore with "def" and then back up by the offset for the actual matching engine.
  • Use Aho–Corasick for FindFirstChar. We currently use Boyer-Moore when there's a single prefix to be searched for, e.g. with a regex like "abcd(efg|hij)". But if we instead have a regex like "(efg|hij)abcd", we fall back to doing a search for "[eh]". With Aho-Corasick, we could instead do a search for both "efg" and "hij" concurrently, or even "efgabcd" and "hijabcd". It can also be vectorized to some extent.
  • Improve what we can vectorize in FindFirstChar. Given a regex like (hi|hello) there, we’ll do a vectorized search for h, but given [\w]+, we’ll iterate through the characters one-by-one doing a bit vector lookup on each character to see whether it’s a word character. We might be able to take advantage of some of the newer SSE/AVX intrinsics to do this faster.
  • Add notoneloop{Atomic} search to backtracking compiler. In the non-backtracking compiler, we use span.IndexOf to search for the character. We can do the same with string.IndexOf in the full-featured compiler.
  • Add setloop IndexOfAny search to non-backtracking compiler. When we encounter a notoneloopatomic in the non-backtracking compiler, we emit an IndexOf to search for the target character. We could do the same with IndexOfAny(char, char) and IndexOfAny(char, char, char) when we have a setloopatomic around a set with only two or three negated chars.
  • Unroll more or less. We currently unroll the comparison for a string, e.g. the text hello is matched with a series of generated comparisons, one for each character. In contrast, for repeaters, we code gen an actual loop. We should explore in what situations it might be beneficial to unroll (partially) a repeater as well, or conversely, whether there are cases where we’re better off generating a loop for strings rather than unrolling.
  • Use string/span.IndexOf{Any} methods in interpreter. As long as we can quickly determine that they should be used, it should in general be a win to use these methods from the interpreter in some of the same places we use them in the compiled implementation(s).
  • Compare more than one char at a time. In some cases, such as when validating a string, we compare one character at a time (even if it's via an unrolled loop). We should be able to read a larger unit and compare it against a larger unit, e.g. rather than reading a char and comparing it against 'n' and then another and comparing it against 'o', we could read an Int32 and compare it against something like 'n' << 16 | 'o'. We can do this with Int64 or even SIMD vectors.
  • IndexNotOf. Consider adding a vectorized IndexNotOf method (https://github.com/dotnet/corefx/issues/35594#issuecomment-572834317), and then use that to implement oneloop{atomic} as well as setloop{atomic} for a set containing just a few known characters. If we wanted to do this but weren't yet convinced about public API surface area, IndexNotOf could be added as an internal method in System.Text.RegularExpressions.dll, and then used from the in-memory DynamicMethod compiled regexes (but not from those produced by CompileToAssembly).
  • Improve vectorization of existing helpers. We use existing .NET searching methods now whenever possible to implement the various performed searches, e.g. IndexOf, IndexOfAny, etc. Most of these have some level of vectorization now, but considering how much time we can end up spending in the routines while executing regexes, it's worth spending more time optimizing them, e.g. Improve SpanHelpers.IndexOfAny throughput / vectorization #25023
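The "skip to the next viable starting point with a fast substring search" idea that runs through several of these bullets can be sketched in Python. This is an illustration of the concept only, not the .NET implementation; the function names are invented, and str.find stands in for a vectorized String/Span.IndexOf:

```python
import re

def find_candidates(text, prefix):
    """Yield candidate match-start positions by hopping between
    occurrences of a known literal prefix with str.find (a fast
    memchr-style search), instead of probing every single index."""
    i = text.find(prefix)
    while i != -1:
        yield i
        i = text.find(prefix, i + 1)

def match_with_prefix_skip(pattern, prefix, text):
    """Run the full pattern only at positions where its required
    literal prefix occurs -- the same role FindFirstChar plays
    before handing off to the matching engine."""
    compiled = re.compile(pattern)
    for i in find_candidates(text, prefix):
        m = compiled.match(text, i)
        if m:
            return m
    return None
```

The payoff grows with the rarity of the prefix: the expensive engine runs only at a handful of candidate offsets rather than at every position in the input.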

Tree rewriting:

  • Factor out substrings from alternation. Consider a regex like (?:this|that) is great. This could be automatically rewritten as th(?:is|at) is great, which has several potential benefits: a) it exposes a prefix that the Boyer-Moore matching will see and be able to use in FindFirstChar, b) it reduces the amount of backtracking that’ll be necessary (backtracking to after rather than to before the "th"), and c) the string might be able to partake in other optimizations, such as being used to convert earlier loops into atomic ones.
  • Improve auto-atomicification. I added basic support to find backtracking constructs where backtracking would never help and to automatically make them atomic. That has three benefits: it makes failures to match much faster, it results in smaller code gen in general, and it enables our non-backtracking implementation to have a greater chance of kicking in. However, while the implementation covers a decent number of patterns, it leaves more on the table. For example, consider the regex '[^']*' which finds a pair of single quotes and everything between them. We’ll detect that the [^']* can’t possibly benefit from backtracking, because it’ll never match the character that’s required after this loop. However, if we had that exact same loop as part of an alternation, e.g. ('[^']*|something)', we won’t detect that. Currently the implementation only does two checks: it looks at successor nodes in a concatenation, and it looks at the last node in the regex that has no successors. Obviously there are plenty of other cases where analysis can quickly prove upgrading to an atomic grouping is correct. (Note that other implementations call this auto-possessification, named after a possessive quantifier. .NET doesn’t have a possessive quantifier, but a possessive quantifier is just shorthand for an atomic group, which .NET does have.)
  • Avoid unnecessary alternation backtracking. Consider a regex like (Hello|Hi), .*\. How are you\? If we feed this a string like Hello, steve. How are ya?, it’ll match the Hello, , but fail to match the rest… at that point, it’ll then start over and try with Hi, but there’s no point in doing so, as if Hello matched, Hi can’t possibly. We can detect this statically from the regex pattern, and then if every branch N has a required prefix that’s not a substring of every < N’s prefix, make the alternation atomic, e.g. wrap it in (?> … ).
  • Eliminate unnecessary alternation branches. We can statically determine if it's impossible for an alternation branch N + 1 to match after we've gotten to a certain place in matching branch N, e.g. given the alternation "abc|def" we know that once we match the 'a', there's no point in even considering the second branch. We should be able to utilize that knowledge to avoid unnecessarily checking branches when a previous one fails to match.
  • Make alternations atomic when possible. If we see that if any branch of the alternation matches it's impossible for another branch to match (e.g. if they all begin with different letters), then we can make the alternation atomic, as no amount of backtracking into the alternation after it's matched will help.
  • Optimize regexes starting with .*. If the RegexNode.FinalOptimize step sees that the Regex begins with a star loop (or .+), as long as it’s not part of an explicit capture or alternation, it can prepend a ^ anchor. That can avoid a ton of work as FindFirstChar will then return false on a subsequent call rather than bumping by 1 and going again. It might be easier to do this in the parser.
  • Merge more nodes in concatenation reduction. We already do a pass over concatenations to reduce the nodes and combine them where we can, e.g. combining five consecutive "one" nodes into a "multi". But we can do more, e.g. combining an expression like "[abcd]+[abcd]" into "[abcd]{2,}". This has multiple benefits. For one, it allows us more freedom in how we generate optimized code / handling of the tree. Second, it exposes additional optimization opportunity, e.g. if the expression begins with ..* and we combine that to be .+, we can then more easily see that the expression begins with a star loop that lets us automatically anchor it.
  • Better node searching for atomic overlap comparison check. We now attempt to convert non-atomic loops into atomic ones whenever possible. We do this by finding a loop and then looking at what comes after it and seeing if there's any overlap, e.g. with an expression [ab]*[cd], we'll see that there's no overlap between what can be matched by the loop and what comes next, and we'll make the loop atomic. We would similarly do that if the expression were [ab]*[cd]+, because we see that what comes after the first loop must occur at least once. However, if the expression were [ab]*[cd]*, we won't make the first loop atomic, because we see that the second loop has a min repeat count of 0 and thus isn't guaranteed to appear. We could be more aggressive here, and look at all possible subsequent nodes. For example, here we would see that there's nothing after the second loop in the expression, and so if the second pattern does appear, we won't overlap with it, and if the second pattern doesn't appear, we're at the end, so either way we can make it atomic. Similarly, if the expression were [ab]*[cd]*[ef], we would make both loops atomic; for the first, we'd see that it could be followed by either [cd] or [ef], and it doesn't overlap with either.
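The "factor out substrings from alternation" rewrite above can be shown concretely. A hedged sketch, assuming all branches are plain literals (real tree rewriting operates on RegexNode children, not pattern strings):

```python
import os
import re

def factor_prefix(branches):
    """Hoist the longest common prefix out of literal alternation
    branches: ["this", "that"] -> "th(?:is|at)". The hoisted prefix
    becomes visible to prefix-based first-char searching, and failed
    branches backtrack to after it rather than before it."""
    prefix = os.path.commonprefix(branches)
    if not prefix:
        return "(?:" + "|".join(branches) + ")"
    tails = [b[len(prefix):] for b in branches]
    return prefix + "(?:" + "|".join(tails) + ")"
```

The rewritten pattern matches exactly the same strings, which is what makes it a safe automatic transformation.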

Better handling of specific subexpressions:

  • Optimize .*somestring. It’s pretty common to see a star loop followed by a string. Today’s implementation is very naïve: it dutifully searches for the \n, tries to compare the string there, then backs off by 1, tries to compare the string there, then backs off 1, tries to compare the string there, and so on. We can instead use an algorithm like Aho-Corasick to search back through the star loop, or even just a vectorized LastIndexOf for the string.
  • Add Capture support to the non-backtracking code gen implementation. This is one of the main things that keeps a non-trivial number of regexes from switching to using this faster implementation, primarily from what I’ve seen because developers don’t think about the fact that () is actually a capture group, and so something like foo(bar|zoo) which uses the parens to separate the alternation is actually also adding a capture (instead of the dev having written foo(?:bar|zoo)). Adding capture support to the non-backtracking codegen shouldn’t be hard, and it’ll allow more cases to switch over to that implementation.
  • Enlighten FindFirstChar about Bol anchor. The beginning of line anchor currently isn't used by FindFirstChar, but we should be able to use it to do a fast IndexOf search for a newline in order to position the search at the next viable location. This would make the ^ anchor much faster when RegexOptions.Multiline is used (though it's not clear how common it is to use it).
  • Reordering checks in codegen. We currently always evaluate the match in order, e.g. for a pattern "a\sb", we'll check the 'a', then the whitespace, and then the 'b'. But that's not actually required. For fixed-size regions like this, we can choose to reorder the checks and do the cheapest ones first, and/or do first the ones most likely to not match. For example, in this expression, we might want to do the 'b' check before the whitespace check, because the single comparison for the 'b' is going to be faster than the char.IsWhitespace for the \s. As another example, if the expression were "a\sz", all else being equal, we're probably less likely to see 'z' in text being matched than we are 'a', so if we had the choice to do the 'z' check first, it might be wise to do so.
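The .*somestring case above has a neat closed form: because the loop is greedy, the engine's backtracking ultimately lands on the last occurrence of the literal, which a single backwards search finds directly. A Python sketch of that equivalence, assuming single-line input (so '.' can match everything before the literal):

```python
import re

def search_dotstar_literal(text, literal):
    """Simulate matching the greedy pattern '.*literal' with one
    backwards scan (a LastIndexOf) instead of comparing the literal
    at the end, backing off one char, comparing again, and so on."""
    i = text.rfind(literal)
    if i == -1:
        return None
    return (0, i + len(literal))  # span the regex engine would report
```

One rfind replaces O(n) failed literal comparisons, and rfind itself can be vectorized.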

Character class matching:

  • Improve handling of Unicode characters with negative sets. Consider a set like [^"]. This matches every character except a quote. Today we generate an ASCII bitmap for each set, but then for inputs >= 128, if it’s possible a Unicode character could match, we codegen a fallback that calls RegexCharClass.CharInClass. However, with a negation like this one, we know that not only might some Unicode characters match, we know that every Unicode character is going to match. So, instead of generating a fallback that calls RegexCharClass.CharInClass, we can just generate a fallback that returns true. This will not only significantly speed up the handling of Unicode characters, it’ll also result in smaller code, which should help overall performance. For example, today we end up generating code along the lines of if ((uint)char < 128) result = lookup[char]; else result = CharInClass(char); for a set like this, we can instead generate something like result = (uint)char >= 128 || lookup[char];.
  • Improve handling of sets with Unicode categories. Consider a set with a single Unicode category, like \d. Today we generate a lookup table for ASCII, and then fall back to calling CharInClass (in actuality, `\d` has special handling). We can instead just generate a call to char.GetUnicodeCategory and compare its category to the desired constant. We might not even need to generate the fast-path lookup table for this case, as GetUnicodeCategory already has a fast-path lookup table for ASCII, but we’d need to measure. Then consider a set like \w. It maps to multiple Unicode categories. Instead of having the fast-path lookup table and then a fallback that calls CharInClass, we could have the fallback be a call to UnicodeCategory.GetUnicodeCategory, and codegen the comparison against the known Unicode categories we care about.
  • Explore using larger lookup tables than 128. For example, would doubling the size to all characters < 256 help in enough additional cases to be worthwhile?
  • Built-in helper inlining. For known sets like \s, we now call out to a helper method, e.g. char.IsWhitespace. We should validate those calls are getting inlined.
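The "ASCII bitmap plus cheap non-ASCII fallback" shape discussed above can be sketched in Python. A hedged illustration, assuming the set itself contains only ASCII characters (so for a negated class like [^"] the fallback collapses to a constant rather than a CharInClass-style call):

```python
def build_matcher(ascii_chars, negated):
    """Char-class matcher: a 128-entry lookup table for ASCII code
    points plus a constant-result fallback for everything else.
    Assumes ascii_chars holds only characters below 128."""
    table = [((chr(c) in ascii_chars) != negated) for c in range(128)]

    def matches(ch):
        cp = ord(ch)
        # Non-ASCII input: a negated ASCII-only set matches every such
        # char; a non-negated one matches none. No per-char call needed.
        return table[cp] if cp < 128 else negated

    return matches
```

Shrinking the fallback to a constant is what removes the expensive out-of-line call for the common negated-set case.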

General overhead:

  • Avoid two virtual calls plus two delegate calls per character. Each call to FindFirstChar is a virtual call, and it invokes the DynamicMethod via a delegate. If it matches, we then make a virtual call to Go, and it again calls the generated code via a delegate. That means if we frequently find a starting character and it doesn’t match the regex, we end up paying four virtual/delegate invocations for the character on top of the actual matching logic. We could change the codegen for FindFirstChar to also do the work of Go (even if it just calls Go, though better would be actually integrated); at that point Go can be a nop and would only be invoked by the base loop when there was actually a match.
  • Avoid duplicate work between FindFirstChar and Go. Similar / overlapping with the previous case, we currently duplicate work between FindFirstChar and Go in some cases. For example, a simple regex like hello will first validate the match in FindFirstChar and then redo the match in Go, even though we know it already matched. In such a case, we should be able to effectively make Go a nop.
  • Avoid unnecessary RegexWriter usage. We currently do the non-backtracking check after we’re already in the process of reflection emitting code. If we were to do it earlier, we could avoid using RegexWriter at all, since its result is just ignored if we take the non-backtracking path.
  • Reduce allocations specific to backtracking. RegexRunner initializes several int[]s for backtracking-related state, but if we end up employing the non-backtracking code generator, these are wasted. We can avoid creating them at all if they won't be used. While the runner gets reused for serial use, if a Regex instance is used concurrently (which happens implicitly when using the static Regex methods), threads that don't get the shared runner will end up creating their own temporary instance, and each of those will end up incurring these unnecessary allocations.
  • Better bounds-check elimination. We don't currently generate code in a way that will optimally eliminate bounds checks, e.g. we're not eliminating the bounds checks on the character class strings we output for fast lookups. Since we're doing code generation and spitting out IL directly, we could choose to just use IL-based opcodes to manipulate refs directly and avoid bounds checking altogether. (I tried eliminating all bounds checking and it made very little difference.)
  • Add a minimum length optimization. We already walk the RegexNode tree as part of the check to see whether we can use the non-backtracking implementation; as part of that (or separate from it) we can easily compute a minimum bound on the length of any matching string, e.g. foo[abc]{2,4} has a minimum length of 5. FindFirstChar can check upfront to see whether the input string is at least that long, and if it’s not, immediately fail.
  • Replace when length == replacement length. We currently do an analysis to determine the minimum length a regex can match. It would not be hard to also do an analysis for the maximum length a regex can match. If those values are the same, then every match would be a specific length, and if that length is the same length as a string used in Regex.Replace as the replacement, we can be more efficient about how we do the replacement. We currently build up a list of segments (text/start/count) and then create a new string and copy everything into it. But if we know the length of the output string will be identical to the length of the input string, regardless of how many replacements happen (because replacements won't change the length), then we can create the string and do the replace directly into it, rather than building up the intermediate list of segments.
  • Replace falling back to string.Replace. We could analyze a regex to determine that it contains simple text (e.g. a single character or a "multi"), and in such cases for some of the overloads fall back to just calling string.Replace.
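The minimum-length optimization above is a simple recursive fold over the pattern tree. A sketch on a toy AST (the node shapes here are invented for illustration; the real computation would walk RegexNode):

```python
def min_length(node):
    """Lower bound on match length for a tiny regex AST with nodes
    ('lit', s), ('set',), ('rep', child, lo, hi), ('cat', [children]),
    and ('alt', [children])."""
    kind = node[0]
    if kind == 'lit':          # literal string: its exact length
        return len(node[1])
    if kind == 'set':          # character class: exactly one char
        return 1
    if kind == 'rep':          # repeater: min count * child's minimum
        return node[2] * min_length(node[1])
    if kind == 'cat':          # concatenation: sum of minimums
        return sum(min_length(c) for c in node[1])
    if kind == 'alt':          # alternation: cheapest branch wins
        return min(min_length(c) for c in node[1])
    raise ValueError(kind)
```

With this bound, a FindFirstChar-style prelude can reject any input shorter than the minimum before doing any scanning at all.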

Fundamental engine changes / replacements:

  • Employ a form of tiered compilation. If someone doesn't specify RegexOptions.Compiled, we could consider tracking usage of the Regex instance, and when it hits a certain threshold, asynchronously upgrade it to being compiled. We'd queue the compilation of the expression, and then backpatch the instance with the upgraded runner, such that all future invocations would use the compiled version instead. The upside to this is faster execution when RegexOptions.Compiled should have been specified but wasn't. There are a few potential downsides. We'd need to always carry around the compiler in a trimmed app, even if the compiler wouldn't otherwise be used. We'd potentially waste time/space compiling a regex that wasn't actually important. And we'd risk subtle discrepancies that show up over time if we ever have a bug that results in different execution semantics between non-compiled and compiled.
  • Try to unify the multiple go method code generators. I added the non-backtracking implementation because lots of regexes don’t actually need the backtracking, and it adds a non-trivial amount of overhead. We might, however, be able to be smarter about how we implement things, and make it pay-for-play, with a single implementation that only adds in the backtracking code if a construct requires it. We partially have that with the two implementations today, where we check to see if we can use one and then fallback to the other, so the goal here would be to see if we can a) minimize some of the duplication, and b) maintain the non-backtracking codegen benefits in more cases.
  • PCRE light-up. Just as we analyze the RegexNode tree to see whether we can use our non-backtracking implementation, we could analyze it to see whether we could use PCRE, and do so if it’s dynamically found on the machine. The .NET APIs support features PCRE doesn’t, so it would only work if a subset of functionality was required.
  • Consider using memoization to reduce backtracking costs. This can help avoid the huge costs of backtracking, and the expense of significant memory consumption. Some implementations balance this by having a max cache size and evicting from the cache when it's full and new states are added. Essentially for each backtracking construct in the expression (e.g. a loop), we track all positions in the input string for which no match was found, and then before backtracking we check to see if we've tried that position before. Worst case, this is O(expression_length * input_length) space, but optimizations are possible on top of that, e.g. storing ranges instead of all individual items, only storing X results that are most impactful, etc.
  • Consider using a DFA implementation for some regexes. Other implementations do this, with multiple engines employed based on the target expression.
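The memoization idea above is easiest to see on a toy engine. A hedged sketch supporting only literal chars, '.', and 'x*' loops, with a cache keyed on (pattern position, input position) so each backtracking state is explored at most once, bounding the work at O(len(pattern) * len(text)) instead of exponential:

```python
from functools import lru_cache

def match_with_memo(pattern, text):
    """Toy full-match backtracking matcher with memoized states.
    Without the cache, patterns like 'a*a*a*b' can revisit the same
    (pattern_pos, text_pos) pairs exponentially many times."""
    @lru_cache(maxsize=None)
    def m(pi, ti):
        if pi == len(pattern):
            return ti == len(text)
        first = ti < len(text) and pattern[pi] in (text[ti], '.')
        if pi + 1 < len(pattern) and pattern[pi + 1] == '*':
            # Either skip the loop entirely, or consume one char and stay.
            return m(pi + 2, ti) or (first and m(pi, ti + 1))
        return first and m(pi + 1, ti + 1)

    return m(0, 0)
```

The space cost is the cache itself, which is what motivates bounding or evicting it in a production engine.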

cc: @danmosemsft, @eerhardt, @ViktorHofer, @davidwrighton

@stephentoub stephentoub added this to the 5.0 milestone Jan 7, 2020
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Jan 7, 2020
@stephentoub

cc: @pgovind

@stephentoub stephentoub removed the untriaged New issue has not been triaged by the area owner label Jan 7, 2020
@danmoseley danmoseley added the help wanted [up-for-grabs] Good issue for external contributors label Jan 7, 2020
@danmoseley

We use Boyer-Moore or whatever to find the prefix, then start the engine from that point. If the regex had a literal suffix, which was not in an alternation or anything of that sort, could we (a) verify that literal suffix exists and/or (b) avoid greedily consuming any of the suffix? It appears that Rust searches for literal suffixes in some circumstances.

@stephentoub

could we (a) verify that literal suffix exists

Yes, though it doesn't need to just be a suffix; it could be any character or characters guaranteed to be in any matching string. However, it's not clear to me that's always going to be a net win. The benefit is that we could avoid checking more complicated parts of the pattern if that check failed, but such a check also isn't free. It's easier to justify if there are no variable length regions before the substring being searched for, as then you can directly index to look for it at the right offset, but at that point you're also validating the pattern, and you might be better off just reorganizing how the go method implementation works to check for that first, and also do the work item(s) I've mentioned of decreasing the overhead of switching back and forth between FindFirstChar and Go.
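The pre-check being discussed, verifying up front that some literal required by any match is present at all, can be sketched simply. A hedged illustration (the function name is invented; re.search stands in for running the full engine):

```python
import re

def search_with_literal_precheck(pattern, required_literal, text):
    """Bail out before running the engine if a literal that must
    appear in every match is absent from the input. The pre-check is
    a single fast substring scan; the win depends on how often it
    short-circuits versus its added cost on matching inputs."""
    if required_literal not in text:
        return None
    return re.search(pattern, text)
```

This mirrors the trade-off described above: the check is nearly free when it fails fast, but pure overhead when the literal is usually present.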

b) avoid greedily consuming any of the suffix

I'm not sure what you mean by this.

danmoseley commented Feb 6, 2020

Here are 3 regexes we have seen be high cost in our own services, with your notes @stephentoub :

  •  ^(?![;\\s])[^\\p{Cc}]+(?<!\\s)$
    
    • The [^\p{Cc}] set will currently compile down to a lookup table for ASCII and a > 127 fallback call to CharInClass, e.g. bool result = !(ch < 128 ? (lookup[ch >> 4] & (1 << (ch & 0xF))) != 0 : RegexRunner.CharInClass(“…”, ch));
    • We would need to measure, but we would probably be better off seeing that the set has only a single Unicode category, and generating a direct call to handle it, e.g. bool result = CharUnicodeInfo.GetUnicodeCategory(ch) != UnicodeCategory.Control; If it turns out that’s not actually faster for the fast path, it would probably at least be smaller code than the CharInClass call for the fallback path. This would just be modifying the EmitMatchCharacterClass function in RegexCompiler to special-case when a char class has a single Unicode category.
  •   LinePrefix:\s*([A-Za-z0-9#,]*) 
    
    • We’ll currently make that second loop into an atomic one, as it’s at the end, but we won’t convert the first one to be atomic, which means there could be unnecessary backtracking. We won’t convert it today, even though we can see that there’s no overlap between the \s and the [A-Za-z0-9#,] set, because the latter set has a minimum loop count of 0, and we don’t look past it.
    • We could (should?) teach our logic to check all possible subsequent nodes, in which case it would see that this subsequent node has a min count of 0 and that there was nothing after it, and thus knowing that these two sets didn’t overlap, it would happily make the first loop atomic. This is actually probably a worthwhile thing to do, but it’s not super easy. I’d guesstimate it’s probably a day's work.
  •   (\b(([A-Z]([a-z/]){0,50})|([0-9])*)[\.]?s?([\+]|([\.])|\b))
    
    • We won’t do a good job today generating FindFirstChar for this expression. That |\b at the end will confound the analyzer that looks for all starting characters, and it causes it to fail and return nothing, which means FindFirstChar ends up basically doing nothing and just always returning true. We should be able to factor that logic into the analyzer and come up with something more useful that allows FindFirstChar to do a better job at skipping to a good location at which to run the engine.
    • I added another code generator that generates much nicer code than the existing generator, but I stopped short of adding backtracking support to it, and the alternations in this expression lead to backtracking and thus we won’t generate the nicer code. It’d be nice (but probably non-trivial work) to add in the backtracking support. At that point, the main thing the new generator wouldn’t support would be right-to-left, which also includes things built on top of right-to-left, namely lookbehinds.
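The overlap test behind the second example (\s* followed by [A-Za-z0-9#,]*) reduces to a set-intersection check. A hedged Python sketch of just that predicate, over explicit character sets rather than real char-class nodes:

```python
import string

def can_make_atomic(loop_chars, following_chars):
    """A greedy loop can be made atomic when no character it consumes
    could also begin whatever follows: giving characters back via
    backtracking can then never produce a match that greedy matching
    missed."""
    return not (set(loop_chars) & set(following_chars))
```

For the LinePrefix pattern, whitespace and [A-Za-z0-9#,] are disjoint, so once the trailing min-count-0 node is looked past, the \s* loop qualifies.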

@stephentoub

We've implemented most of this, some of these aren't worth pursuing, we have existing issues for most of the rest, and I've just opened a few remaining issues. As such, I'm going to close this.

MichalStrehovsky added a commit to MichalStrehovsky/runtime that referenced this issue Dec 9, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Jan 5, 2022