More Regex perf improvements #1348

Merged · 30 commits · Jan 9, 2020

Conversation

stephentoub (Member) commented Jan 7, 2020

Found some time over my vacation to play more with our regex implementation.

Primary changes here:

  • Auto-atomicification. Other regex libraries refer to this as auto-possessification, but .NET doesn't have a possessive quantifier, so I'm referring to it as atomicification instead :-), since possessive quantifiers are just shorthand for atomic groups. Backtracking is a huge source of cost in regex handling, in particular when matching fails, and it's really easy to write regular expressions that backtrack even when they don't have to. For example, consider the regex \d{1,2}\/\d{1,2}\/\d{2,4} as an approach to parsing a date: each of those repeaters will backtrack when a match fails, even though such backtracking is useless; since a / can never match \d, there's no way here that backtracking will help. A developer could instead write this using atomic groups, e.g. (?>\d{1,2})\/(?>\d{1,2})\/(?>\d{2,4}), but few developers think to do that. Instead, we can detect these conditions and automatically change the regex to make these loops atomic (see the sketch after this list).
  • Primitives for atomic one, notone, and set loops. To reduce the number of atomic nodes in the tree and to more easily communicate the atomicity of these commonly found loops to the backends, I added atomic versions of the oneloop, notoneloop, and setloop primitives, which are then recognized in the backends. In addition to avoiding backtracking, we can generate better code even for the non-backtracking case. For example, prior to this PR an atomic node around a loop would result in the loop still outputting backtracking state, with the atomic node just causing that state to be skipped over by anyone backtracking; for these primitives we can instead avoid outputting that state to begin with.
  • Specialized code-gen for non-backtracking regexes. It turns out we generate a lot of boilerplate code to handle the possibility of backtracking, even if we never end up backtracking. And if you look at the code generated, it's very hard to follow / nothing like what you'd write if you were writing a custom implementation by hand. If we instead focus only on non-backtracking constructs, we can generate much better / smaller code, akin to what you'd write if you were matching the regex by hand. I added a check to analyze the RegexNode tree to see whether we can support it, and then added a custom code path to generate the better code for these trees. This is helped significantly by the auto-atomicification, as it makes it more likely that a regex will be supported. This already handles a bunch of regexes, but additional work can be done to make it support more, e.g. the main thing it's missing that we can easily add is support for capture groups. As an example, the previously mentioned date regex \d{1,2}\/\d{1,2}\/\d{2,4}: with the code in master, the JIT ends up generating 2137 bytes of asm for the Go method, and with this PR, that shrinks to 758 bytes. As another example, consider a regex that just matches the string "abcdefghijklmnopqrstuvwxyz". With master, that results in a Go method that's 1204 bytes of asm, and just looking at the first six character checks for the unrolled loop portion, you can see why it's so much larger:
       6642837C410C61       cmp      word  ptr [rcx+2*r8+12], 97
       0F8527030000         jne      G_M56610_IG09
       448D4701             lea      r8d, [rdi+1]
       453BC1               cmp      r8d, r9d
       0F83F4030000         jae      G_M56610_IG17
       4D63C0               movsxd   r8, r8d
       6642837C410C62       cmp      word  ptr [rcx+2*r8+12], 98
       0F850A030000         jne      G_M56610_IG09
       448D4702             lea      r8d, [rdi+2]
       453BC1               cmp      r8d, r9d
       0F83D7030000         jae      G_M56610_IG17
       4D63C0               movsxd   r8, r8d
       6642837C410C63       cmp      word  ptr [rcx+2*r8+12], 99
       0F85ED020000         jne      G_M56610_IG09
       448D4703             lea      r8d, [rdi+3]
       453BC1               cmp      r8d, r9d
       0F83BA030000         jae      G_M56610_IG17
       4D63C0               movsxd   r8, r8d
       6642837C410C64       cmp      word  ptr [rcx+2*r8+12], 100
       0F85D0020000         jne      G_M56610_IG09
       448D4704             lea      r8d, [rdi+4]
       453BC1               cmp      r8d, r9d
       0F839D030000         jae      G_M56610_IG17
       4D63C0               movsxd   r8, r8d
       6642837C410C65       cmp      word  ptr [rcx+2*r8+12], 101
       0F85B3020000         jne      G_M56610_IG09
       448D4705             lea      r8d, [rdi+5]
       453BC1               cmp      r8d, r9d
       0F8380030000         jae      G_M56610_IG17
       4D63C0               movsxd   r8, r8d
       6642837C410C66       cmp      word  ptr [rcx+2*r8+12], 102
       0F8596020000         jne      G_M56610_IG09
       448D4706             lea      r8d, [rdi+6]
       453BC1               cmp      r8d, r9d
       0F8363030000         jae      G_M56610_IG17
       4D63C0               movsxd   r8, r8d

In contrast, with this PR, it's only 384 bytes, and the matching for the first six characters looks instead like this:

       6641833961           cmp      word  ptr [r9], 97
       0F8507010000         jne      G_M56613_IG10
       664183790262         cmp      word  ptr [r9+2], 98
       0F85FB000000         jne      G_M56613_IG10
       664183790463         cmp      word  ptr [r9+4], 99
       0F85EF000000         jne      G_M56613_IG10
       664183790664         cmp      word  ptr [r9+6], 100
       0F85E3000000         jne      G_M56613_IG10
       664183790865         cmp      word  ptr [r9+8], 101
       0F85D7000000         jne      G_M56613_IG10
       664183790A66         cmp      word  ptr [r9+10], 102
       0F85CB000000         jne      G_M56613_IG10
  • Vectorized searching. FindFirstChar often ends up in a loop looking for a character; we can use string.IndexOf instead. It's also common to have small sets that contain just two or three characters; we now use IndexOfAny for those. Such changes are most helpful when the input is long and the target characters are infrequent. We now also use IndexOf in some cases in Go. For example, a subexpression like .* will end up searching for \n, and in the non-backtracking implementation we now use Span.IndexOf for that (a sketch follows after this list).
  • Singleton optimization. This is an optimization that converts a character class containing only a single character into a "one" node, which has optimized handling. I'd erroneously removed this optimization in a previous change, on the bad assumption that you'd only end up with one of these if someone manually created such a set in their regex... but the implementation itself creates these, and in common situations. I then went beyond reverting that and added special code-gen for it to EmitCallCharInClass.
  • Delayed JIT'ing. This addresses a regression I introduced as part of a previous change to reduce allocations. In that change I got delegates from the DynamicMethods as part of constructing the compiled Regex, so as to avoid needing to allocate a new delegate every time we ran the regex and couldn't use a cached instance. But in doing so, I ended up causing these methods to be JIT'd as part of constructing the Regex. This can hurt things like start-up, as we'd end up JIT'ing the code for a Regex created in a static cctor, and it can also reduce the parallelism that's otherwise possible when individual regexes are invoked for the first time concurrently.
  • EnsureStorage call. On every trip through the backtracking jump table, we were calling EnsureStorage, which checks the fields on the runner to see if the various arrays need to grow. But that means that before the call to EnsureStorage we need to spill our locals out to the fields, so that it can examine them, and then after the call we need to reload those fields back into our locals. We can instead generate the checks that EnsureStorage does, but against our locals rather than the fields, and then we only need to do the spilling if we actually need to grow the arrays.
  • Allocation reduction in RegexNode tree and RegexWriter. Most intermediate nodes in the tree only ever have 0 or 1 child, and for the 1-child case we were creating a list presized to contain 4 elements. We now allow a single child to be stored directly, and only create the list when adding more than one child. Also, any operation in RegexWriter that involves strings (strings for multi nodes, strings for sets, etc.) was storing those strings in both a dictionary and a list... it turns out the latter was unnecessary, so I got rid of it.
  • Tests. I re-enabled the netfx test configuration, to serve as an oracle and to help ensure we're not regressing. I also ported hundreds of test cases from mono, which it appears were in turn ported from Perl. This caught a few functional regressions introduced in some of the recent perf improvements, so I fixed those up.
  • Terminology. One thing that was really annoying me was terminology used in the implementation for atomic groups. For whatever reason, the implementation was referring to them as "Greedy" groups, which is then easily confused with the greediness of a loop like .* as opposed to a lazy one like .*?. I've renamed all of the relevant usage in the implementation from greedy to atomic, leaving the use of greedy in cases where it was referring to the greedy vs lazy case. I think we should also update the docs to refer to atomic groups / subexpressions rather than greedy groups / subexpressions, as the former is established terminology.
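
To make the auto-atomicification bullet concrete, here's a minimal sketch (inputs are illustrative) showing that the hand-atomicized form a developer could write today behaves identically to the plain form; the optimization simply performs that rewrite automatically when it can prove backtracking can't change the outcome:

```C#
using System;
using System.Text.RegularExpressions;

class AtomicDateDemo
{
    static void Main()
    {
        // The pattern as typically written; every repeater records backtracking state.
        var plain  = new Regex(@"\d{1,2}\/\d{1,2}\/\d{2,4}", RegexOptions.Compiled);

        // The hand-atomicized equivalent; (?>...) forbids backtracking into each repeater.
        var atomic = new Regex(@"(?>\d{1,2})\/(?>\d{1,2})\/(?>\d{2,4})", RegexOptions.Compiled);

        // Because '/' can never match \d, backtracking into the repeaters can never
        // change the outcome, so the two patterns match and fail identically...
        Console.WriteLine(plain.IsMatch("1/7/2020"));  // True
        Console.WriteLine(atomic.IsMatch("1/7/2020")); // True
        Console.WriteLine(plain.IsMatch("12/345"));    // False
        Console.WriteLine(atomic.IsMatch("12/345"));   // False
        // ...but the atomic form does no useless backtracking on the failing input.
    }
}
```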
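And here's a hedged sketch of the kind of vectorized scanning described in the vectorized-searching bullet; the helper and inputs below are illustrative, not the actual generated FindFirstChar code:

```C#
using System;

class VectorizedScanSketch
{
    // Hypothetical FindFirstChar-style helper: jump to the next position where a match
    // could possibly start, assuming the pattern can only begin with 'a', 'b', or 'c'.
    static bool TryFindStart(ReadOnlySpan<char> text, ref int pos)
    {
        int i = text.Slice(pos).IndexOfAny('a', 'b', 'c');
        if (i < 0)
        {
            pos = text.Length; // no possible start anywhere; give up without rescanning
            return false;
        }
        pos += i;
        return true;
    }

    static void Main()
    {
        ReadOnlySpan<char> input = "....!!..b-rest";
        int pos = 0;
        Console.WriteLine(TryFindStart(input, ref pos)); // True
        Console.WriteLine(pos);                          // 8 (the 'b')

        // A subexpression like .* consumes up to the next '\n'; in the non-backtracking
        // path that's a single vectorized IndexOf over the remaining span.
        int newline = input.Slice(pos).IndexOf('\n');
        int consumed = newline < 0 ? input.Length - pos : newline;
        Console.WriteLine(consumed);                     // 6 (rest of the line)
    }
}
```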

Other minor changes:

  • Codegen tweaks. For example, the JIT generates slightly better code for span[i] = x; i++; than it does for span[i++] = x;. Since we're manually writing out the IL, we can just emit the equivalent of the former instead of the latter.
  • Style fixes. Tried to keep these in separate commits.

Running the perf tests from dotnet/performance#1117 on .NET Core 3.1, master, and this PR:

| Method | Toolchain | Options | Mean | Error | StdDev | Ratio | Allocated |
|--------|-----------|---------|-----:|------:|-------:|------:|----------:|
| Email_IsMatch | netcore31 | None | 623.02 ns | 3.051 ns | 2.704 ns | 1.00 | - |
| Email_IsMatch | master | None | 468.31 ns | 2.477 ns | 2.068 ns | 0.75 | - |
| Email_IsMatch | pr | None | 418.27 ns | 1.983 ns | 1.758 ns | 0.67 | - |
| Email_IsNotMatch | netcore31 | None | 1,568.18 ns | 11.984 ns | 10.008 ns | 1.00 | - |
| Email_IsNotMatch | master | None | 1,290.24 ns | 10.016 ns | 8.364 ns | 0.82 | - |
| Email_IsNotMatch | pr | None | 754.01 ns | 4.181 ns | 3.911 ns | 0.48 | - |
| Date_IsMatch | netcore31 | None | 444.95 ns | 1.403 ns | 1.095 ns | 1.00 | - |
| Date_IsMatch | master | None | 195.94 ns | 0.877 ns | 0.777 ns | 0.44 | - |
| Date_IsMatch | pr | None | 184.96 ns | 0.622 ns | 0.552 ns | 0.42 | - |
| Date_IsNotMatch | netcore31 | None | 1,500.69 ns | 7.065 ns | 6.263 ns | 1.00 | - |
| Date_IsNotMatch | master | None | 743.07 ns | 2.210 ns | 2.067 ns | 0.50 | - |
| Date_IsNotMatch | pr | None | 590.19 ns | 2.947 ns | 2.757 ns | 0.39 | - |
| IP_IsMatch | netcore31 | None | 2,802.70 ns | 4.802 ns | 4.256 ns | 1.00 | - |
| IP_IsMatch | master | None | 1,999.82 ns | 19.084 ns | 16.918 ns | 0.71 | - |
| IP_IsMatch | pr | None | 2,005.45 ns | 7.103 ns | 6.644 ns | 0.72 | - |
| IP_IsNotMatch | netcore31 | None | 2,505.86 ns | 10.129 ns | 9.475 ns | 1.00 | - |
| IP_IsNotMatch | master | None | 1,830.37 ns | 7.060 ns | 6.603 ns | 0.73 | - |
| IP_IsNotMatch | pr | None | 1,858.30 ns | 8.521 ns | 7.971 ns | 0.74 | - |
| Uri_IsMatch | netcore31 | None | 560.87 ns | 2.793 ns | 2.612 ns | 1.00 | - |
| Uri_IsMatch | master | None | 290.85 ns | 1.052 ns | 0.878 ns | 0.52 | - |
| Uri_IsMatch | pr | None | 281.03 ns | 1.381 ns | 1.291 ns | 0.50 | - |
| Uri_IsNotMatch | netcore31 | None | 2,549.70 ns | 14.587 ns | 13.644 ns | 1.00 | - |
| Uri_IsNotMatch | master | None | 1,307.79 ns | 14.294 ns | 11.936 ns | 0.51 | - |
| Uri_IsNotMatch | pr | None | 1,135.40 ns | 4.717 ns | 4.412 ns | 0.45 | - |
| MatchesSet | netcore31 | None | 485,601.08 ns | 2,335.707 ns | 1,950.421 ns | 1.00 | 6704 B |
| MatchesSet | master | None | 177,524.14 ns | 922.807 ns | 863.195 ns | 0.37 | 6704 B |
| MatchesSet | pr | None | 174,293.63 ns | 999.666 ns | 834.766 ns | 0.36 | 6704 B |
| MatchesBoundary | netcore31 | None | 393,404.51 ns | 2,637.810 ns | 2,202.690 ns | 1.00 | 6704 B |
| MatchesBoundary | master | None | 131,526.87 ns | 736.618 ns | 652.993 ns | 0.33 | 6704 B |
| MatchesBoundary | pr | None | 135,595.54 ns | 1,538.993 ns | 1,364.277 ns | 0.34 | 6704 B |
| MatchesWord | netcore31 | None | 3,639.83 ns | 20.716 ns | 18.364 ns | 1.00 | 1488 B |
| MatchesWord | master | None | 3,531.38 ns | 16.601 ns | 15.529 ns | 0.97 | 1488 B |
| MatchesWord | pr | None | 3,501.90 ns | 15.131 ns | 13.413 ns | 0.96 | 1488 B |
| MatchesWords | netcore31 | None | 104,433.48 ns | 525.353 ns | 491.415 ns | 1.00 | 3512 B |
| MatchesWords | master | None | 66,421.70 ns | 253.156 ns | 224.416 ns | 0.64 | 3512 B |
| MatchesWords | pr | None | 65,292.04 ns | 305.344 ns | 285.619 ns | 0.63 | 3512 B |
| MatchWord | netcore31 | None | 2,660.92 ns | 16.474 ns | 15.410 ns | 1.00 | 208 B |
| MatchWord | master | None | 1,732.68 ns | 11.956 ns | 10.599 ns | 0.65 | 208 B |
| MatchWord | pr | None | 1,780.01 ns | 35.347 ns | 31.334 ns | 0.67 | 208 B |
| ReplaceWords | netcore31 | None | 114,003.27 ns | 2,144.486 ns | 2,005.954 ns | 1.00 | 9970 B |
| ReplaceWords | master | None | 67,036.62 ns | 506.997 ns | 474.245 ns | 0.59 | 9968 B |
| ReplaceWords | pr | None | 66,631.49 ns | 501.255 ns | 468.875 ns | 0.58 | 9968 B |
| SplitWords | netcore31 | None | 107,249.50 ns | 699.472 ns | 654.287 ns | 1.00 | 11152 B |
| SplitWords | master | None | 66,851.06 ns | 313.541 ns | 277.946 ns | 0.62 | 11152 B |
| SplitWords | pr | None | 68,735.01 ns | 356.457 ns | 333.430 ns | 0.64 | 11152 B |
| Ctor | netcore31 | None | 6,103.89 ns | 40.322 ns | 37.718 ns | 1.00 | 10712 B |
| Ctor | master | None | 5,475.36 ns | 58.453 ns | 54.677 ns | 0.90 | 9752 B |
| Ctor | pr | None | 5,723.42 ns | 26.869 ns | 23.818 ns | 0.94 | 7560 B |
| CtorInvoke | netcore31 | None | 6,714.64 ns | 38.538 ns | 36.049 ns | 1.00 | 13208 B |
| CtorInvoke | master | None | 6,203.35 ns | 39.082 ns | 34.646 ns | 0.92 | 12304 B |
| CtorInvoke | pr | None | 6,710.59 ns | 33.666 ns | 31.491 ns | 1.00 | 9856 B |
| Email_IsMatch | netcore31 | Compiled | 401.20 ns | 2.684 ns | 2.511 ns | 1.00 | - |
| Email_IsMatch | master | Compiled | 156.48 ns | 0.652 ns | 0.610 ns | 0.39 | - |
| Email_IsMatch | pr | Compiled | 146.44 ns | 0.600 ns | 0.469 ns | 0.37 | - |
| Email_IsNotMatch | netcore31 | Compiled | 760.27 ns | 1.364 ns | 1.139 ns | 1.00 | - |
| Email_IsNotMatch | master | Compiled | 417.07 ns | 1.628 ns | 1.522 ns | 0.55 | - |
| Email_IsNotMatch | pr | Compiled | 212.55 ns | 1.259 ns | 1.116 ns | 0.28 | - |
| Date_IsMatch | netcore31 | Compiled | 353.99 ns | 1.439 ns | 1.346 ns | 1.00 | - |
| Date_IsMatch | master | Compiled | 76.42 ns | 0.438 ns | 0.410 ns | 0.22 | - |
| Date_IsMatch | pr | Compiled | 69.44 ns | 0.164 ns | 0.153 ns | 0.20 | - |
| Date_IsNotMatch | netcore31 | Compiled | 1,102.17 ns | 13.635 ns | 12.754 ns | 1.00 | - |
| Date_IsNotMatch | master | Compiled | 241.03 ns | 2.021 ns | 1.791 ns | 0.22 | - |
| Date_IsNotMatch | pr | Compiled | 160.44 ns | 0.971 ns | 0.908 ns | 0.15 | - |
| IP_IsMatch | netcore31 | Compiled | 1,206.32 ns | 5.444 ns | 4.826 ns | 1.00 | - |
| IP_IsMatch | master | Compiled | 383.81 ns | 1.530 ns | 1.431 ns | 0.32 | - |
| IP_IsMatch | pr | Compiled | 360.25 ns | 2.052 ns | 1.819 ns | 0.30 | - |
| IP_IsNotMatch | netcore31 | Compiled | 1,150.44 ns | 4.192 ns | 3.922 ns | 1.00 | - |
| IP_IsNotMatch | master | Compiled | 362.96 ns | 2.240 ns | 1.986 ns | 0.32 | - |
| IP_IsNotMatch | pr | Compiled | 335.52 ns | 2.375 ns | 2.222 ns | 0.29 | - |
| Uri_IsMatch | netcore31 | Compiled | 387.61 ns | 0.343 ns | 0.286 ns | 1.00 | - |
| Uri_IsMatch | master | Compiled | 103.85 ns | 0.447 ns | 0.397 ns | 0.27 | - |
| Uri_IsMatch | pr | Compiled | 94.90 ns | 0.500 ns | 0.468 ns | 0.24 | - |
| Uri_IsNotMatch | netcore31 | Compiled | 1,752.44 ns | 10.116 ns | 9.462 ns | 1.00 | - |
| Uri_IsNotMatch | master | Compiled | 324.73 ns | 1.771 ns | 1.656 ns | 0.19 | - |
| Uri_IsNotMatch | pr | Compiled | 271.42 ns | 0.939 ns | 0.879 ns | 0.15 | - |
| MatchesSet | netcore31 | Compiled | 384,247.92 ns | 874.748 ns | 775.442 ns | 1.00 | 6705 B |
| MatchesSet | master | Compiled | 62,538.17 ns | 386.138 ns | 361.193 ns | 0.16 | 6704 B |
| MatchesSet | pr | Compiled | 45,438.50 ns | 377.098 ns | 352.738 ns | 0.12 | 6704 B |
| MatchesBoundary | netcore31 | Compiled | 296,852.48 ns | 501.091 ns | 468.721 ns | 1.00 | 6705 B |
| MatchesBoundary | master | Compiled | 67,112.18 ns | 296.617 ns | 277.456 ns | 0.23 | 6704 B |
| MatchesBoundary | pr | Compiled | 54,094.37 ns | 335.814 ns | 297.691 ns | 0.18 | 6704 B |
| MatchesWord | netcore31 | Compiled | 1,548.42 ns | 13.773 ns | 11.501 ns | 1.00 | 1488 B |
| MatchesWord | master | Compiled | 1,539.23 ns | 12.271 ns | 10.878 ns | 1.00 | 1488 B |
| MatchesWord | pr | Compiled | 1,501.77 ns | 13.623 ns | 12.076 ns | 0.97 | 1488 B |
| MatchesWords | netcore31 | Compiled | 58,428.19 ns | 264.911 ns | 247.798 ns | 1.00 | 3512 B |
| MatchesWords | master | Compiled | 20,576.37 ns | 128.042 ns | 119.771 ns | 0.35 | 3512 B |
| MatchesWords | pr | Compiled | 11,337.64 ns | 109.763 ns | 102.672 ns | 0.19 | 3512 B |
| MatchWord | netcore31 | Compiled | 1,317.14 ns | 7.897 ns | 7.386 ns | 1.00 | 208 B |
| MatchWord | master | Compiled | 458.43 ns | 3.379 ns | 3.161 ns | 0.35 | 208 B |
| MatchWord | pr | Compiled | 306.76 ns | 3.609 ns | 3.014 ns | 0.23 | 208 B |
| ReplaceWords | netcore31 | Compiled | 59,539.32 ns | 344.087 ns | 321.859 ns | 1.00 | 9968 B |
| ReplaceWords | master | Compiled | 22,202.34 ns | 100.450 ns | 89.047 ns | 0.37 | 9968 B |
| ReplaceWords | pr | Compiled | 12,029.63 ns | 173.355 ns | 162.156 ns | 0.20 | 9968 B |
| SplitWords | netcore31 | Compiled | 59,326.27 ns | 289.645 ns | 256.762 ns | 1.00 | 11152 B |
| SplitWords | master | Compiled | 22,158.89 ns | 86.230 ns | 80.660 ns | 0.37 | 11152 B |
| SplitWords | pr | Compiled | 11,975.59 ns | 171.859 ns | 160.757 ns | 0.20 | 11152 B |
| Ctor | netcore31 | Compiled | 33,124.78 ns | 249.868 ns | 233.727 ns | 1.00 | 36256 B |
| Ctor | master | Compiled | 45,728.61 ns | 606.260 ns | 567.096 ns | 1.38 | 34415 B |
| Ctor | pr | Compiled | 26,362.50 ns | 175.571 ns | 155.639 ns | 0.80 | 27976 B |
| CtorInvoke | netcore31 | Compiled | 6,478,557.36 ns | 58,365.438 ns | 54,595.066 ns | 1.00 | 43271 B |
| CtorInvoke | master | Compiled | 4,980,893.08 ns | 22,577.260 ns | 20,014.152 ns | 0.77 | 36825 B |
| CtorInvoke | pr | Compiled | 4,887,560.49 ns | 15,232.326 ns | 13,503.060 ns | 0.75 | 33329 B |
| Email_IsMatch | netcore31 | IgnoreCase, Compiled | 524.15 ns | 3.031 ns | 2.835 ns | 1.00 | - |
| Email_IsMatch | master | IgnoreCase, Compiled | 314.26 ns | 0.355 ns | 0.296 ns | 0.60 | - |
| Email_IsMatch | pr | IgnoreCase, Compiled | 292.36 ns | 1.502 ns | 1.405 ns | 0.56 | - |
| Email_IsNotMatch | netcore31 | IgnoreCase, Compiled | 1,036.26 ns | 4.353 ns | 4.072 ns | 1.00 | - |
| Email_IsNotMatch | master | IgnoreCase, Compiled | 694.80 ns | 4.730 ns | 3.950 ns | 0.67 | - |
| Email_IsNotMatch | pr | IgnoreCase, Compiled | 371.25 ns | 2.063 ns | 1.929 ns | 0.36 | - |
| Date_IsMatch | netcore31 | IgnoreCase, Compiled | 445.45 ns | 1.993 ns | 1.865 ns | 1.00 | - |
| Date_IsMatch | master | IgnoreCase, Compiled | 100.64 ns | 0.325 ns | 0.304 ns | 0.23 | - |
| Date_IsMatch | pr | IgnoreCase, Compiled | 90.90 ns | 0.681 ns | 0.637 ns | 0.20 | - |
| Date_IsNotMatch | netcore31 | IgnoreCase, Compiled | 1,290.41 ns | 2.768 ns | 2.453 ns | 1.00 | - |
| Date_IsNotMatch | master | IgnoreCase, Compiled | 361.69 ns | 0.650 ns | 0.543 ns | 0.28 | - |
| Date_IsNotMatch | pr | IgnoreCase, Compiled | 255.74 ns | 1.706 ns | 1.596 ns | 0.20 | - |
| IP_IsMatch | netcore31 | IgnoreCase, Compiled | 1,752.59 ns | 3.323 ns | 2.945 ns | 1.00 | - |
| IP_IsMatch | master | IgnoreCase, Compiled | 1,107.62 ns | 5.786 ns | 5.412 ns | 0.63 | - |
| IP_IsMatch | pr | IgnoreCase, Compiled | 1,044.35 ns | 5.011 ns | 4.687 ns | 0.60 | - |
| IP_IsNotMatch | netcore31 | IgnoreCase, Compiled | 1,693.27 ns | 4.067 ns | 3.605 ns | 1.00 | - |
| IP_IsNotMatch | master | IgnoreCase, Compiled | 1,095.16 ns | 6.315 ns | 5.907 ns | 0.65 | - |
| IP_IsNotMatch | pr | IgnoreCase, Compiled | 1,040.71 ns | 4.134 ns | 3.867 ns | 0.61 | - |
| Uri_IsMatch | netcore31 | IgnoreCase, Compiled | 509.52 ns | 3.097 ns | 2.745 ns | 1.00 | - |
| Uri_IsMatch | master | IgnoreCase, Compiled | 195.40 ns | 0.830 ns | 0.776 ns | 0.38 | - |
| Uri_IsMatch | pr | IgnoreCase, Compiled | 194.83 ns | 2.642 ns | 2.342 ns | 0.38 | - |
| Uri_IsNotMatch | netcore31 | IgnoreCase, Compiled | 2,253.63 ns | 10.972 ns | 9.727 ns | 1.00 | - |
| Uri_IsNotMatch | master | IgnoreCase, Compiled | 936.70 ns | 6.150 ns | 5.452 ns | 0.42 | - |
| Uri_IsNotMatch | pr | IgnoreCase, Compiled | 839.51 ns | 7.670 ns | 6.405 ns | 0.37 | - |
| MatchesSet | netcore31 | IgnoreCase, Compiled | 431,135.91 ns | 2,740.110 ns | 2,563.100 ns | 1.00 | 6705 B |
| MatchesSet | master | IgnoreCase, Compiled | 152,123.57 ns | 1,286.211 ns | 1,203.122 ns | 0.35 | 6704 B |
| MatchesSet | pr | IgnoreCase, Compiled | 130,385.23 ns | 619.477 ns | 579.459 ns | 0.30 | 6704 B |
| MatchesBoundary | netcore31 | IgnoreCase, Compiled | 344,429.58 ns | 1,807.146 ns | 1,690.406 ns | 1.00 | 6705 B |
| MatchesBoundary | master | IgnoreCase, Compiled | 112,308.73 ns | 623.921 ns | 553.089 ns | 0.33 | 6705 B |
| MatchesBoundary | pr | IgnoreCase, Compiled | 110,686.87 ns | 1,684.881 ns | 1,576.039 ns | 0.32 | 6704 B |
| MatchesWord | netcore31 | IgnoreCase, Compiled | 7,405.59 ns | 47.152 ns | 44.106 ns | 1.00 | 1488 B |
| MatchesWord | master | IgnoreCase, Compiled | 7,137.93 ns | 24.059 ns | 20.090 ns | 0.96 | 1488 B |
| MatchesWord | pr | IgnoreCase, Compiled | 7,094.48 ns | 74.635 ns | 69.813 ns | 0.96 | 1488 B |
| MatchesWords | netcore31 | IgnoreCase, Compiled | 86,403.27 ns | 525.438 ns | 491.495 ns | 1.00 | 3512 B |
| MatchesWords | master | IgnoreCase, Compiled | 52,272.24 ns | 357.391 ns | 316.818 ns | 0.61 | 3512 B |
| MatchesWords | pr | IgnoreCase, Compiled | 45,188.72 ns | 205.831 ns | 182.463 ns | 0.52 | 3512 B |
| MatchWord | netcore31 | IgnoreCase, Compiled | 2,144.65 ns | 4.196 ns | 3.925 ns | 1.00 | 208 B |
| MatchWord | master | IgnoreCase, Compiled | 1,359.17 ns | 8.709 ns | 8.146 ns | 0.63 | 208 B |
| MatchWord | pr | IgnoreCase, Compiled | 1,183.68 ns | 7.559 ns | 7.070 ns | 0.55 | 208 B |
| ReplaceWords | netcore31 | IgnoreCase, Compiled | 86,858.44 ns | 189.816 ns | 177.554 ns | 1.00 | 9968 B |
| ReplaceWords | master | IgnoreCase, Compiled | 53,635.09 ns | 275.973 ns | 230.450 ns | 0.62 | 9968 B |
| ReplaceWords | pr | IgnoreCase, Compiled | 47,319.73 ns | 281.218 ns | 263.051 ns | 0.54 | 9968 B |
| SplitWords | netcore31 | IgnoreCase, Compiled | 88,568.31 ns | 466.583 ns | 436.442 ns | 1.00 | 11152 B |
| SplitWords | master | IgnoreCase, Compiled | 53,237.67 ns | 418.119 ns | 370.651 ns | 0.60 | 11152 B |
| SplitWords | pr | IgnoreCase, Compiled | 46,646.75 ns | 868.290 ns | 852.777 ns | 0.53 | 11152 B |
| Ctor | netcore31 | IgnoreCase, Compiled | 35,398.57 ns | 369.582 ns | 327.625 ns | 1.00 | 37072 B |
| Ctor | master | IgnoreCase, Compiled | 53,360.30 ns | 474.193 ns | 443.561 ns | 1.51 | 36919 B |
| Ctor | pr | IgnoreCase, Compiled | 31,597.71 ns | 165.355 ns | 146.583 ns | 0.89 | 30032 B |
| CtorInvoke | netcore31 | IgnoreCase, Compiled | 6,761,415.56 ns | 81,006.588 ns | 75,773.613 ns | 1.00 | 44296 B |
| CtorInvoke | master | IgnoreCase, Compiled | 5,628,468.33 ns | 43,293.714 ns | 40,496.967 ns | 0.83 | 39313 B |
| CtorInvoke | pr | IgnoreCase, Compiled | 5,432,093.33 ns | 49,266.756 ns | 46,084.154 ns | 0.80 | 35696 B |

I also tried out the regex redux code from the perf repo, but on the full input5000000.txt input rather than the much reduced one it uses. Results:

  • netcoreapp31: 4.92s
  • master: 3.04s
  • pr: 2.69s

There is still a lot more that can / should be done. I've opened an issue (#1349) to track that work, and to hopefully enlist help in working through the list, and also finding additional improvement opportunities.

cc: @danmosemsft, @ViktorHofer, @eerhardt, @davidwrighton

stephentoub (Member, Author)

@wfurt, several legs are failing here with a stack overflow in the NetworkInformation tests. Can you please take a look?

wfurt (Member) commented Jan 7, 2020

NetworkInformation seems busted across the board. I will take a look. As for the OSX OpenSSL failures, this may be the version mismatch discussed in #1129.

danmoseley (Member)

@stephentoub there are several independent optimizations here -- would it make sense to merge-commit this, rather than squash? If an issue occurs, it would be helpful to be able to bisect, assuming each commit is functional.

stephentoub (Member, Author)

> would it make sense to merge commit this, rather than squash? If an issue occurs, it would be helpful to be able to bisect, assuming each commit is functional

Up to you. My default would be to squash, but we can merge if you like. I can try to clean up the commits to squash some of them individually and make it so that they're actually bisect-able, but I'm not sure how much work that'll be.

danmoseley (Member)

> Up to you. My default would be to squash, but we can merge if you like. I can try to clean up the commits to squash some of them individually and make it so that they're actually bisect-able, but I'm not sure how much work that'll be.

It's large enough that I think it would be worth not squashing, and I think that still has value even if you don't feel like spending more time making them bisect-able. (It would just mean we would have to make them bisect-able at that point, but we'd have a start on it.)

pgovind (Contributor) left a comment

Just a couple comments.

stephentoub force-pushed the moreregexperf branch 2 times, most recently from 429d659 to ec7f4d3, on January 9, 2020 at 02:59
stephentoub (Member, Author) commented Jan 9, 2020

Just as a final example to highlight the differences in code gen...

Here's one of the regular expressions in the regex redux test: <[^>]*>

With master, the generated Go method decompiles into this:

string runtext = base.runtext;
int runtextstart = runtextstart;
int runtextbeg = runtextbeg;
int runtextend = base.runtextend;
int num = runtextpos;
int[] runtrack = base.runtrack;
int runtrackpos = base.runtrackpos;
int[] runstack = base.runstack;
int runstackpos = base.runstackpos;
runtrack[--runtrackpos] = num;
runtrack[--runtrackpos] = 0;
runstack[--runstackpos] = num;
runtrack[--runtrackpos] = 1;
int num4;
if (num<runtextend && runtext[num++] == '<')
{
	int num3;
	num4 = (num3 = runtextend - num) + 1;
	while (--num4 > 0)
	{
		if (runtext[num++] == '>')
		{
			num--;
			break;
		}
	}
	if (num3 > num4)
	{
		runtrack[--runtrackpos] = num3 - num4 - 1;
		runtrack[--runtrackpos] = num - 1;
		runtrack[--runtrackpos] = 2;
	}
	goto IL_00f2;
}
goto IL_0147;
IL_00f2:
if (num<runtextend && runtext[num++] == '>')
{
	num4 = runstack[runstackpos];
	Capture(0, num4, num);
	runtrack[--runtrackpos] = num4;
	runtrack[runtrackpos - 1] = 3;
	goto IL_013e;
}
goto IL_0147;
IL_0147:
while (true)
{
	base.runtrackpos = runtrackpos;
	base.runstackpos = runstackpos;
	EnsureStorage();
	runtrackpos = base.runtrackpos;
	runstackpos = base.runstackpos;
	runtrack = base.runtrack;
	runstack = base.runstack;
	switch (runtrack[runtrackpos++])
	{
	case 1:
		runstackpos++;
		continue;
	case 2:
		goto IL_01b8;
	case 3:
		runstack[--runstackpos] = runtrack[runtrackpos++];
		Uncapture();
		continue;
	}
	break;
}
num = runtrack[runtrackpos];
goto IL_013e;
IL_013e:
runtextpos = num;
return;
IL_01b8:
num = runtrack[runtrackpos++];
num4 = runtrack[runtrackpos++];
if (num4 > 0)
{
	runtrack[--runtrackpos] = num4 - 1;
	runtrack[--runtrackpos] = num - 1;
	runtrack[--runtrackpos] = 2;
}
goto IL_00f2;

With this PR, the generated Go method now decompiles into this:

string runtext = base.runtext;
int runtextpos;
int num = runtextpos = base.runtextpos;
ReadOnlySpan<char> readOnlySpan = runtext.AsSpan(num, runtextend - num);
if (0u < (uint)readOnlySpan.Length && readOnlySpan[0] == '<')
{
    int num2 = readOnlySpan.Slice(1).IndexOf<char>('>');
    if (num2 == -1)
    {
        num2 = readOnlySpan.Length - 1;
    }
    readOnlySpan = readOnlySpan.Slice(num2);
    num += num2;
    if (1u < (uint)readOnlySpan.Length && readOnlySpan[1] == '>')
    {
        base.runtextpos = num + 2;
        Capture(0, runtextpos, base.runtextpos);
    }
}

danmoseley (Member)

That’s way more readable... like I can actually read it.

I'd previously removed multiple delegate allocations that were being incurred every time we needed to run a compiled regex concurrently.  In doing so, however, I inadvertently forced JIT'ing to occur when the Regex was created rather than on first use; that can in turn lead to more start-up overhead for regexes that are created at start-up (e.g. to initialize a static field) but that aren't used at start-up, as well as reduce the potential parallelism that could otherwise occur for distinct regexes being run for the first time concurrently.  This addresses the issue by still caching the delegates, but in addition to, rather than instead of, the DynamicMethod objects.
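
A minimal sketch of the caching shape this commit describes; the type and member names below are illustrative, not the actual RegexCompiler members:

```C#
using System;
using System.Reflection.Emit;

internal sealed class CompiledRegexMethodsSketch
{
    private readonly DynamicMethod _goMethod;
    private Action _go; // delegate created (and the method JIT'd) only on first use

    public CompiledRegexMethodsSketch(DynamicMethod goMethod) => _goMethod = goMethod;

    // Benign race: two threads may each create a delegate on first use; both work.
    public Action Go => _go ??= (Action)_goMethod.CreateDelegate(typeof(Action));
}

class Program
{
    static void Main()
    {
        // Stand-in for the real compiled Go method: an empty dynamic method.
        var dm = new DynamicMethod("NoOp", typeof(void), Type.EmptyTypes);
        dm.GetILGenerator().Emit(OpCodes.Ret);

        var cached = new CompiledRegexMethodsSketch(dm);
        cached.Go(); // first access creates the delegate; later accesses reuse it
    }
}
```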
I had removed it because it seemed illogical that anyone would write such character classes.  But it turns out the implementation/parser itself generates them for certain forms, e.g. "hello|hithere" will create a character class containing just 'h' in order to find the start of the alternation.
- Only use ilg from "macro" helpers
- Move some blocks of code around
Takes advantage of vectorization in string.IndexOf.
The JIT generates slightly better code for:
```C#
text[pos] = ....;
pos++;
```
instead of:
```C#
text[pos++] = ...;
```
so use the IL equivalent of the former.
Only spill if we have to grow the relevant arrays.
In an expression like A*B, backtracking may be involved even if it's impossible that it'll do any good, namely if there's no overlap between the A and B sets.  This backtracking can be avoided by using a greedy subexpression, e.g. (?>A*)B, but a) few developers think to do that, and b) even when they do, the implementation has more overhead than is desirable.

This change does three things:
1. It introduces primitives for "oneloopgreedy" and "setloopgreedy" that correspond to the existing "oneloop" (a single character in a loop) and "setloop" (a char class in a loop) primitives.
2. It reduces a Greedy node that contains a one/setloop to instead just be a one/setloopgreedy node.
3. It finds cases of one/setloop concatenated with other things, and automatically updates those to be one/setloopgreedy when it can prove that the backtracking would have no benefit.

(2) removes a small amount of overhead, in particular in compiled, because it avoids needing to emit additional tracking state for backtracking.

(3) can likely be expanded in the future to catch more cases, but it handles the common cases now.  This has a very measurable impact on non-matches that would otherwise trigger backtracking into the subexpression.
When written as a* these are already transformed appropriately by the parser, but other reductions can expose them in a way that the parser won't see.  For example, (?:a)* ends up being a much more expensive regex even though it's identical functionally.  This reduces the latter to the former, for cases where a {Lazy}loop is found to directly wrap a Set/One/Notone.
When we can determine from the regex that backtracking isn't needed, we
can generate much smaller / better code in RegexCompiler.GenerateGo.
I initially mistakenly thought this wouldn't be useful, but it is.  For example, the expression ".*\n" can be made non-backtracking.
With all the changes being made in the regex implementation in this release, we want to make sure we don't regress, and a good way to help with that is to add a bunch of tests and ensure they pass on netfx as well, where appropriate.
The Boolean logic was previously wrong, leading to characters being included in the set even if they weren't meant to be, e.g. given the set [^bc], this should include everything other than b and c, but if invariant culture was being used, it incorrectly included B and C in the set.
The greedy naming is really confusing (we should fix it in the framework documentation as well).

In general regular expression nomenclature, "greedy" is used as the opposite of "lazy", indicating how much a loop initially consumes, e.g. does `a*` first try consuming as many 'a's as possible or does it first try consuming as few 'a's as possible... it's "greedy" (denoted as `a*`) if it consumes as many as possible and "lazy" (denoted as `a*?`) if it consumes as few.

How aggressively it consumes, however, is orthogonal to whether it backtracks.  Whereas `a*` is greedy and backtracking and `a*?` is lazy and backtracking, `(?>a*)` is greedy and non-backtracking and `(?>a*?)` is lazy and non-backtracking.

Unfortunately, the nomenclature in the implementation and the documentation describes the `(?> ... )` as being a "greedy subexpression", which then conflates the meaning of "greedy".

The rest of the industry refers to these instead as "atomic", so I've changed it to that in the implementation.
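
To illustrate that distinction, here's a hedged sketch (inputs are arbitrary): greediness controls how much a loop tries to consume first, while atomicity controls whether it will later give anything back.

```C#
using System;
using System.Text.RegularExpressions;

class GreedyLazyAtomicDemo
{
    static void Main()
    {
        // Greediness: how much the loop tries to consume first. All three match "aaab".
        Console.WriteLine(Regex.Match("aaab", @"a*b").Value);     // "aaab" (greedy)
        Console.WriteLine(Regex.Match("aaab", @"a*?b").Value);    // "aaab" (lazy)
        Console.WriteLine(Regex.Match("aaab", @"(?>a*)b").Value); // "aaab" (greedy + atomic)

        // Atomicity: whether the loop will give characters back. The atomic group
        // refuses to backtrack, so it never releases the final 'a' and the match fails.
        Console.WriteLine(Regex.IsMatch("aaa", @"a*a"));     // True  (backtracks once)
        Console.WriteLine(Regex.IsMatch("aaa", @"(?>a*)a")); // False (won't give one back)
    }
}
```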
- In Go, improve code generation for several constructs to reduce code size and increase bounds reduction in the non-backtracking compiler
- In FindFirstChars, use IndexOf instead of Boyer-Moore if the found prefix is only one character in length.
- Hide some debug-only functions from code coverage
stephentoub (Member, Author)

Merging with a merge commit per Dan's request

stephentoub merged commit 0434b87 into dotnet:master on Jan 9, 2020
stephentoub deleted the moreregexperf branch on January 9, 2020 at 10:42
stephentoub added the tenet-performance (Performance related issue) label on Jan 12, 2020
ghost locked as resolved and limited conversation to collaborators on Dec 11, 2020