Coalesce adjacent loops in concatenation RegexNodes #1838

stephentoub · 2020-01-16T23:15:14Z

This augments the reduction phase of concatenation nodes to combine adjacent one/notone/set loops, e.g. `a*a+a{1,2}b` becomes `a{2,}b` (previously added optimizations will then see that the `a` loop can be made atomic and replace it with the equivalent of `(?>a{2,})b`). This has several benefits. First, it simplifies the node tree, creating less work for the IR writer and for the interpreter/compiler. Second, it gives the compiler more opportunity to choose how the loop should be represented, when and how to unroll, etc. Third, it enables the auto-atomicity step to apply to more loops (as in the previous example). And most importantly, it can drastically reduce backtracking (especially with the atomicity optimization, but even without it). An expression like `a*a*a*a*a*a*b` run against an input like `aaaaaaaaaaaaaa` could previously take a very long time; now it's very fast, e.g.

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Text.RegularExpressions;

public class Program
{
    static void Main(string[] args) => BenchmarkSwitcher.FromAssemblies(new[] { typeof(Program).Assembly }).Run(args);

    private Regex _regex = new Regex(@"a*a*a*a*a*a*a*b", RegexOptions.Compiled);

    [Benchmark] public bool Test() => _regex.IsMatch("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa");
}

| Method | Toolchain | Mean | Error | StdDev | Ratio |
|--------|-----------|-----:|------:|-------:|------:|
| Test | \master\corerun.exe | 465,480,480.0 ns | 5,893,498.24 ns | 5,512,781.91 ns | 1.000 |
| Test | \pr\corerun.exe | 892.2 ns | 4.68 ns | 4.15 ns | 0.000 |

(Note that this lacks the anchoring optimization in #1706; once that's in, this example should be roughly another order of magnitude faster.)
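To make the coalescing rule concrete, here is a minimal sketch of the arithmetic involved: adjacent loops over the same character merge into one loop whose minimum is the sum of the minimums and whose maximum is the sum of the maximums (unbounded if either is unbounded). This is not the code in RegexNode.cs. `CharLoop`, `LoopCoalescer`, and `CoalesceDemo` are hypothetical names invented for illustration, and the sketch assumes greedy loops over a single literal character only, skipping any merge whose counts would overflow.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical stand-in for a greedy single-character loop node (not the real RegexNode).
public readonly struct CharLoop
{
    public CharLoop(char ch, int min, int max) { Ch = ch; Min = min; Max = max; }

    public char Ch { get; }
    public int Min { get; }
    public int Max { get; } // int.MaxValue stands for "unbounded"

    public override string ToString() =>
        Max == int.MaxValue ? $"{Ch}{{{Min},}}" : $"{Ch}{{{Min},{Max}}}";
}

public static class LoopCoalescer
{
    // Walks the concatenation left to right, folding each loop into the
    // previous one when both repeat the same character.
    public static List<CharLoop> Coalesce(IEnumerable<CharLoop> loops)
    {
        var result = new List<CharLoop>();
        foreach (CharLoop next in loops)
        {
            if (result.Count > 0 && TryMerge(result[result.Count - 1], next, out CharLoop merged))
                result[result.Count - 1] = merged;
            else
                result.Add(next);
        }
        return result;
    }

    private static bool TryMerge(CharLoop a, CharLoop b, out CharLoop merged)
    {
        merged = default;
        if (a.Ch != b.Ch) return false;

        // Sum in 64 bits so an overflowing combination is detected and left unmerged.
        long min = (long)a.Min + b.Min;
        if (min >= int.MaxValue) return false;

        bool unbounded = a.Max == int.MaxValue || b.Max == int.MaxValue;
        long max = unbounded ? int.MaxValue : (long)a.Max + b.Max;
        if (!unbounded && max >= int.MaxValue) return false;

        merged = new CharLoop(a.Ch, (int)min, (int)max);
        return true;
    }
}

public static class CoalesceDemo
{
    public static void Main()
    {
        var loops = new[]
        {
            new CharLoop('a', 0, int.MaxValue), // a*
            new CharLoop('a', 1, int.MaxValue), // a+
            new CharLoop('a', 1, 2),            // a{1,2}
            new CharLoop('b', 1, 1),            // b
        };
        Console.WriteLine(string.Concat(LoopCoalescer.Coalesce(loops))); // a{2,}b{1,1}

        // Spot-check that the original and coalesced patterns agree on a few inputs.
        for (int n = 0; n <= 5; n++)
        {
            string input = new string('a', n) + "b";
            bool original = Regex.IsMatch(input, @"^a*a+a{1,2}b$");
            bool coalesced = Regex.IsMatch(input, @"^a{2,}b$");
            Console.WriteLine($"{input}: {original} {coalesced}");
        }
    }
}
```

The spot-check loop only compares match/no-match on a handful of inputs; it illustrates the equivalence of the two patterns rather than proving it.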

Contributes to #1349
cc: @danmosemsft, @eerhardt, @pgovind, @ViktorHofer, @lpereira

@danmosemsft, FYI, I thought more about your suggestion and added some (limited) reflection-based tests.

safern · 2020-01-17T01:01:01Z

The Installer Build and Test errors are #1089, which we're trying to fix in #1835.

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexNode.cs


src/libraries/System.Text.RegularExpressions/tests/RegexReductionTests.cs

stephentoub added the area-System.Text.RegularExpressions and tenet-performance labels Jan 16, 2020

stephentoub added this to the 5.0 milestone Jan 16, 2020

pgovind reviewed Jan 17, 2020

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexNode.cs

stephentoub force-pushed the regexcoalesceloops branch from e667b6d to e659327 January 17, 2020 02:52

Follow-up from previous PRs this is now rebased on

7da39ac

stephentoub force-pushed the regexcoalesceloops branch from e659327 to 7da39ac January 17, 2020 04:15

danmoseley approved these changes Jan 17, 2020

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexNode.cs

danmoseley approved these changes Jan 17, 2020

src/libraries/System.Text.RegularExpressions/tests/RegexReductionTests.cs

pgovind approved these changes Jan 17, 2020

stephentoub merged commit b9da725 into dotnet:master Jan 17, 2020

stephentoub deleted the regexcoalesceloops branch January 17, 2020 11:25

ghost locked as resolved and limited conversation to collaborators Dec 11, 2020
