Coalesce adjacent loops in concatenation RegexNodes #1838

Merged (2 commits) on Jan 17, 2020

@stephentoub stephentoub commented Jan 16, 2020

This augments the reduction phase of concatenation nodes to combine adjacent one/notone/setloops, e.g. `a*a+a{1,2}b` becomes `a{2,}b` (previously added optimizations will then see that the `a` loop can be made atomic and replace it with the equivalent of `(?>a{2,})b`). This has several benefits. First, it simplifies the node tree, creating less work for the IR writer and for the interpreter/compiler. Second, it gives the compiler more opportunity to choose how the loop should be represented, when and how to unroll, etc. Third, it enables the auto-atomicity step to apply to more loops (as in the previous example). And most importantly, it can drastically reduce backtracking (especially with the atomicity optimization, but even without it). An expression like `a*a*a*a*a*a*b` run against an input like `aaaaaaaaaaaaaa` could previously take a very long time; now, it'll be very fast, e.g.

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Text.RegularExpressions;

public class Program
{
    static void Main(string[] args) => BenchmarkSwitcher.FromAssemblies(new[] { typeof(Program).Assembly }).Run(args);

    private Regex _regex = new Regex(@"a*a*a*a*a*a*a*b", RegexOptions.Compiled);

    [Benchmark] public bool Test() => _regex.IsMatch("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa");
}

| Method | Toolchain | Mean | Error | StdDev | Ratio |
|---|---|---:|---:|---:|---:|
| Test | \master\corerun.exe | 465,480,480.0 ns | 5,893,498.24 ns | 5,512,781.91 ns | 1.000 |
| Test | \pr\corerun.exe | 892.2 ns | 4.68 ns | 4.15 ns | 0.000 |

(Note that this is lacking the anchoring optimization in #1706... once that's in, this example should be another order of magnitude faster.)
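
To make the coalescing rule concrete, here is a minimal sketch of how the bounds of adjacent loops combine. It is not the actual `RegexNode` reduction code: `CharLoop` and `LoopCoalescer` are hypothetical names invented for illustration, and the sketch assumes every loop is greedy and matches a single literal character. Merging two adjacent loops over the same character adds their minimums and their maximums, with "unbounded" absorbing any sum.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified stand-in for a greedy single-character loop node.
// This is NOT the real RegexNode type; it exists only to show the bounds arithmetic.
public readonly struct CharLoop
{
    public const int Unbounded = int.MaxValue;

    public char Ch { get; }
    public int Min { get; }
    public int Max { get; }

    public CharLoop(char ch, int min, int max)
    {
        Ch = ch;
        Min = min;
        Max = max;
    }

    // Renders the loop back into pattern syntax, e.g. a*, a+, a{2,}, a{1,2}, b.
    public override string ToString()
    {
        if (Max == Unbounded)
        {
            if (Min == 0) return Ch + "*";
            if (Min == 1) return Ch + "+";
            return Ch + "{" + Min + ",}";
        }
        if (Min == Max) return Min == 1 ? Ch.ToString() : Ch + "{" + Min + "}";
        return Ch + "{" + Min + "," + Max + "}";
    }
}

public static class LoopCoalescer
{
    // Walks the concatenation left to right and folds each loop into the
    // previous one whenever both loop over the same character.
    public static List<CharLoop> Coalesce(IReadOnlyList<CharLoop> nodes)
    {
        var result = new List<CharLoop>();
        foreach (CharLoop next in nodes)
        {
            int last = result.Count - 1;
            if (last >= 0 && result[last].Ch == next.Ch)
            {
                CharLoop prev = result[last];
                result[last] = new CharLoop(
                    next.Ch,
                    SaturatingAdd(prev.Min, next.Min),
                    SaturatingAdd(prev.Max, next.Max));
            }
            else
            {
                result.Add(next);
            }
        }
        return result;
    }

    // Adds two bounds, treating Unbounded as infinity and clamping on overflow.
    private static int SaturatingAdd(int a, int b) =>
        a == CharLoop.Unbounded || b == CharLoop.Unbounded || a > CharLoop.Unbounded - b
            ? CharLoop.Unbounded
            : a + b;
}
```

Feeding it the loops for `a*`, `a+`, `a{1,2}` and `b` yields two nodes that render as `a{2,}b`, matching the example in the description. The real reduction also has to handle notone and set loops and decide which loop kinds may be merged at all, which this sketch leaves out.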

Contributes to #1349
cc: @danmosemsft, @eerhardt, @pgovind, @ViktorHofer, @lpereira

@danmosemsft, FYI, I thought more about your suggestion and added some (limited) reflection-based tests.

@stephentoub stephentoub added this to the 5.0 milestone Jan 16, 2020
safern commented Jan 17, 2020

The Installer Build and Test errors are #1089, which we're trying to fix in #1835.

@stephentoub stephentoub merged commit b9da725 into dotnet:master Jan 17, 2020
@stephentoub stephentoub deleted the regexcoalesceloops branch January 17, 2020 11:25
@ghost ghost locked as resolved and limited conversation to collaborators Dec 11, 2020