Factor common prefix text out of Regex alternations #2171

Merged: 3 commits, Jan 27, 2020

@stephentoub stephentoub commented Jan 24, 2020

Given a regex like "this|that|there", we will now factor out the common prefix into a new node concatenated with the alternation, e.g. "th(?:is|at|ere)". This has a few benefits: it exposes more text to FindFirstChar when the alternation is at the beginning of the sequence, it reduces backtracking, and it enables further reduction/optimization opportunities in the alternation.
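To make the transformation concrete, here is a minimal sketch of the idea applied to plain literal branches. It is not the code in this PR, which operates on the parsed regex node tree; the class and method names (`AlternationPrefixSketch`, `CommonPrefix`, `Factor`) are hypothetical, and the branches are assumed to contain no regex metacharacters.

```csharp
using System;
using System.Linq;

// Illustrative only: works on literal branch strings, not on the regex
// node tree that the actual change rewrites. All names are hypothetical.
public static class AlternationPrefixSketch
{
    // Returns the longest prefix shared by every branch (ordinal comparison).
    public static string CommonPrefix(string[] branches)
    {
        if (branches.Length == 0) return string.Empty;

        string prefix = branches[0];
        foreach (string branch in branches.Skip(1))
        {
            int i = 0;
            while (i < prefix.Length && i < branch.Length && prefix[i] == branch[i]) i++;
            prefix = prefix.Substring(0, i);
        }
        return prefix;
    }

    // Rewrites a literal alternation as "prefix(?:rest1|rest2|...)".
    // Branch order is preserved so the first matching branch still wins.
    public static string Factor(string[] branches)
    {
        // Nothing to factor with fewer than two branches.
        if (branches.Length < 2) return string.Join("|", branches);

        string prefix = CommonPrefix(branches);
        if (prefix.Length == 0) return string.Join("|", branches);

        var remainders = branches.Select(b => b.Substring(prefix.Length));
        return prefix + "(?:" + string.Join("|", remainders) + ")";
    }
}
```

Calling `AlternationPrefixSketch.Factor(new[] { "this", "that", "there" })` yields `th(?:is|at|ere)`, the shape described above. Branches with no shared prefix are simply rejoined unchanged.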

For example, in this trivial microbenchmark, this|that will be refactored into th(?:is|at). That causes FindFirstChar to generate a Boyer-Moore search for "th" instead of just looking for 't'. It also causes Go to generate a check for "th" before the alternation, so it won't re-check "th" when it has to backtrack and try the second branch of the alternation.

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Text.RegularExpressions;

[MemoryDiagnoser]
public class Program
{
    static void Main(string[] args) => BenchmarkSwitcher.FromAssemblies(new[] { typeof(Program).Assembly }).Run(args);

    private Regex _regex = new Regex(@"this|that", RegexOptions.Compiled);

    [Benchmark] public bool Test() => _regex.IsMatch("now is the time");
}
| Method | Toolchain           | Mean     | Ratio |
|--------|---------------------|----------|-------|
| Test   | \master\corerun.exe | 60.27 ns | 1.00  |
| Test   | \pr\corerun.exe     | 46.22 ns | 0.78  |
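The rewrite is intended to change only performance, not what matches. A quick way to sanity-check that for a hand-factored pattern is to compare match positions and values for both forms. This is a sketch, not part of the PR or its tests, and the `EquivalenceCheck` helper is a hypothetical name.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: compares the matches produced by two patterns
// on the same input, by position and by matched text.
public static class EquivalenceCheck
{
    public static bool SameMatches(string original, string factored, string input)
    {
        MatchCollection a = Regex.Matches(input, original);
        MatchCollection b = Regex.Matches(input, factored);

        if (a.Count != b.Count) return false;

        for (int i = 0; i < a.Count; i++)
        {
            if (a[i].Index != b[i].Index || a[i].Value != b[i].Value) return false;
        }
        return true;
    }
}
```

For instance, `EquivalenceCheck.SameMatches("this|that", "th(?:is|at)", "now is the time")` compares the two forms on the benchmark input, where neither pattern finds a match.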

Without whitespace diff: https://github.com/dotnet/runtime/pull/2171/files?w=1

Contributes to #1349
cc: @danmosemsft, @eerhardt, @pgovind, @ViktorHofer

@stephentoub stephentoub added this to the 5.0 milestone Jan 24, 2020

@eerhardt eerhardt left a comment


:shipit:

@stephentoub stephentoub merged commit 9023aa2 into dotnet:master Jan 27, 2020
@stephentoub stephentoub deleted the alternaterefactor branch January 27, 2020 19:56
@ghost ghost locked as resolved and limited conversation to collaborators Dec 11, 2020