Adding Regex.EnumerateMatches #67794

Merged: 4 commits, Apr 13, 2022


joperezr (Member) commented Apr 9, 2022

Fixes #65011
Fixes #23602

Adds the Regex.EnumerateMatches method, which returns an enumerator that iterates over the matches in a passed-in span. Enumeration is amortized allocation-free.
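
As a rough illustration of the call pattern described above, the sketch below enumerates matches over a span and slices the input to read each match's text. It assumes a runtime that ships this API (.NET 7 or later); the class name `EnumerateMatchesSketch`, the helper `CountLowercaseInitial`, and the sample input are hypothetical and not part of this PR.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EnumerateMatchesSketch
{
    // Hypothetical helper: counts matches whose first character is a lowercase ASCII letter.
    public static int CountLowercaseInitial(Regex regex, ReadOnlySpan<char> input)
    {
        int count = 0;
        foreach (ValueMatch match in regex.EnumerateMatches(input))
        {
            // ValueMatch exposes only Index and Length, so slice the input to read the text.
            ReadOnlySpan<char> value = input.Slice(match.Index, match.Length);
            if (value[0] >= 'a' && value[0] <= 'z')
                count++;
        }
        return count;
    }

    public static void Main()
    {
        // Same pattern as the benchmark in this thread; \w+ never yields an empty match.
        var regex = new Regex(@"\b\w+\b");
        Console.WriteLine(CountLowercaseInitial(regex, "Lorem ipsum dolor Sit amet"));
    }
}
```

Because the enumerator and ValueMatch are structs and no Match objects are created, the loop body only ever works with offsets into the caller's span.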

@dotnet-issue-labeler

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, to please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

ghost assigned joperezr Apr 9, 2022

ghost commented Apr 9, 2022

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

Author: joperezr
Assignees: -
Labels:

area-System.Text.RegularExpressions, new-api-needs-documentation

Milestone: -

joperezr (Member, Author) commented Apr 9, 2022

Here is a quick benchmark I wrote to see how this compares with the existing way to iterate over a MatchCollection using Regex.Matches:

    // The regex pattern is "\b\w+\b" and the input is a five-paragraph lorem ipsum string.

    [Benchmark(Baseline = true)]
    public int MatchCollection()
    {
        int x = 0;
        for (int i = 0; i < 1000; i++)
        {
            foreach (Match match in regex.Matches(loremIpsum))
            {
                if (match.ValueSpan[0] >= 'a' && match.ValueSpan[0] <= 'z')
                    x++;
            }
        }

        return x;
    }

    [Benchmark]
    public int MatchEnumerator()
    {
        int x = 0;
        ReadOnlySpan<char> span = loremIpsum.AsSpan();
        for (int i = 0; i < 1000; i++)
        {
            foreach (ValueMatch word in regex.EnumerateMatches(span))
            {
                if (span.Slice(word.Index, word.Length)[0] >= 'a' && span.Slice(word.Index, word.Length)[0] <= 'z')
                    x++;
            }
        }

        return x;
    }

And the results are:

| Method | Mean | Error | StdDev | Ratio | RatioSD | Gen 0 | Gen 1 | Allocated |
|---|---|---|---|---|---|---|---|---|
| MatchCollection | 208.5 ms | 3.97 ms | 4.72 ms | 1.00 | 0.00 | 43000.0000 | 1000.0000 | 273,544,480 B |
| MatchEnumerator | 146.0 ms | 2.71 ms | 2.67 ms | 0.70 | 0.02 | - | - | - |

stephentoub (Member) commented

@olsaarik, the current NonBacktracking code would benefit from knowing that indexes are needed but captures are not. Will that still be the case after your upcoming fixes?

stephentoub merged commit 89daf96 into dotnet:main Apr 13, 2022
ghost locked as resolved and limited conversation to collaborators May 13, 2022