Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use StartsWith(..., OrdinalIgnoreCase) in RegexCompiler / source generator #66339

Merged
merged 2 commits into from
Mar 10, 2022

Conversation

stephentoub
Copy link
Member

@stephentoub stephentoub commented Mar 8, 2022

Fixes #66324
Depends on #66095
Depends on #61048 (we have a partial temp solution in place, but that will provide the full one)

When we encounter a sequence of sets representing case-insensitive ASCII, we can simplify the code generated to just call StartsWith, which both makes it more readable but also takes advantage of the new JIT optimization to lower that into efficient vectorized comparisons based on the supplied literal.

This also cleans up some formatting in the source generator emitted code to make things much more concise and less noisy.

Example:
In http://\w+.com with RegexOptions.IgnoreCase, the generated code for the "http://" part had looked like:

if ((uint)slice.Length < 7 ||
    ((slice[0] | 0x20) != 'h') || // Match a character in the set [Hh].
    ((slice[1] | 0x20) != 't') || // Match a character in the set [Tt] exactly 2 times.
    ((slice[2] | 0x20) != 't') ||
    ((slice[3] | 0x20) != 'p')) // Match a character in the set [Pp].
{
    return false; // The input didn't match.
}
                    
// Match the string "://".
{
    if (!global::System.MemoryExtensions.StartsWith(slice.Slice(4), "://"))
    {
        return false; // The input didn't match.
    }
}

and with this PR looks like:

if ((uint)slice.Length < 7 ||
    !global::System.MemoryExtensions.StartsWith(slice, "http", global::System.StringComparison.OrdinalIgnoreCase) || // Match the string "http" (ordinal case-insensitive)
    !global::System.MemoryExtensions.StartsWith(slice.Slice(4), "://")) // Match the string "://".
{
    return false; // The input didn't match.
}

I've not measured perf yet and will wait for that until #66095 is merged.

@stephentoub stephentoub added this to the 7.0.0 milestone Mar 8, 2022
@ghost ghost assigned stephentoub Mar 8, 2022
@ghost
Copy link

ghost commented Mar 8, 2022

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

Fixes #66324
Depends on #66095
Depends on #61048 (we have a partial temp solution in place, but that will provide the full one)

When we encounter a sequence of sets representing case-insensitive ASCII, we can simplify the code generated to just call StartsWith, which both makes it more readable but also takes advantage of the new JIT optimization to lower that into efficient vectorized comparisons based on the supplied literal.

This also cleans up some formatting in the source generator emitted code to make things much more concise and less noisy.

Example:
In http://\w+.com, the generated code for the "http://" part had looked like:

if ((uint)slice.Length < 7 ||
    ((slice[0] | 0x20) != 'h') || // Match a character in the set [Hh].
    ((slice[1] | 0x20) != 't') || // Match a character in the set [Tt] exactly 2 times.
    ((slice[2] | 0x20) != 't') ||
    ((slice[3] | 0x20) != 'p')) // Match a character in the set [Pp].
{
    return false; // The input didn't match.
}
                    
// Match the string "://".
{
    if (!global::System.MemoryExtensions.StartsWith(slice.Slice(4), "://"))
    {
        return false; // The input didn't match.
    }
}

and with this PR looks like:

if ((uint)slice.Length < 7 ||
    !global::System.MemoryExtensions.StartsWith(slice, "http", global::System.StringComparison.OrdinalIgnoreCase) || // Match the string "http" (ordinal case-insensitive)
    !global::System.MemoryExtensions.StartsWith(slice.Slice(4), "://")) // Match the string "://".
{
    return false; // The input didn't match.
}
Author: stephentoub
Assignees: -
Labels:

area-System.Text.RegularExpressions, tenet-performance

Milestone: 7.0.0

@MihaZupan
Copy link
Member

Could it instead generate

// Match the string "http://".
if (!global::System.MemoryExtensions.StartsWith(slice, "http://", global::System.StringComparison.OrdinalIgnoreCase))
{
    return false; // The input didn't match.
}

And let MemoryExtensions.StartsWith deal with how to properly handle http vs ://?

@stephentoub
Copy link
Member Author

Could it instead generate

It easily could. Happy to change that if @EgorBo says StartsWith will do the right thing and this will definitively be better.

@joperezr
Copy link
Member

joperezr commented Mar 8, 2022

My question more or less follows along the same lines that @MihaZupan question was doing. Is the main reason why in your example it doesn't look for the whole http:// in StartsWith call due to the fact of us having Set's of 2 chars (which only vary by case) followed by a Multi? If I understand correctly, what is happening in here is that with your code we correctly concatenate when we see several pairs of these sets together, but shouldn't it be also possible to further optimize that and say that if all of these sets are followed by a multi we should try to concatenate that too?

@stephentoub
Copy link
Member Author

Is the main reason why in your example it doesn't look for the whole http:// in StartsWith call due to the fact of us having Set's of 2 chars (which only vary by case) followed by a Multi?

Yes

If I understand correctly, what is happening in here is that with your code we correctly concatenate when we see several pairs of these sets together, but shouldn't it be also possible to further optimize that and say that if all of these sets are followed by a multi we should try to concatenate that too?

It's easy. I just didn't under the assumption that it would be worse. If Egor tells me it'll be better, it'll be a couple of lines to switch it.

@EgorBo
Copy link
Member

EgorBo commented Mar 8, 2022

Could it instead generate

It easily could. Happy to change that if @EgorBo says StartsWith will do the right thing and this will definitively be better.

I'd say it should 🙂
image
(it uses 2 64bit loads because http:// is not long enough to take the SIMD path - it's 7 chars)

#66095 handles 0-32 chars strings. But I'm planning to file a follow-up PR to extend that to 64 (or even 128) for hot blocks.

…rator

When we encounter a sequence of sets representing case-insensitive ASCII, we can simplify the code generated to just call StartsWith, which both makes it more readable but also takes advantage of the new JIT optimization to lower that into efficient vectorized comparisons based on the supplied literal.

This also cleans up some formatting in the source generator emitted code to make things much more concise and less noisy.
@danmoseley
Copy link
Member

Presumably it can omit the length check now - at least, in the case where there's only one call to StartsWith?

@stephentoub
Copy link
Member Author

Presumably it can omit the length check now - at least, in the case where there's only one call to StartsWith?

I've left it in because it's more common than not to have a series of calls all guarded by the same length check, and it wasn't worth special-casing just the case where the only thing guarded is a single StartsWith. I would hope in that case the JIT gets rid of the redundant length check.

@stephentoub stephentoub merged commit 28580bf into dotnet:main Mar 10, 2022
@stephentoub stephentoub deleted the regexstartswithcase branch March 10, 2022 19:11
@stephentoub
Copy link
Member Author

I didn't actually see any perf wins from this locally; we'll see if the perf lab disagrees. However, I didn't see any regressions, either, which means we can get the simpler/smaller/more readable code for the same throughput, which is a win, so I went ahead and merged it.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use StartsWith literal optimization for OrdinalIgnoreCase in RegexCompiler / source generator
5 participants