-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use SearchValues for non-ASCII Regex fallback #89140
Conversation
Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions Issue DetailsWhen we encounter a set for which we can't use one of the IndexOfXx variants, we use IndexOfAny with SearchValues. If the whole set is ASCII, this is easy. But if there are any non-ASCII characters in the set, we currently emit a helper method that first does a search for all of the ASCII values or anything that's non-ASCII, and then if something non-ASCII is found, it proceeds to do a scalar scan matching every character. Now that SearchValues has a vectorized implementation of a probabilistic map, we can instead fall back to that when the set isn't too large and thus can be fully enumerated into a SearchValues instance. We still do this as a two-step process, as searching for the ASCII subset is measurably faster than the probabilistic map search. However, if we see that there's no ASCII at all in the target set, we can then skip the helper entirely and just do the IndexOfAny with the SearchValues for the non-ASCII targets. For example, previously: [GeneratedRegex(@"[a-z]", RegexOptions.IgnoreCase)] would result in emitting this helper (since it's not only inclusive of A-Z and a-z, but also the Kelvin sign) /// <summary>Finds the next index of any character that matches a character in the set [A-Za-z\u212A].</summary>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
internal static int IndexOfNonAsciiOrAny_0E8811570F3718B3BBF9E5001539AEB605A6A55AB8C5DCF3ECD130597FA78CEC(this ReadOnlySpan<char> span)
{
int i = span.IndexOfAnyExcept(Utilities.s_asciiExceptLetters);
if ((uint)i < (uint)span.Length)
{
if (char.IsAscii(span[i]))
{
return i;
}
char ch;
do
{
if (((ch = span[i]) < 128 ? char.IsAsciiLetter(ch) : RegexRunner.CharInClass((char)ch, "\0\u0006\0A[a{KÅ")))
{
return i;
}
i++;
}
while ((uint)i < (uint)span.Length);
}
return -1;
}
/// <summary>Supports searching for characters in or not in "\0\u0001\u0002\u0003\u0004\u0005\u0006\a\b\t\n\v\f\r\u000e\u000f\u0010\u0011\u0012\u0013\u0014\u0015\u0016\u0017\u0018\u0019\u001a\u001b\u001c\u001d\u001e\u001f !\"#$%&'()*+,-./0123456789:;<=>?@[\\]^_`{|}~\u007f".</summary>
internal static readonly SearchValues<char> s_asciiExceptLetters = SearchValues.Create("\0\u0001\u0002\u0003\u0004\u0005\u0006\a\b\t\n\v\f\r\u000e\u000f\u0010\u0011\u0012\u0013\u0014\u0015\u0016\u0017\u0018\u0019\u001a\u001b\u001c\u001d\u001e\u001f !\"#$%&'()*+,-./0123456789:;<=>?@[\\]^_`{|}~\u007f"); That helper is now emitted as: /// <summary>Finds the next index of any character that matches a character in the set [A-Za-z\u212A].</summary>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
internal static int IndexOfNonAsciiOrAny_0E8811570F3718B3BBF9E5001539AEB605A6A55AB8C5DCF3ECD130597FA78CEC(this ReadOnlySpan<char> span)
{
// Search for the first character that's either ASCII and in the target set or non-ASCII (whether or not it's in the target set.
int i = span.IndexOfAnyExcept(Utilities.s_asciiExceptLetters);
if ((uint)i < (uint)span.Length)
{
// If the character at the found position is ASCII, it's in the target set.
if (char.IsAscii(span[i]))
{
return i;
}
// Search for the first character that's in the target set.
int j = span.Slice(i).IndexOfAny(Utilities.s_nonAscii_326E1FD0AD567A84CAD13F2BE521A57789829F59D59ABE37F9E111D0182B6601);
if (j >= 0)
{
return i + j;
}
}
// No match found.
return -1;
}
/// <summary>Supports searching for characters in or not in "\0\u0001\u0002\u0003\u0004\u0005\u0006\a\b\t\n\v\f\r\u000e\u000f\u0010\u0011\u0012\u0013\u0014\u0015\u0016\u0017\u0018\u0019\u001a\u001b\u001c\u001d\u001e\u001f !\"#$%&'()*+,-./0123456789:;<=>?@[\\]^_`{|}~\u007f".</summary>
internal static readonly SearchValues<char> s_asciiExceptLetters = SearchValues.Create("\0\u0001\u0002\u0003\u0004\u0005\u0006\a\b\t\n\v\f\r\u000e\u000f\u0010\u0011\u0012\u0013\u0014\u0015\u0016\u0017\u0018\u0019\u001a\u001b\u001c\u001d\u001e\u001f !\"#$%&'()*+,-./0123456789:;<=>?@[\\]^_`{|}~\u007f");
/// <summary>Supports searching for characters in or not in "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzK".</summary>
internal static readonly SearchValues<char> s_nonAscii_326E1FD0AD567A84CAD13F2BE521A57789829F59D59ABE37F9E111D0182B6601 = SearchValues.Create("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzK");
|
When we encounter a set for which we can't use one of the IndexOfXx variants, we use IndexOfAny with SearchValues. If the whole set is ASCII, this is easy. But if there are any non-ASCII characters in the set, we currently emit a helper method that first does a search for all of the ASCII values or anything that's non-ASCII, and then if something non-ASCII is found, it proceeds to do a scalar scan matching every character. Now that SearchValues has a vectorized implementation of a probabilistic map, we can instead fall back to that when the set isn't too large and thus can be fully enumerated into a SearchValues instance. We still do this as a two-step process, as searching for the ASCII subset is measurably faster than the probabilistic map search. However, if we see that there's no ASCII at all in the target set, we can then skip the helper entirely and just do the IndexOfAny with the SearchValues for the non-ASCII targets.
98063f5
to
bc363cd
Compare
When we encounter a set for which we can't use one of the IndexOfXx variants, we use IndexOfAny with SearchValues. If the whole set is ASCII, this is easy. But if there are any non-ASCII characters in the set, we currently emit a helper method that first does a search for all of the ASCII values or anything that's non-ASCII, and then if something non-ASCII is found, it proceeds to do a scalar scan matching every character.
Now that SearchValues has a vectorized implementation of a probabilistic map, we can instead fall back to that when the set isn't too large and thus can be fully enumerated into a SearchValues instance. We still do this as a two-step process, as searching for the ASCII subset is measurably faster than the probabilistic map search. However, if we see that there's no ASCII at all in the target set, we can then skip the helper entirely and just do the IndexOfAny with the SearchValues for the non-ASCII targets.
For example, previously:
would result in emitting this helper (since it's not only inclusive of A-Z and a-z, but also the Kelvin sign)
That helper is now emitted as: