Overhaul Regex's handling of RegexOptions.IgnoreCase #61048

stephentoub · 2021-10-31T19:40:16Z

This is related to and impacts / is impacted by #59492, but they are distinct.

Today, Regex implements RegexOptions.IgnoreCase by lowercasing everything in the pattern according to the culture at the time of construction, and then at the time of match it lowercases everything in the input lazily as part of evaluating whether something matches. There are multiple issues with this, including the fact that if the cultures differ between construction and match time, this might lead to incorrect results due to differences in how things case. This leads to problems like #60753 and #36147, and it indirectly leads to complications like #36149. Those should all be addressed by properly fixing #59492. This also runs counter to the general argument that if you need to lower or upper case to determine equality, you should upper case.

Beyond the functional issues, however, it also leads to non-trivial performance problems. First, it means we need to call ToLower{Invariant} for every character of the input, and potentially multiple times if there's any backtracking. But worse, any time we come across a case-insensitive character, it knocks us off our fast paths, for a variety of things. For example, we can no longer use vectorized IndexOf{Any} to search for a character, and instead need to walk character-by-character calling ToLower on each to then compare against the lowercase value. If we could instead just have a set of all of the characters that should be treated as equivalent from an ignore-casing perspective, then the rest of our optimizations based on sets should kick in. Let's say the pattern begins with 'a' and is IgnoreCase. Today we end up walking each character, calling ToLower on each, and comparing it to 'a'. If we fix this, we could instead use IndexOfAny('A', 'a'). Or even in situations where we're forced to compare character by character, we could still avoid having to call ToLower, as the set will already contain all valid, which also means we can avoid having to query CultureInfo.CurrentCulture.

We should consider overhauling the scheme employed:

The only culture that matters is the one present when the Regex is constructed (or InvariantCulture if RegexOptions.CultureInvariant is specified). All (rather than just some) casing decisions are made then. This would also effectively answer the primary question in [API Proposal] Add cultureName constructors to GeneratedRegex #59492 (we'd still need to make a decision on what to do for the source generator).
When the regex is constructed and the pattern analyzed, rather than creating sets that contain the original character and its lowercased version, we create sets that contain that character and any others that should be considered equal according to casing rules.
We stop using ToLower everywhere else in the codebase.

Note that the new NonBacktracking engine does something similar, which a) means we have duplicated logic with distinct approaches, and b) we have differences in behavior between the engines, in particular around handling of Turkish I.

This does mean we'll need some mechanism to determine, for any given culture, what characters should be considered equivalent under IgnoreCase. In the extreme, you could imagine that for every character we upper case it, and everything that uppercased to the same value is considered to be part of the same IgnoreCase equivalence class... but that's something we'd ideally not do at run time, given the obvious overheads. We also need to consider how to efficiently handle ranges. Again, the NonBacktracking implementation has some prior art here, but it also has some things we need to address, e.g. #60753 (comment).

cc: @tarekgh, @GrabYourPitchforks, @veanes, @olsaarik, @danmoseley

ghost · 2021-10-31T19:40:20Z

Tagging subscribers to this area: @eerhardt, @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

This is related to and impacts / is impacted by #59492, but they are distinct.

Today, Regex implements RegexOptions.IgnoreCase by lowercasing everything in the pattern according to the culture at the time of construction, and then at the time of match it lowercases everything in the input lazily as part of evaluating whether something matches. There are multiple issues with this, including the fact that if the cultures differ between construction and match time, this might lead to incorrect results due to differences in how things case. This leads to problems like #60753 and #36147, and it indirectly leads to complications like #36149. Those should all be addressed by properly fixing #59492. This also runs counter to the general argument that if you need to lower or upper case to determine equality, you should upper case.

Beyond the functional issues, however, it also leads to non-trivial performance problems. First, it means we need to call ToLower{Invariant} for every character of the input, and potentially multiple times if there's any backtracking. But worse, any time we come across a case-insensitive character, it knocks us off our fast paths, for a variety of things. For example, we can no longer use vectorized IndexOf{Any} to search for a character, and instead need to walk character-by-character calling ToLower on each to then compare against the lowercase value. If we could instead just have a set of all of the characters that should be treated as equivalent from an ignore-casing perspective, then the rest of our optimizations based on sets should kick in. Let's say the pattern begins with 'a' and is IgnoreCase. Today we end up walking each character, calling ToLower on each, and comparing it to 'a'. If we fix this, we could instead use IndexOfAny('A', 'a'). Or even in situations where we're forced to compare character by character, we could still avoid having to call ToLower, as the set will already contain all valid.

We should consider overhauling the scheme employed:

The only culture that matters is the one present when the Regex is constructed. All (rather than just some) casing decisions are made then. This would also effectively answer the primary question in [API Proposal] Add cultureName constructors to GeneratedRegex #59492 (we'd still need to make a decision on what to do for the source generator).
When the pattern is constructed, rather than creating sets that contain the original character and its lowercased version, we create sets that contain that character and any others that should be considered equal according to casing rules.
We stop using ToLower everywhere else in the codebase.

Note that the new NonBacktracking engine does something similar, which a) means we have duplicated logic with distinct approaches, and b) we have differences in behavior between the engines, in particular around handling of Turkish I.

This does mean we'll need some mechanism to determine, for any given culture, what characters should be considered equivalent under IgnoreCase. In the extreme, you could imagine that for every character we upper case it, and everything that uppercased to the same value is considered to be part of the same IgnoreCase equivalence class... but that's something we'd ideally not do at run time, given the obvious overheads. We also need to consider how to efficiently handle ranges. Again, the NonBacktracking implementation has some prior art here, but it also has some things we need to address, e.g. #60753 (comment).

cc: @tarekgh, @GrabYourPitchforks, @veanes, @olsaarik, @danmoseley

Author:	stephentoub
Assignees:	-
Labels:	`area-System.Text.RegularExpressions`
Milestone:	7.0.0

stephentoub · 2021-11-09T05:09:10Z

@tarekgh, @GrabYourPitchforks, @joperezr, and I spent some time talking about this and #59492. General consensus was:

We should proceed to switch Regex over to a plan where the only culture that's relevant is the one at construction time, not match time. While technically that's a breaking change, in practice the split was full of bugs and inconsistent behavior that would make it difficult to have relied on meaningfully.
We should special-case tr/az languages, in particular around 'i', but otherwise we should treat every other culture the same, using the equivalent of CultureInfo.InvariantCulture data. It's already the case that all cultures (other than invariant and the tr/az cultures) have the same casing data, and invariant only differs in a few very specific and corner-case places.
We should ship as part of System.Text.RegularExpressions.dll a case folding table, based on CultureInfo.InvariantCulture, that will enable us to quickly make casing decisions at construction time. If desired, that case folding table can also be used at match time, though ideally that would be avoided as well, in particular for the source generator, and if necessary char.ToInvariantLower/Upper may be used.
We should consider using ToInvariantUpper as that basis for that case folding, rather than ToInvariantLower. While Unicode case folding data is closer in nature to lowercasing, using uppercasing would yield greater consistency with the rest of .NET. However, there may be other advantages to using ToInvariantLower, e.g. fewer non-ASCII characters lowercase to ASCII than do non-ASCII characters uppercase to ASCII.
We should put tests in place to validate that our casing tables are up-to-date with the Unicode tables used by that release of .NET.
For the NonBacktracking engine, we should still aim to be as lazy as possible in building up the necessary state, in order to minimize size impact.
For generating the casing tables, @GrabYourPitchforks and @tarekgh have a tool that's used for the tables built into corelib, and that can be updated to accomodate this need as well.

stephentoub · 2021-11-21T01:52:52Z

One wrinkle we'll need to sort out here are backreferences. I believe it's the one case where we need to compare input text not against the pattern but against other input text. You can have a pattern like:

(\w)\1

which means "match any word character and then match that same character again". And if this were with IgnoreCase, then "that same character again" can be any of the cased variants of that original character, e.g. "Aa" would match successfully. I believe this just means we'll need to consult our case folding table at match time as well, just in the case of IgnoreCase backreferences, which also means for the source generator we'll need to emit the casing table if the regex contains an IgnoreCase backreference. We might be able to be a bit smarter about it, e.g. looking at what the referenced capture contains and seeing whether it could match anything that might possibly vary in case (if the only things the capture could match are categories that participate in case conversion).

joperezr · 2021-11-21T02:50:14Z

Great catch, yeah we’ll make sure we cover that case

stephentoub added the area-System.Text.RegularExpressions label Oct 31, 2021

stephentoub added this to the 7.0.0 milestone Oct 31, 2021

dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Oct 31, 2021

stephentoub removed the untriaged New issue has not been triaged by the area owner label Oct 31, 2021

stephentoub mentioned this issue Nov 18, 2021

[Perf] Regression in Perf_Regex.Uri_IsNotMatch(IgnoreCase, Compiled) #61786

Closed

ghost added the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Nov 21, 2021

stephentoub assigned joperezr Nov 23, 2021

stephentoub mentioned this issue Nov 23, 2021

Regressions in System.Text.RegularExpressions.Tests.Perf_Regex_Common #61972

Closed

This was referenced Dec 6, 2021

Extend Regex 32-bit/64-bit read optimization to IgnoreCase as well #62459

Closed

Improve single-range IgnoreCase codegen for RegexCompiler / source generator #62647

Closed

joperezr mentioned this issue Dec 14, 2021

System.Text.RegularExpressions work planned for .NET 7 #62758

Closed

48 tasks

joperezr removed the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Dec 15, 2021

danmoseley mentioned this issue Dec 15, 2021

NoWarn property is not set correct when using Turkish OS Language dotnet/msbuild#7154

Closed

stephentoub mentioned this issue Jan 9, 2022

Improve auto-atomicity for conditionals #63299

Merged

This was referenced Mar 8, 2022

Use StartsWith literal optimization for OrdinalIgnoreCase in RegexCompiler / source generator #66324

Closed

Use StartsWith(..., OrdinalIgnoreCase) in RegexCompiler / source generator #66339

Merged

joperezr mentioned this issue Mar 8, 2022

Updating Unicode files to use 14.0.0 #66362

Merged

joperezr mentioned this issue Mar 26, 2022

Changing the logic for how we deal with RegexOptions.IgnoreCase matching. #67184

Merged

ghost added the in-pr There is an active PR which will close this issue when it is merged label Mar 26, 2022

joperezr mentioned this issue Apr 5, 2022

Consider using RegexCasingEquivalence table for case insensitive matches when the pattern contains backreferences #67624

Closed

stephentoub closed this as completed in #67184 Apr 6, 2022

ghost removed the in-pr There is an active PR which will close this issue when it is merged label Apr 6, 2022

joperezr mentioned this issue Apr 14, 2022

What's new in .NET 7 Preview 4 [WIP] dotnet/core#7378

Closed

ghost locked as resolved and limited conversation to collaborators May 6, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Overhaul Regex's handling of RegexOptions.IgnoreCase #61048

Overhaul Regex's handling of RegexOptions.IgnoreCase #61048

stephentoub commented Oct 31, 2021 •

edited

Loading

ghost commented Oct 31, 2021

stephentoub commented Nov 9, 2021 •

edited

Loading

stephentoub commented Nov 21, 2021 •

edited

Loading

joperezr commented Nov 21, 2021

Overhaul Regex's handling of RegexOptions.IgnoreCase #61048

Overhaul Regex's handling of RegexOptions.IgnoreCase #61048

Comments

stephentoub commented Oct 31, 2021 • edited Loading

ghost commented Oct 31, 2021

stephentoub commented Nov 9, 2021 • edited Loading

stephentoub commented Nov 21, 2021 • edited Loading

joperezr commented Nov 21, 2021

stephentoub commented Oct 31, 2021 •

edited

Loading

stephentoub commented Nov 9, 2021 •

edited

Loading

stephentoub commented Nov 21, 2021 •

edited

Loading