Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Overhaul Regex's handling of RegexOptions.IgnoreCase #61048

Closed
stephentoub opened this issue Oct 31, 2021 · 4 comments · Fixed by #67184
Closed

Overhaul Regex's handling of RegexOptions.IgnoreCase #61048

stephentoub opened this issue Oct 31, 2021 · 4 comments · Fixed by #67184

Comments

@stephentoub
Copy link
Member

stephentoub commented Oct 31, 2021

This is related to and impacts / is impacted by #59492, but they are distinct.

Today, Regex implements RegexOptions.IgnoreCase by lowercasing everything in the pattern according to the culture at the time of construction, and then at the time of match it lowercases everything in the input lazily as part of evaluating whether something matches. There are multiple issues with this, including the fact that if the cultures differ between construction and match time, this might lead to incorrect results due to differences in how things case. This leads to problems like #60753 and #36147, and it indirectly leads to complications like #36149. Those should all be addressed by properly fixing #59492. This also runs counter to the general argument that if you need to lower or upper case to determine equality, you should upper case.

Beyond the functional issues, however, it also leads to non-trivial performance problems. First, it means we need to call ToLower{Invariant} for every character of the input, and potentially multiple times if there's any backtracking. But worse, any time we come across a case-insensitive character, it knocks us off our fast paths, for a variety of things. For example, we can no longer use vectorized IndexOf{Any} to search for a character, and instead need to walk character-by-character calling ToLower on each to then compare against the lowercase value. If we could instead just have a set of all of the characters that should be treated as equivalent from an ignore-casing perspective, then the rest of our optimizations based on sets should kick in. Let's say the pattern begins with 'a' and is IgnoreCase. Today we end up walking each character, calling ToLower on each, and comparing it to 'a'. If we fix this, we could instead use IndexOfAny('A', 'a'). Or even in situations where we're forced to compare character by character, we could still avoid having to call ToLower, as the set will already contain all valid, which also means we can avoid having to query CultureInfo.CurrentCulture.

We should consider overhauling the scheme employed:

  1. The only culture that matters is the one present when the Regex is constructed (or InvariantCulture if RegexOptions.CultureInvariant is specified). All (rather than just some) casing decisions are made then. This would also effectively answer the primary question in [API Proposal] Add cultureName constructors to GeneratedRegex #59492 (we'd still need to make a decision on what to do for the source generator).
  2. When the regex is constructed and the pattern analyzed, rather than creating sets that contain the original character and its lowercased version, we create sets that contain that character and any others that should be considered equal according to casing rules.
  3. We stop using ToLower everywhere else in the codebase.

Note that the new NonBacktracking engine does something similar, which a) means we have duplicated logic with distinct approaches, and b) we have differences in behavior between the engines, in particular around handling of Turkish I.

This does mean we'll need some mechanism to determine, for any given culture, what characters should be considered equivalent under IgnoreCase. In the extreme, you could imagine that for every character we upper case it, and everything that uppercased to the same value is considered to be part of the same IgnoreCase equivalence class... but that's something we'd ideally not do at run time, given the obvious overheads. We also need to consider how to efficiently handle ranges. Again, the NonBacktracking implementation has some prior art here, but it also has some things we need to address, e.g. #60753 (comment).

cc: @tarekgh, @GrabYourPitchforks, @veanes, @olsaarik, @danmoseley

@ghost
Copy link

ghost commented Oct 31, 2021

Tagging subscribers to this area: @eerhardt, @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

This is related to and impacts / is impacted by #59492, but they are distinct.

Today, Regex implements RegexOptions.IgnoreCase by lowercasing everything in the pattern according to the culture at the time of construction, and then at the time of match it lowercases everything in the input lazily as part of evaluating whether something matches. There are multiple issues with this, including the fact that if the cultures differ between construction and match time, this might lead to incorrect results due to differences in how things case. This leads to problems like #60753 and #36147, and it indirectly leads to complications like #36149. Those should all be addressed by properly fixing #59492. This also runs counter to the general argument that if you need to lower or upper case to determine equality, you should upper case.

Beyond the functional issues, however, it also leads to non-trivial performance problems. First, it means we need to call ToLower{Invariant} for every character of the input, and potentially multiple times if there's any backtracking. But worse, any time we come across a case-insensitive character, it knocks us off our fast paths, for a variety of things. For example, we can no longer use vectorized IndexOf{Any} to search for a character, and instead need to walk character-by-character calling ToLower on each to then compare against the lowercase value. If we could instead just have a set of all of the characters that should be treated as equivalent from an ignore-casing perspective, then the rest of our optimizations based on sets should kick in. Let's say the pattern begins with 'a' and is IgnoreCase. Today we end up walking each character, calling ToLower on each, and comparing it to 'a'. If we fix this, we could instead use IndexOfAny('A', 'a'). Or even in situations where we're forced to compare character by character, we could still avoid having to call ToLower, as the set will already contain all valid.

We should consider overhauling the scheme employed:

  1. The only culture that matters is the one present when the Regex is constructed. All (rather than just some) casing decisions are made then. This would also effectively answer the primary question in [API Proposal] Add cultureName constructors to GeneratedRegex #59492 (we'd still need to make a decision on what to do for the source generator).
  2. When the pattern is constructed, rather than creating sets that contain the original character and its lowercased version, we create sets that contain that character and any others that should be considered equal according to casing rules.
  3. We stop using ToLower everywhere else in the codebase.

Note that the new NonBacktracking engine does something similar, which a) means we have duplicated logic with distinct approaches, and b) we have differences in behavior between the engines, in particular around handling of Turkish I.

This does mean we'll need some mechanism to determine, for any given culture, what characters should be considered equivalent under IgnoreCase. In the extreme, you could imagine that for every character we upper case it, and everything that uppercased to the same value is considered to be part of the same IgnoreCase equivalence class... but that's something we'd ideally not do at run time, given the obvious overheads. We also need to consider how to efficiently handle ranges. Again, the NonBacktracking implementation has some prior art here, but it also has some things we need to address, e.g. #60753 (comment).

cc: @tarekgh, @GrabYourPitchforks, @veanes, @olsaarik, @danmoseley

Author: stephentoub
Assignees: -
Labels:

area-System.Text.RegularExpressions

Milestone: 7.0.0

@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Oct 31, 2021
@stephentoub stephentoub removed the untriaged New issue has not been triaged by the area owner label Oct 31, 2021
@stephentoub
Copy link
Member Author

stephentoub commented Nov 9, 2021

@tarekgh, @GrabYourPitchforks, @joperezr, and I spent some time talking about this and #59492. General consensus was:

  1. We should proceed to switch Regex over to a plan where the only culture that's relevant is the one at construction time, not match time. While technically that's a breaking change, in practice the split was full of bugs and inconsistent behavior that would make it difficult to have relied on meaningfully.
  2. We should special-case tr/az languages, in particular around 'i', but otherwise we should treat every other culture the same, using the equivalent of CultureInfo.InvariantCulture data. It's already the case that all cultures (other than invariant and the tr/az cultures) have the same casing data, and invariant only differs in a few very specific and corner-case places.
  3. We should ship as part of System.Text.RegularExpressions.dll a case folding table, based on CultureInfo.InvariantCulture, that will enable us to quickly make casing decisions at construction time. If desired, that case folding table can also be used at match time, though ideally that would be avoided as well, in particular for the source generator, and if necessary char.ToInvariantLower/Upper may be used.
  4. We should consider using ToInvariantUpper as that basis for that case folding, rather than ToInvariantLower. While Unicode case folding data is closer in nature to lowercasing, using uppercasing would yield greater consistency with the rest of .NET. However, there may be other advantages to using ToInvariantLower, e.g. fewer non-ASCII characters lowercase to ASCII than do non-ASCII characters uppercase to ASCII.
  5. We should put tests in place to validate that our casing tables are up-to-date with the Unicode tables used by that release of .NET.
  6. For the NonBacktracking engine, we should still aim to be as lazy as possible in building up the necessary state, in order to minimize size impact.
  7. For generating the casing tables, @GrabYourPitchforks and @tarekgh have a tool that's used for the tables built into corelib, and that can be updated to accomodate this need as well.

@stephentoub
Copy link
Member Author

stephentoub commented Nov 21, 2021

One wrinkle we'll need to sort out here are backreferences. I believe it's the one case where we need to compare input text not against the pattern but against other input text. You can have a pattern like:

(\w)\1

which means "match any word character and then match that same character again". And if this were with IgnoreCase, then "that same character again" can be any of the cased variants of that original character, e.g. "Aa" would match successfully. I believe this just means we'll need to consult our case folding table at match time as well, just in the case of IgnoreCase backreferences, which also means for the source generator we'll need to emit the casing table if the regex contains an IgnoreCase backreference. We might be able to be a bit smarter about it, e.g. looking at what the referenced capture contains and seeing whether it could match anything that might possibly vary in case (if the only things the capture could match are categories that participate in case conversion).

@ghost ghost added the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Nov 21, 2021
@joperezr
Copy link
Member

Great catch, yeah we’ll make sure we cover that case

@joperezr joperezr removed the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Dec 15, 2021
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Mar 26, 2022
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Apr 6, 2022
@ghost ghost locked as resolved and limited conversation to collaborators May 6, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants