-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Consider using RegexCasingEquivalence table for case insensitive matches when the pattern contains backreferences #67624
Comments
Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions Issue DetailsWith #67184 we changed the logic of how case-insensitive comparisons are made between the pattern and the input. Before that change, we used to call ToLower() on the pattern and ToLower() on the input and see if they matched, but this design had several problems (see #61048 for more details). After that change, we are now using a Casing equivalence table built into the assembly which tells us given a character That said, we still have one scenario in which we are not yet using the case equivalence table to perform the case insensitive comparisons, which is when the pattern has backreferences. An example of this, would be if you have a pattern like For now, we have opted to make backreference case-insensitive comparisons to fall back to use This issue is in order to track the work of re-evaluating the decision of not using the Regex casing table when matching backreferences, and to start the discussion to see what we should do for .NET 7 RTM. cc: @danmoseley @stephentoub @tarekgh
|
With #67184 we changed the logic of how case-insensitive comparisons are made between the pattern and the input. Before that change, we used to call ToLower() on the pattern and ToLower() on the input and see if they matched, but this design had several problems (see #61048 for more details). After that change, we are now using a Casing equivalence table built into the assembly which tells us given a character
c
which characters should be considered equivalent toc
. Then, using that table we can transform patterns that look likeABC
into[Aa][Bb][Cc]
.That said, we still have one scenario in which we are not yet using the case equivalence table to perform the case insensitive comparisons, which is when the pattern has backreferences. An example of this, would be if you have a pattern like
(A)\1
, we would now be transforming it to([Aa])\1
. If you use that pattern to try to match something likeaA
, then you need to be able to perform the case insensitive comparison between the next character in the input, and what previously matched at position 1. This is not using the case equivalence table today, mainly because doing so would mean that we would need to either expose it publicly so that the source generator engine could consume it (given source generator engine consumes it from an external assembly) or to not support IgnoreCase Backreference patterns all together with the source generator engine.For now, we have opted to make backreference case-insensitive comparisons to fall back to use
TextInfo(a).ToLower() == TextInfo(b).ToLower()
which should have the same behavior than our Regex case equivalence table in the vast majority of cases. However, we did find some cases where there could be inconsistencies with our Regex case equivalence table, for example, if the application is running on Windows 7 or Windows 8.1, or if the application is using NLS Globalization. The thing that all of those three cases have in common, is that they all use NLS for the case conversions, and we have found that in the InvariantCulture there are around 11 cases where the case conversions don't match the ones that would be created if running with ICU Globalization (and hence, they don't match the Regex casing table).This issue is in order to track the work of re-evaluating the decision of not using the Regex casing table when matching backreferences, and to start the discussion to see what we should do for .NET 7 RTM.
cc: @danmoseley @stephentoub @tarekgh
The text was updated successfully, but these errors were encountered: