Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider using RegexCasingEquivalence table for case insensitive matches when the pattern contains backreferences #67624

Closed
joperezr opened this issue Apr 5, 2022 · 1 comment · Fixed by #67977

Comments

@joperezr
Copy link
Member

joperezr commented Apr 5, 2022

With #67184 we changed the logic of how case-insensitive comparisons are made between the pattern and the input. Before that change, we used to call ToLower() on the pattern and ToLower() on the input and see if they matched, but this design had several problems (see #61048 for more details). After that change, we are now using a Casing equivalence table built into the assembly which tells us given a character c which characters should be considered equivalent to c. Then, using that table we can transform patterns that look like ABC into [Aa][Bb][Cc].

That said, we still have one scenario in which we are not yet using the case equivalence table to perform the case insensitive comparisons, which is when the pattern has backreferences. An example of this, would be if you have a pattern like (A)\1, we would now be transforming it to ([Aa])\1. If you use that pattern to try to match something like aA, then you need to be able to perform the case insensitive comparison between the next character in the input, and what previously matched at position 1. This is not using the case equivalence table today, mainly because doing so would mean that we would need to either expose it publicly so that the source generator engine could consume it (given source generator engine consumes it from an external assembly) or to not support IgnoreCase Backreference patterns all together with the source generator engine.

For now, we have opted to make backreference case-insensitive comparisons to fall back to use TextInfo(a).ToLower() == TextInfo(b).ToLower() which should have the same behavior than our Regex case equivalence table in the vast majority of cases. However, we did find some cases where there could be inconsistencies with our Regex case equivalence table, for example, if the application is running on Windows 7 or Windows 8.1, or if the application is using NLS Globalization. The thing that all of those three cases have in common, is that they all use NLS for the case conversions, and we have found that in the InvariantCulture there are around 11 cases where the case conversions don't match the ones that would be created if running with ICU Globalization (and hence, they don't match the Regex casing table).

This issue is in order to track the work of re-evaluating the decision of not using the Regex casing table when matching backreferences, and to start the discussion to see what we should do for .NET 7 RTM.

cc: @danmoseley @stephentoub @tarekgh

@joperezr joperezr added this to the 7.0.0 milestone Apr 5, 2022
@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Apr 5, 2022
@ghost
Copy link

ghost commented Apr 5, 2022

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

With #67184 we changed the logic of how case-insensitive comparisons are made between the pattern and the input. Before that change, we used to call ToLower() on the pattern and ToLower() on the input and see if they matched, but this design had several problems (see #61048 for more details). After that change, we are now using a Casing equivalence table built into the assembly which tells us given a character c which characters should be considered equivalent to c. Then, using that table we can transform patterns that look like ABC into [Aa][Bb][Cc].

That said, we still have one scenario in which we are not yet using the case equivalence table to perform the case insensitive comparisons, which is when the pattern has backreferences. An example of this, would be if you have a pattern like (A)\1, we would now be transforming it to ([Aa])\1. If you use that pattern to try to match something like aA, then you need to be able to perform the case insensitive comparison between the next character in the input, and what previously matched at position 1. This is not using the case equivalence table today, mainly because doing so would mean that we would need to either expose it publicly so that the source generator engine could consume it (given source generator engine consumes it from an external assembly) or to not support IgnoreCase Backreference patterns all together with the source generator engine.

For now, we have opted to make backreference case-insensitive comparisons to fall back to use TextInfo(a).ToLower() == TextInfo(b).ToLower() which should have the same behavior than our Regex case equivalence table in the vast majority of cases. However, we did find some cases where there could be inconsistencies with our Regex case equivalence table, for example, if the application is running on Windows 7 or Windows 8.1, or if the application is using NLS Globalization. The thing that all of those three cases have in common, is that they all use NLS for the case conversions, and we have found that in the InvariantCulture there are around 11 cases where the case conversions don't match the ones that would be created if running with ICU Globalization (and hence, they don't match the Regex casing table).

This issue is in order to track the work of re-evaluating the decision of not using the Regex casing table when matching backreferences, and to start the discussion to see what we should do for .NET 7 RTM.

cc: @danmoseley @stephentoub @tarekgh

Author: joperezr
Assignees: -
Labels:

area-System.Text.RegularExpressions

Milestone: 7.0.0

@joperezr joperezr self-assigned this Apr 5, 2022
@joperezr joperezr removed the untriaged New issue has not been triaged by the area owner label Apr 5, 2022
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Apr 13, 2022
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Apr 20, 2022
@ghost ghost locked as resolved and limited conversation to collaborators May 21, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant