Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use RegexCaseEquivalence table for case-insensitive backreferences #67977

Merged
merged 3 commits into from
Apr 20, 2022

Conversation

joperezr
Copy link
Member

Fixes #67624

This change will now make it so we no longer have a potential inconsistency between how we evaluate case comparisons with regular characters vs how we do it for backreferences by making sure the latter also use the same RegexCaseEquivalence table. This change will also make it so that the SourceGenerator won't be able to create a custom regex-derived type if a pattern is case-insensitive and contains backreferences, given the Regex Case equivalence table is internal.

@ghost
Copy link

ghost commented Apr 13, 2022

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

Fixes #67624

This change will now make it so we no longer have a potential inconsistency between how we evaluate case comparisons with regular characters vs how we do it for backreferences by making sure the latter also use the same RegexCaseEquivalence table. This change will also make it so that the SourceGenerator won't be able to create a custom regex-derived type if a pattern is case-insensitive and contains backreferences, given the Regex Case equivalence table is internal.

Author: joperezr
Assignees: joperezr
Labels:

area-System.Text.RegularExpressions

Milestone: -

@danmoseley
Copy link
Member

This change will also make it so that the SourceGenerator won't be able to create a custom regex-derived type if a pattern is case-insensitive and contains backreferences, given the Regex Case equivalence table is internal.

How many patterns in the corpus would that affect? Presumably one option is for the SourceGenerator to paste in the table if it's forced to.

@joperezr
Copy link
Member Author

@stephentoub has the actual numbers but when we talked about this offline the number was very low. IIRC only less than 5% used backreferences, and out of those, an even smaller percentage was also using IgnoreCase. I may be misremembering, but I think that was making it so that less than .05% of the patterns in our corpus would be affected by this.

@joperezr
Copy link
Member Author

Emiting the table with the source generator is one option but of course it is very expensive, another option that was considered was that instead of disabling the source generator for all case-insensitive backreferences, we only did it for cases where the capture group for the backreference is actually involved in case conversion (so for example, patterns like (?i)(\s)\1 would still be supported), but given that we think the number of cases for that will be very low, I'm not doing that here yet.

@stephentoub
Copy link
Member

How many patterns in the corpus would that affect?

Out of 18,964 patterns in the corpus, a grand total of 78 have an IgnoreCase Backreference. I don't believe it's worth emitting a 50K case-folding table into the user's assembly for such cases. And the actual number that need the table is much less; I can't quantify it without actually doing the work, but the majority of such backreferences from manual inspection are referring to symbols, e.g. matching quotes, so if we're concerned with the 0.41%, we can lower that further as Jose says by following the backreference to the capture and examining that part of the tree to determine whether casing could even apply.

@stephentoub
Copy link
Member

stephentoub commented Apr 13, 2022

I can't quantify it without actually doing the work

That didn't sit right with me, so I did the work. If we followed the backreference and removed the IgnoreCase for situations where it's not relevant anyway, the 78 drops to 29, so 0.15%.

@@ -9,14 +9,19 @@ internal sealed class CompiledRegexRunner : RegexRunner
{
private readonly ScanDelegate _scanMethod;
/// <summary>This field will only be set if the pattern contains backreferences and has RegexOptions.IgnoreCase</summary>
private readonly TextInfo? _textInfo;
private readonly CultureInfo? _culture;
#pragma warning disable CA1823 // Avoid unused private fields. Justification: Used via reflection to cache the Case behavior if needed.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hate the way this looks, but I'm not sure what else could be done instead. I tried using suppression attributes (which look cleaner) and it works for the CodeAnalysis warning but not for the CS warning around unused field. The other option would be to add NoWarn to the project, but I don't like global suppressions which is why I opted for the local suppression instead, but I'm open for suggestions here.

@joperezr joperezr force-pushed the IgnoreCaseBackreferences branch from dda34b3 to de0fafc Compare April 15, 2022 22:11
@joperezr joperezr force-pushed the IgnoreCaseBackreferences branch from a73e5cb to 04271f7 Compare April 18, 2022 21:42
@joperezr joperezr requested a review from stephentoub April 20, 2022 16:28
Copy link
Member

@stephentoub stephentoub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@joperezr joperezr merged commit d5ce000 into dotnet:main Apr 20, 2022
@joperezr joperezr deleted the IgnoreCaseBackreferences branch April 20, 2022 21:39
@ghost ghost locked as resolved and limited conversation to collaborators May 21, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
4 participants