-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use RegexCaseEquivalence table for case-insensitive backreferences #67977
Conversation
Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions Issue DetailsFixes #67624 This change will now make it so we no longer have a potential inconsistency between how we evaluate case comparisons with regular characters vs how we do it for backreferences by making sure the latter also use the same RegexCaseEquivalence table. This change will also make it so that the SourceGenerator won't be able to create a custom regex-derived type if a pattern is case-insensitive and contains backreferences, given the Regex Case equivalence table is internal.
|
How many patterns in the corpus would that affect? Presumably one option is for the SourceGenerator to paste in the table if it's forced to. |
@stephentoub has the actual numbers but when we talked about this offline the number was very low. IIRC only less than 5% used backreferences, and out of those, an even smaller percentage was also using IgnoreCase. I may be misremembering, but I think that was making it so that less than .05% of the patterns in our corpus would be affected by this. |
Emiting the table with the source generator is one option but of course it is very expensive, another option that was considered was that instead of disabling the source generator for all case-insensitive backreferences, we only did it for cases where the capture group for the backreference is actually involved in case conversion (so for example, patterns like |
Out of 18,964 patterns in the corpus, a grand total of 78 have an IgnoreCase Backreference. I don't believe it's worth emitting a 50K case-folding table into the user's assembly for such cases. And the actual number that need the table is much less; I can't quantify it without actually doing the work, but the majority of such backreferences from manual inspection are referring to symbols, e.g. matching quotes, so if we're concerned with the 0.41%, we can lower that further as Jose says by following the backreference to the capture and examining that part of the tree to determine whether casing could even apply. |
That didn't sit right with me, so I did the work. If we followed the backreference and removed the IgnoreCase for situations where it's not relevant anyway, the 78 drops to 29, so 0.15%. |
...libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCompiler.cs
Show resolved
Hide resolved
src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Emitter.cs
Show resolved
Hide resolved
...libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCompiler.cs
Outdated
Show resolved
Hide resolved
...libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCompiler.cs
Outdated
Show resolved
Hide resolved
...libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCompiler.cs
Outdated
Show resolved
Hide resolved
...libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCompiler.cs
Outdated
Show resolved
Hide resolved
...libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCompiler.cs
Outdated
Show resolved
Hide resolved
...raries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexInterpreter.cs
Show resolved
Hide resolved
...raries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexInterpreter.cs
Outdated
Show resolved
Hide resolved
...raries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexInterpreter.cs
Show resolved
Hide resolved
src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexNode.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexNode.cs
Outdated
Show resolved
Hide resolved
@@ -9,14 +9,19 @@ internal sealed class CompiledRegexRunner : RegexRunner | |||
{ | |||
private readonly ScanDelegate _scanMethod; | |||
/// <summary>This field will only be set if the pattern contains backreferences and has RegexOptions.IgnoreCase</summary> | |||
private readonly TextInfo? _textInfo; | |||
private readonly CultureInfo? _culture; | |||
#pragma warning disable CA1823 // Avoid unused private fields. Justification: Used via reflection to cache the Case behavior if needed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hate the way this looks, but I'm not sure what else could be done instead. I tried using suppression attributes (which look cleaner) and it works for the CodeAnalysis warning but not for the CS warning around unused field. The other option would be to add NoWarn to the project, but I don't like global suppressions which is why I opted for the local suppression instead, but I'm open for suggestions here.
dda34b3
to
de0fafc
Compare
src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Emitter.cs
Outdated
Show resolved
Hide resolved
...s/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCaseEquivalences.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexNode.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexNode.cs
Outdated
Show resolved
Hide resolved
a73e5cb
to
04271f7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Fixes #67624
This change will now make it so we no longer have a potential inconsistency between how we evaluate case comparisons with regular characters vs how we do it for backreferences by making sure the latter also use the same RegexCaseEquivalence table. This change will also make it so that the SourceGenerator won't be able to create a custom regex-derived type if a pattern is case-insensitive and contains backreferences, given the Regex Case equivalence table is internal.