-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Incorrect Regex matching in Turkish culture when ignoring case #58958
Comments
Tagging subscribers to this area: @eerhardt, @dotnet/area-system-text-regularexpressions Issue DetailsDescriptionThe combination of ignoring case and using intervals that involve Configuration.NET 6.0 preview Regression?Seems so. At least the below code works correctly in .NET 5.0. Other informationExpected behavior is that the following code prints
|
Appears to be a legit regression. Per https://docs.microsoft.com/dotnet/standard/base-types/regular-expression-options#default-options, regex matching uses the current culture by default unless If this change was intentional, we should add an entry to the breaking changes doc. |
There's a simpler repro that doesn't involve changing cultures: using System;
using System.Text.RegularExpressions;
class Program
{
static void Main()
{
var r = new Regex("[\u0120-\u0130]", RegexOptions.IgnoreCase);
bool result = r.IsMatch("\u0130");
Console.WriteLine(result);
}
} That prints true on .NET Framework 4.8 and .NET 5.0, but false on .NET 6.0. This is due to #42282. With that change, the character class being created for This is complicated code. #42282 was fixing #36149, which has been around forever. I suggest we revert #42282 for .NET 6, as that PR was just trading off one set of bugs for another, and instead stick with the bugs we've had there since the dawn of this codebase. We can revisit for .NET 7. cc: @jeffhandley, @pgovind, @danmoseley |
Seems reasonable. |
Thanks for the investigation, @stephentoub! I agree with the recommendation. @pgovind -- Can you take the assignment of reverting #42282? I recommend making 2 PRs: 1 that is a simple revert, and one that includes:
I imagine we'd only port the clean revert PR into 6.0 GA, without porting the extra tests. |
I just looked at this locally too and happily found this comment: #42282 (comment) discussing this exact behavior. Without going too deep, it stems from the way we populate the I'm still ok reverting #42282 if that's what we want to do. Technically this is a regression, yea, but we could reasonably make the case that is an acceptable breaking change and that this will get fixed when #36147 gets fixed. |
Thanks, @pgovind; that's really helpful info. At this stage in 6.0, I'd prefer to revert, get back to the preexisting behavior, and try again in 7.0. |
FWIW, I don't think it's defensible that these don't all produce the same result: using System.Text.RegularExpressions;
Console.WriteLine(new Regex("\u0130", RegexOptions.IgnoreCase).IsMatch("\u0130"));
Console.WriteLine(new Regex("[\u012F-\u0130]", RegexOptions.IgnoreCase).IsMatch("\u0130"));
Console.WriteLine(new Regex("[\u012F\u0130]", RegexOptions.IgnoreCase).IsMatch("\u0130")); |
Description
The combination of ignoring case and using intervals that involve
\u0130
(Turkish I with dot) and\u0131
(Turkish i without dot) gives wrong matching results as the repo shows.Configuration
.NET 6.0 preview
Regression?
Seems so. At least the below code works correctly in .NET 5.0.
Other information
Expected behavior is that the following code prints
True
but it printsFalse
.The pattern below must trivially match the input because all of the letters fall in the given intervals
IgnoreCase
can only add letters (not remove letters) so the match must hold.If the
IgnoreCase
option is omitted the code works correctly.The text was updated successfully, but these errors were encountered: