Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect Regex matching in Turkish culture when ignoring case #58958

Closed
veanes opened this issue Sep 10, 2021 · 8 comments · Fixed by #59425
Closed

Incorrect Regex matching in Turkish culture when ignoring case #58958

veanes opened this issue Sep 10, 2021 · 8 comments · Fixed by #59425

Comments

@veanes
Copy link
Contributor

veanes commented Sep 10, 2021

Description

The combination of ignoring case and using intervals that involve \u0130 (Turkish I with dot) and \u0131 (Turkish i without dot) gives wrong matching results as the repo shows.

Configuration

.NET 6.0 preview

Regression?

Seems so. At least the below code works correctly in .NET 5.0.

Other information

Expected behavior is that the following code prints True but it prints False.
The pattern below must trivially match the input because all of the letters fall in the given intervals
IgnoreCase can only add letters (not remove letters) so the match must hold.
If the IgnoreCase option is omitted the code works correctly.

using System.Text.RegularExpressions;
using System.Globalization;

namespace ConsoleApp1
{
    class Program
    {
        static void Main(string[] args)
        {
            string input = "I\u0131\u0130i";
            string pattern = "[H-J][\u0131-\u0140][\u0120-\u0130][h-j]";

            var culture = CultureInfo.CurrentCulture;
            CultureInfo.CurrentCulture = new CultureInfo("tr-TR");
            Regex re = new Regex(pattern, RegexOptions.IgnoreCase);
            CultureInfo.CurrentCulture = culture;
            Console.WriteLine(re.IsMatch(input));
        }
    }
}

@dotnet-issue-labeler dotnet-issue-labeler bot added area-System.Text.RegularExpressions untriaged New issue has not been triaged by the area owner labels Sep 10, 2021
@ghost
Copy link

ghost commented Sep 10, 2021

Tagging subscribers to this area: @eerhardt, @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

Description

The combination of ignoring case and using intervals that involve \u0130 (Turkish I with dot) and \u0131 (Turkish i without dot) gives wrong matching results as the repo shows.

Configuration

.NET 6.0 preview

Regression?

Seems so. At least the below code works correctly in .NET 5.0.

Other information

Expected behavior is that the following code prints True but it prints False.
The pattern below must trivially match the input because all of the letters fall in the given intervals
IgnoreCase can only add letters (not remove letters) so the match must hold.
If the IgnoreCase option is omitted the code works correctly.

using System.Text.RegularExpressions;
using System.Globalization;

namespace ConsoleApp1
{
    class Program
    {
        static void Main(string[] args)
        {
            string input = "I\u0131\u0130i";
            string pattern = "[H-J][\u0131-\u0140][\u0120-\u0130][h-j]";

            var culture = CultureInfo.CurrentCulture;
            CultureInfo.CurrentCulture = new CultureInfo("tr-TR");
            Regex re = new Regex(pattern, RegexOptions.IgnoreCase);
            CultureInfo.CurrentCulture = culture;
            Console.WriteLine(re.IsMatch(input));
        }
    }
}

Author: veanes
Assignees: -
Labels:

area-System.Text.RegularExpressions, untriaged

Milestone: -

@GrabYourPitchforks
Copy link
Member

Appears to be a legit regression. Per https://docs.microsoft.com/dotnet/standard/base-types/regular-expression-options#default-options, regex matching uses the current culture by default unless RegexOptions.CultureInvariant is passed as an argument.

If this change was intentional, we should add an entry to the breaking changes doc.

@stephentoub
Copy link
Member

stephentoub commented Sep 21, 2021

There's a simpler repro that doesn't involve changing cultures:

using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        var r = new Regex("[\u0120-\u0130]", RegexOptions.IgnoreCase);
        bool result = r.IsMatch("\u0130");
        Console.WriteLine(result);
    }
}

That prints true on .NET Framework 4.8 and .NET 5.0, but false on .NET 6.0.

This is due to #42282. With that change, the character class being created for [\u0120-\u0130] only includes the one range and doesn't add any lowercase versions of the characters in that range, even though when the matcher then does char.ToLower('\u0130') as part of the match, it produces a new value outside of that range (all cultures with the exception of invariant lowercase \u0130 to \u0069). Before #42282, the set was also augmented to include \u0069.

This is complicated code. #42282 was fixing #36149, which has been around forever. I suggest we revert #42282 for .NET 6, as that PR was just trading off one set of bugs for another, and instead stick with the bugs we've had there since the dawn of this codebase. We can revisit for .NET 7.

cc: @jeffhandley, @pgovind, @danmoseley

@danmoseley
Copy link
Member

Seems reasonable.

@jeffhandley
Copy link
Member

Thanks for the investigation, @stephentoub! I agree with the recommendation.

@pgovind -- Can you take the assignment of reverting #42282? I recommend making 2 PRs: 1 that is a simple revert, and one that includes:

  1. Tests that had been added to assert Regex is incorrectly handling casing of some ranges #36149 was fixed, but with the tests marked as ignored, referencing the issue
  2. Tests that illustrate this issue in the simple repro, asserting that with the revert in place, those tests pass

I imagine we'd only port the clean revert PR into 6.0 GA, without porting the extra tests.

@pgovind
Copy link

pgovind commented Sep 21, 2021

I just looked at this locally too and happily found this comment: #42282 (comment) discussing this exact behavior. Without going too deep, it stems from the way we populate the s_lcTable table. Essentially s_lcTable is pre-populated with the en-US culture, but the regex constructor calls culture.char.ToLower() and AddToLowercase which could use a different culture setting than s_lcTable. We could say this bug is a dupe (at the least, related) of #36147.

I'm still ok reverting #42282 if that's what we want to do. Technically this is a regression, yea, but we could reasonably make the case that is an acceptable breaking change and that this will get fixed when #36147 gets fixed.

@jeffhandley
Copy link
Member

Thanks, @pgovind; that's really helpful info. At this stage in 6.0, I'd prefer to revert, get back to the preexisting behavior, and try again in 7.0.

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Sep 21, 2021
@stephentoub
Copy link
Member

stephentoub commented Sep 21, 2021

we could reasonably make the case that is an acceptable breaking change

FWIW, I don't think it's defensible that these don't all produce the same result:

using System.Text.RegularExpressions;

Console.WriteLine(new Regex("\u0130", RegexOptions.IgnoreCase).IsMatch("\u0130"));
Console.WriteLine(new Regex("[\u012F-\u0130]", RegexOptions.IgnoreCase).IsMatch("\u0130"));
Console.WriteLine(new Regex("[\u012F\u0130]", RegexOptions.IgnoreCase).IsMatch("\u0130"));

@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Sep 23, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Nov 3, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.