Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regex isn't factoring target culture into lowercasing of ranges #36147

Closed
stephentoub opened this issue May 8, 2020 · 5 comments · Fixed by #67184
Closed

Regex isn't factoring target culture into lowercasing of ranges #36147

stephentoub opened this issue May 8, 2020 · 5 comments · Fixed by #67184

Comments

@stephentoub
Copy link
Member

stephentoub commented May 8, 2020

e.g.

using System;
using System.Globalization;
using System.Text.RegularExpressions;
class Program
{
    static void Main()
    {
        CultureInfo.CurrentCulture = new CultureInfo("tr-TR");
        var r = new Regex(@"[A-Z]", RegexOptions.IgnoreCase);
        Console.WriteLine(r.IsMatch("\u0131")); // should print true, but prints false
    }
}

In Turkish, I lowercases to ı (\u0131), so the above repro should print out true. But whereas Regex is using the target culture when dealing with individual characters in a set:

SingleRange range = rangeList[i];
if (range.First == range.Last)
{
char lower = culture.TextInfo.ToLower(range.First);
rangeList[i] = new SingleRange(lower, lower);
}

when it instead has a range with multiple characters, it delegates to this AddLowercaseRange function:

which doesn't factor in the target culture into its decision, instead using a precomputed table:
private static readonly LowerCaseMapping[] s_lcTable = new LowerCaseMapping[]

@tarekgh, @GrabYourPitchforks, am I correct that such a table couldn't possibly be right, given that different cultures case differently?

Note that if the above repro is instead changed to spell out the whole range of uppercase letters:

using System;
using System.Globalization;
using System.Text.RegularExpressions;
class Program
{
    static void Main()
    {
        CultureInfo.CurrentCulture = new CultureInfo("tr-TR");
        var r = new Regex(@"[ABCDEFGHIJKLMNOPQRSTUVWXYZ]", RegexOptions.IgnoreCase);
        Console.WriteLine(r.IsMatch("\u0131")); // prints true
    }
}

it then correctly prints true.

cc: @eerhardt, @pgovind

@ghost
Copy link

ghost commented May 8, 2020

Tagging subscribers to this area: @eerhardt
Notify danmosemsft if you want to be subscribed.

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label May 8, 2020
@tarekgh
Copy link
Member

tarekgh commented May 8, 2020

am I correct that such a table couldn't possibly be right, given that different cultures case differently?

I believe you are right if the calling code is not filtering or special-casing Turkish and similar cultures (e.g. Azeri)

@stephentoub
Copy link
Member Author

Thanks. From talking offline with @GrabYourPitchforks, it seems like an easy-ish fix would be to add a similar table for tr-* and az-*, and special-case those cultures to use their specific table.

(That said, with #36149, it's possible there's something wrong with the table or the logic around it.)

@GrabYourPitchforks
Copy link
Member

Yeah. Steve and I discussed this a bit offline, and I looked through the CLDR charts at https://github.com/unicode-org/cldr/tree/master/common/transforms. Look specifically at files ending in -upper.xml and -lower.xml.

tr and az special-case the dotted and dotless 'i', as Tarek mentioned earlier.

el and lt have some special-casing regarding punctuation. When converting lowercase Greek characters to uppercase, diacritics are removed. That is, the grapheme which consists of the two scalar values (lower_greek + trailing_diacritic) maps instead to to the grapheme consisting of the single scalar value (upper_greek). But I don't really think we need to worry about this at the Regex level.

https://github.com/unicode-org/cldr/blob/2dd06669d833823e26872f249aa304bc9d9d2a90/common/transforms/el-Upper.xml#L16-L19

So if we wanted to keep a "global" table around as an implementation detail of Regex, we could. It'd work for all cultures except for tr and az. For those cultures, we'd special-case the dotted and dotless 'i'.

Finally, if we wanted to compare chars or strings for case-insensitive equality, it would be better to use a case folding algorithm rather than toUpper or toLower. That way, 'σ', 'Σ', and 'ς' will all compare as equal; as will 'ß' and 'ẞ'. There's an open issue regarding this at #20674. But until that comes along toLower or ToUpper is the best we've got for now.

@joperezr
Copy link
Member

joperezr commented Apr 5, 2022

I have validated that #67184 will indeed close this one.

@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Apr 6, 2022
@ghost ghost locked as resolved and limited conversation to collaborators May 6, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants