Regex isn't factoring target culture into lowercasing of ranges #36147

stephentoub · 2020-05-08T20:58:03Z

e.g.

using System;
using System.Globalization;
using System.Text.RegularExpressions;
class Program
{
    static void Main()
    {
        CultureInfo.CurrentCulture = new CultureInfo("tr-TR");
        var r = new Regex(@"[A-Z]", RegexOptions.IgnoreCase);
        Console.WriteLine(r.IsMatch("\u0131")); // should print true, but prints false
    }
}

In Turkish, I lowercases to ı (\u0131), so the above repro should print true. But whereas Regex uses the target culture when dealing with individual characters in a set:

runtime/src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCharClass.cs

Lines 551 to 556 in fd82afe

    
    SingleRange range = rangeList[i];
    if (range.First == range.Last)
    {
        char lower = culture.TextInfo.ToLower(range.First);
        rangeList[i] = new SingleRange(lower, lower);
    }

when it instead has a range with multiple characters, it delegates to this AddLowercaseRange function:

runtime/src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCharClass.cs

Line 569 in fd82afe

private void AddLowercaseRange(char chMin, char chMax)

which doesn't factor the target culture into its decision, instead using a precomputed table:

runtime/src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCharClass.cs

Line 301 in fd82afe

private static readonly LowerCaseMapping[] s_lcTable = new LowerCaseMapping[]

@tarekgh, @GrabYourPitchforks, am I correct that such a table couldn't possibly be right, given that different cultures case differently?

Note that if the above repro is instead changed to spell out the whole range of uppercase letters:

using System;
using System.Globalization;
using System.Text.RegularExpressions;
class Program
{
    static void Main()
    {
        CultureInfo.CurrentCulture = new CultureInfo("tr-TR");
        var r = new Regex(@"[ABCDEFGHIJKLMNOPQRSTUVWXYZ]", RegexOptions.IgnoreCase);
        Console.WriteLine(r.IsMatch("\u0131")); // prints true
    }
}

it then correctly prints true.
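To make the difference concrete, here is a minimal sketch (not the Regex implementation) of lowercasing a range one character at a time with the target culture, the way the single-character path already does. The names `RangeLowercasingSketch` and `LowercaseRange` are hypothetical, and the Turkish result assumes the runtime has culture data available (that is, it is not running in globalization-invariant mode).

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class RangeLowercasingSketch
{
    // Hypothetical helper, for illustration only: lowercases every character
    // in [chMin, chMax] using the given culture, one char at a time.
    public static HashSet<char> LowercaseRange(char chMin, char chMax, CultureInfo culture)
    {
        var result = new HashSet<char>();
        // Loop over int so that chMax == char.MaxValue cannot overflow.
        for (int c = chMin; c <= chMax; c++)
        {
            result.Add(culture.TextInfo.ToLower((char)c));
        }
        return result;
    }

    public static void Main()
    {
        // Assumption: "tr-TR" culture data is available on this machine.
        HashSet<char> turkish = LowercaseRange('A', 'Z', new CultureInfo("tr-TR"));
        HashSet<char> invariant = LowercaseRange('A', 'Z', CultureInfo.InvariantCulture);

        // With Turkish casing rules, 'I' is expected to map to U+0131 (dotless i)
        // rather than to 'i'; the invariant set contains 'i' instead.
        Console.WriteLine(turkish.Contains('\u0131'));
        Console.WriteLine(invariant.Contains('i'));
    }
}
```

Walking a range char by char would be too slow for large ranges, which is presumably why a precomputed table exists; the sketch only shows the result a culture-aware range path would need to produce.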

cc: @eerhardt, @pgovind


ghost · 2020-05-08T20:58:05Z

Tagging subscribers to this area: @eerhardt
Notify danmosemsft if you want to be subscribed.

tarekgh · 2020-05-08T21:16:39Z

am I correct that such a table couldn't possibly be right, given that different cultures case differently?

I believe you are right if the calling code is not filtering or special-casing Turkish and similar cultures (e.g., Azeri).

stephentoub · 2020-05-08T21:42:42Z

Thanks. From talking offline with @GrabYourPitchforks, it seems like an easy-ish fix would be to add a similar table for tr-* and az-*, and special-case those cultures to use their specific table.

(That said, with #36149, it's possible there's something wrong with the table or the logic around it.)

GrabYourPitchforks · 2020-05-08T22:11:22Z

Yeah. Steve and I discussed this a bit offline, and I looked through the CLDR charts at https://github.com/unicode-org/cldr/tree/master/common/transforms. Look specifically at files ending in -upper.xml and -lower.xml.

tr and az special-case the dotted and dotless 'i', as Tarek mentioned earlier.

el and lt have some special-casing regarding punctuation. When converting lowercase Greek characters to uppercase, diacritics are removed. That is, the grapheme which consists of the two scalar values (lower_greek + trailing_diacritic) maps instead to the grapheme consisting of the single scalar value (upper_greek). But I don't really think we need to worry about this at the Regex level.

https://github.com/unicode-org/cldr/blob/2dd06669d833823e26872f249aa304bc9d9d2a90/common/transforms/el-Upper.xml#L16-L19

So if we wanted to keep a "global" table around as an implementation detail of Regex, we could. It'd work for all cultures except for tr and az. For those cultures, we'd special-case the dotted and dotless 'i'.
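A rough sketch of that shape, under stated assumptions: `char.ToLowerInvariant` stands in for the "global" table, cultures are identified by name so the example runs regardless of installed culture data, and the names `CultureAwareLowercaseSketch`, `IsTurkishOrAzeri`, and `ToLower` are hypothetical. This is not how Regex is implemented.

```csharp
using System;

public static class CultureAwareLowercaseSketch
{
    // Hypothetical check: true for "tr", "az", and any "tr-*" / "az-*" culture name.
    public static bool IsTurkishOrAzeri(string cultureName)
    {
        return cultureName.Equals("tr", StringComparison.OrdinalIgnoreCase)
            || cultureName.Equals("az", StringComparison.OrdinalIgnoreCase)
            || cultureName.StartsWith("tr-", StringComparison.OrdinalIgnoreCase)
            || cultureName.StartsWith("az-", StringComparison.OrdinalIgnoreCase);
    }

    // Lowercases one char: special-cases the dotted and dotless 'i' for tr/az,
    // and falls back to the shared ("global") mapping for everything else.
    public static char ToLower(char c, string cultureName)
    {
        if (IsTurkishOrAzeri(cultureName))
        {
            if (c == 'I') return '\u0131';  // LATIN CAPITAL LETTER I -> dotless i
            if (c == '\u0130') return 'i';  // capital I with dot above -> i
        }
        // Assumption: invariant lowercasing stands in for the global table.
        return char.ToLowerInvariant(c);
    }

    public static void Main()
    {
        Console.WriteLine((int)ToLower('I', "tr-TR")); // code point of the Turkish result
        Console.WriteLine((int)ToLower('I', "en-US")); // code point of the default result
    }
}
```

Keeping the override to two code points means the shared table stays culture-neutral, and only the tr/az path pays for the extra check.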

Finally, if we wanted to compare chars or strings for case-insensitive equality, it would be better to use a case folding algorithm rather than ToUpper or ToLower. That way, 'σ', 'Σ', and 'ς' will all compare as equal; as will 'ß' and 'ẞ'. There's an open issue regarding this at #20674. But until that comes along, ToLower or ToUpper is the best we've got.
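As an illustration of why neither direction is a full substitute for case folding, the sketch below compares chars by lowercasing and by uppercasing. The helper names `EqualByLower` and `EqualByUpper` are hypothetical. Under Unicode's simple case mappings, lowercasing is expected to keep 'ς' and 'Σ' apart (Σ lowercases to σ, not ς), while uppercasing is expected to keep 'ß' and 'ẞ' apart (ß has no single-char uppercase); the actual output depends on the runtime's casing data.

```csharp
using System;

public static class CaseComparisonSketch
{
    // Compare two chars after lowercasing both with invariant rules.
    public static bool EqualByLower(char a, char b) =>
        char.ToLowerInvariant(a) == char.ToLowerInvariant(b);

    // Compare two chars after uppercasing both with invariant rules.
    public static bool EqualByUpper(char a, char b) =>
        char.ToUpperInvariant(a) == char.ToUpperInvariant(b);

    public static void Main()
    {
        // Final sigma (U+03C2) vs. capital sigma (U+03A3).
        Console.WriteLine(EqualByLower('\u03C2', '\u03A3'));
        Console.WriteLine(EqualByUpper('\u03C2', '\u03A3'));

        // Sharp s (U+00DF) vs. capital sharp s (U+1E9E).
        Console.WriteLine(EqualByLower('\u00DF', '\u1E9E'));
        Console.WriteLine(EqualByUpper('\u00DF', '\u1E9E'));
    }
}
```

A case folding table would map each of these groups to a single representative, so one comparison would cover both pairs.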

joperezr · 2022-04-05T23:07:00Z

I have validated that #67184 will indeed close this one.

stephentoub added the area-System.Text.RegularExpressions label May 8, 2020

Dotnet-GitSync-Bot added the untriaged label May 8, 2020

stephentoub removed the untriaged label Jun 28, 2020

stephentoub added this to the Future milestone Jun 28, 2020

pgovind mentioned this issue Sep 21, 2020

Fix incorrect handling of character range and capitalization in regex #42282

Merged

This was referenced Sep 21, 2021

Incorrect Regex matching in Turkish culture when ignoring case #58958

Closed

Inconsistent Regex matching behavior in InvariantCulture #58956

Closed

stephentoub mentioned this issue Sep 22, 2021

[API Proposal] Add cultureName constructors to GeneratedRegex #59492

Closed

stephentoub mentioned this issue Oct 31, 2021

Overhaul Regex's handling of RegexOptions.IgnoreCase #61048

Closed

stephentoub mentioned this issue Apr 4, 2022

Changing the logic for how we deal with RegexOptions.IgnoreCase matching. #67184

Merged

ghost added the in-pr label Apr 5, 2022

stephentoub closed this as completed in #67184 Apr 6, 2022

ghost removed the in-pr label Apr 6, 2022

ghost locked as resolved and limited conversation to collaborators May 6, 2022
