-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Regex isn't factoring target culture into lowercasing of ranges #36147
Comments
Tagging subscribers to this area: @eerhardt |
I believe you are right if the calling code is not filtering or special-casing Turkish and similar cultures (e.g. Azeri) |
Thanks. From talking offline with @GrabYourPitchforks, it seems like an easy-ish fix would be to add a similar table for (That said, with #36149, it's possible there's something wrong with the table or the logic around it.) |
Yeah. Steve and I discussed this a bit offline, and I looked through the CLDR charts at https://github.com/unicode-org/cldr/tree/master/common/transforms. Look specifically at files ending in -upper.xml and -lower.xml. tr and az special-case the dotted and dotless 'i', as Tarek mentioned earlier. el and lt have some special-casing regarding punctuation. When converting lowercase Greek characters to uppercase, diacritics are removed. That is, the grapheme which consists of the two scalar values (lower_greek + trailing_diacritic) maps instead to to the grapheme consisting of the single scalar value (upper_greek). But I don't really think we need to worry about this at the Regex level. So if we wanted to keep a "global" table around as an implementation detail of Regex, we could. It'd work for all cultures except for tr and az. For those cultures, we'd special-case the dotted and dotless 'i'. Finally, if we wanted to compare chars or strings for case-insensitive equality, it would be better to use a case folding algorithm rather than toUpper or toLower. That way, 'σ', 'Σ', and 'ς' will all compare as equal; as will 'ß' and 'ẞ'. There's an open issue regarding this at #20674. But until that comes along toLower or ToUpper is the best we've got for now. |
I have validated that #67184 will indeed close this one. |
e.g.
In Turkish,
I
lowercases toı
(\u0131
), so the above repro should print out true. But whereasRegex
is using the target culture when dealing with individual characters in a set:runtime/src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCharClass.cs
Lines 551 to 556 in fd82afe
when it instead has a range with multiple characters, it delegates to this AddLowercaseRange function:
runtime/src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCharClass.cs
Line 569 in fd82afe
which doesn't factor in the target culture into its decision, instead using a precomputed table:
runtime/src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCharClass.cs
Line 301 in fd82afe
@tarekgh, @GrabYourPitchforks, am I correct that such a table couldn't possibly be right, given that different cultures case differently?
Note that if the above repro is instead changed to spell out the whole range of uppercase letters:
it then correctly prints
true
.cc: @eerhardt, @pgovind
The text was updated successfully, but these errors were encountered: