Missing titlecase and/or uppercase mappings? #257

djudd · 2022-07-22T11:23:03Z

Describe the bug
I'm not sure whether this is expected for a reason I don't understand or indicative of a larger problem, but:

ß doesn't appear to have titlecase or uppercase mappings, although it should (IIUC) map to Ss and SS respectively. It does casefold to ss as expected.

To Reproduce
Steps to reproduce the behavior:

pry(main)> "ß".upcase
=> "SS"
pry(main)> "ß".localize.upcase.to_s
=> "ß"
pry(main)> "ß".localize.titlecase.to_s
=> "ß"
pry(main)> "ß".localize.casefold.to_s
=> "ss"

(NB: It doesn't appear to matter what locale I provide; I tried localize(:de) and localize(:zh) with identical results. And similar mappings aren't always missing, e.g. "ǆ" is handled fine.)

Expected behavior
I expected "ß".localize.upcase.to_s to produce SS (as Ruby's built-in upcase method does) and "ß".localize.titlecase.to_s to produce Ss.

Screenshots
n/a

Environment

pry(main)> TwitterCldr::VERSION
=> "6.11.3"
pry(main)> TwitterCldr.locale
=> :en
pry(main)> RUBY_VERSION
=> "2.7.4"

Additional context
none


camertron · 2022-07-24T21:48:33Z

Hey @djudd, thanks for bringing this to my attention. I'm also a bit surprised by the current behavior.

After an investigation, I've uncovered several interesting things of note:

1. There is no simple uppercase or titlecase mapping for "ß" in the Unicode Character Database (UCD), which is what TwitterCLDR uses to apply case mappings. Apparently the simple mappings in the UCD are only 1:1 in order to maintain backwards compatibility with the large number of parsers that were written to expect only single-character replacements.
2. There is another file in the UCD called SpecialCasing.txt that, until now, I did not know existed. This is where the uppercase mapping from "ß" to "SS" and the titlecase mapping from "ß" to "Ss" come from (as well as a number of other mappings that are either locale-specific or that require additional context to function).
3. It used to be that a capital Eszett didn't exist. German had 30 lowercase letters and 29 uppercase ones. However, a capital Eszett (U+1E9E, "ẞ") was added to the German alphabet in 2017 after a century of debate.
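To make the second point concrete, here is a minimal sketch of reading one SpecialCasing.txt entry. It assumes the file's documented field order (code point; lowercase; titlecase; uppercase; optional condition list; comment). The `SpecialCasingEntry` struct and `parse_special_casing_line` helper are hypothetical names made up for illustration and are not part of TwitterCLDR.

```ruby
# Hypothetical helper, not TwitterCLDR's API. Parses a single data line
# from SpecialCasing.txt into its multi-character case mappings.
SpecialCasingEntry = Struct.new(:char, :lower, :title, :upper, :conditions)

def parse_special_casing_line(line)
  # Drop the trailing "# comment" and skip blank or comment-only lines.
  data = line.sub(/#.*/, "").strip
  return nil if data.empty?

  code, lower, title, upper, conditions = data.split(";").map(&:strip)

  # Each field is a space-separated list of hex code points.
  decode = ->(field) { field.split.map(&:hex).pack("U*") }

  SpecialCasingEntry.new(
    decode.(code),
    decode.(lower),
    decode.(title),
    decode.(upper),
    (conditions || "").split # e.g. ["Final_Sigma"]; empty when unconditional
  )
end

entry = parse_special_casing_line(
  "00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S"
)
entry.upper # => "SS"
entry.title # => "Ss"
```

Entries with a non-empty condition list (a locale such as `tr`, or a context such as `Final_Sigma`) need extra handling that this sketch does not attempt.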

So what's the right thing to do here? The latest version of Unicode only provides a mapping from capital Eszett -> lowercase Eszett, and even SpecialCasing.txt only maps to "SS" and "Ss" with no mention of the capital Eszett. The Unicode standard section 3.13 says only that:

Examples of case tailorings which are not covered by data in SpecialCasing.txt include ... Uppercasing of U+00DF “ß” latin small letter sharp s to U+1E9E latin capital letter sharp s.

So... thanks Unicode? I haven't been able to find any other casing data in Unicode or CLDR. It appears the "correct" thing to do is to map to "SS" and "Ss," even though I have to think mapping lowercase Eszett to uppercase Eszett is probably more correct. Maybe TwitterCLDR could do that specifically for German from Germany, since Swiss German and other dialects don't use the Eszett at all.

In any case, I'll implement the rules in SpecialCasing.txt.
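
As a rough sketch of what that could look like (not the actual TwitterCLDR change), the idea is to consult the unconditional SpecialCasing.txt mappings first and fall back to the 1:1 mapping otherwise. All names below are hypothetical, the table holds only two sample entries, and the 1:1 mapping is approximated with Ruby's built-in `String#upcase` by rejecting any result longer than one character.

```ruby
# Hypothetical sketch; none of these names come from TwitterCLDR.

# Two unconditional uppercase entries from SpecialCasing.txt. A real
# implementation would load the whole file instead of hard-coding them.
SPECIAL_UPPER = {
  "ß" => "SS", # U+00DF LATIN SMALL LETTER SHARP S
  "ﬁ" => "FI"  # U+FB01 LATIN SMALL LIGATURE FI
}.freeze

# Stand-in for a simple (1:1) mapping: keep the character unchanged
# whenever Ruby's full mapping would expand it to several characters.
def simple_upcase_char(ch)
  up = ch.upcase
  up.length == 1 ? up : ch
end

# Current behavior described in this issue: 1:1 mappings only.
def upcase_simple_only(str)
  str.each_char.map { |ch| simple_upcase_char(ch) }.join
end

# Proposed behavior: special casing first, then the 1:1 fallback.
def upcase_with_special_casing(str)
  str.each_char.map { |ch| SPECIAL_UPPER.fetch(ch) { simple_upcase_char(ch) } }.join
end

upcase_simple_only("straße")         # => "STRAßE"
upcase_with_special_casing("straße") # => "STRASSE"
```

Locale-specific and context-sensitive entries are left out of this sketch and would need a condition check before the table lookup.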
