Missing titlecase and/or uppercase mappings? #257

Open
djudd opened this issue Jul 22, 2022 · 1 comment
djudd commented Jul 22, 2022

Describe the bug
I'm not sure whether this is expected for a reason I don't understand or indicative of a larger problem, but:

ß doesn't appear to have titlecase or uppercase mappings, although it should (IIUC) map to Ss and SS respectively. It does casefold to ss as expected.

To Reproduce
Steps to reproduce the behavior:

pry(main)> "ß".upcase
=> "SS"
pry(main)> "ß".localize.upcase.to_s
=> "ß"
pry(main)> "ß".localize.titlecase.to_s
=> "ß"
pry(main)> "ß".localize.casefold.to_s
=> "ss"

(NB: It doesn't appear to matter what locale I provide; I tried localize(:de) and localize(:zh) with identical results. And similar mappings aren't always missing, e.g. "dž" is handled fine.)

Expected behavior
I expected "ß".localize.upcase.to_s to produce SS (as Ruby's built-in upcase method does) and "ß".localize.titlecase.to_s to produce Ss.

Screenshots
n/a

Environment

pry(main)> TwitterCldr::VERSION
=> "6.11.3"
pry(main)> TwitterCldr.locale
=> :en
pry(main)> RUBY_VERSION
=> "2.7.4"

Additional context
none

camertron (Collaborator) commented

Hey @djudd, thanks for bringing this to my attention. I'm also a bit surprised by the current behavior.

After an investigation, I've uncovered several interesting things of note:

  1. There is no simple (single-character) uppercase or titlecase mapping for "ß" in the Unicode Character Database (UCD) data that TwitterCLDR uses to apply case mappings. Apparently those mappings are strictly 1:1 in order to maintain backwards compatibility with the large number of parsers that were written to expect only single-character replacements.
  2. There is another file in the UCD called SpecialCasing.txt that, until now, I did not know existed. This is where the uppercase mapping from "ß" to "SS" and the titlecase mapping from "ß" to "Ss" come from (as well as a number of other mappings that are either locale-specific or that require additional context to function).
  3. It used to be that a capital Eszett didn't exist. German had 30 lowercase letters and 29 uppercase ones. However, a capital Eszett (\u1E9E) was officially added to the German alphabet in 2017 after a century of debate.

So what's the right thing to do here? The latest version of Unicode only provides a mapping from capital Eszett -> lowercase Eszett, and even SpecialCasing.txt only maps to "SS" and "Ss" with no mention of the capital Eszett. The Unicode standard section 3.13 says only that:

Examples of case tailorings which are not covered by data in SpecialCasing.txt include ... Uppercasing of U+00DF “ß” latin small letter sharp s to U+1E9E latin capital letter sharp s.

So... thanks Unicode? I haven't been able to find any other casing data in Unicode or CLDR. It appears the "correct" thing to do is to map to "SS" and "Ss," even though I have to think lower to uppercase Eszett is probably more correct. Maybe TwitterCLDR could do that specifically for German from Germany, since Swiss German and other dialects don't use the Eszett at all.
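For illustration only, the locale-specific tailoring floated above could look something like the sketch below. The method name and the `:"de-DE"` locale identifier are hypothetical and are not part of TwitterCLDR's API; the fallback to "SS" mirrors the default mapping described in this thread.

```ruby
SMALL_ESZETT   = "\u00DF" # ß
CAPITAL_ESZETT = "\u1E9E" # ẞ

# Hypothetical list of locales that opt in to the capital Eszett tailoring.
CAPITAL_ESZETT_LOCALES = [:"de-DE"].freeze

# Uppercases a string, mapping ß to ẞ only for opted-in locales and
# to "SS" everywhere else. The rest of the string goes through Ruby's
# built-in String#upcase.
def upcase_with_eszett_tailoring(str, locale)
  replacement = CAPITAL_ESZETT_LOCALES.include?(locale) ? CAPITAL_ESZETT : "SS"
  str.gsub(SMALL_ESZETT, replacement).upcase
end
```

With this sketch, `upcase_with_eszett_tailoring("straße", :"de-DE")` keeps a single-character Eszett, while any other locale gets the "SS" expansion.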

In any case, I'll implement the rules in SpecialCasing.txt.
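A minimal sketch of what applying the SpecialCasing.txt rules might involve, assuming the file's documented line format (`<code>; <lower>; <title>; <upper>; (<condition_list>;)? # <comment>`). This is not TwitterCLDR's implementation: the method names are made up for illustration, only a few lines of the file are embedded, entries with a condition (locale or context) are skipped, and `upcase(:ascii)` / `downcase(:ascii)` stand in for the library's existing 1:1 mappings.

```ruby
# A few lines in the SpecialCasing.txt format, embedded so the sketch is
# self-contained. The real file is much longer.
SPECIAL_CASING_EXCERPT = <<~DATA
  00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
  FB00; FB00; 0046 0066; 0046 0046; # LATIN SMALL LIGATURE FF
  03A3; 03C2; 03A3; 03A3; Final_Sigma; # GREEK CAPITAL LETTER SIGMA
  0069; 0069; 0130; 0130; tr; # LATIN SMALL LETTER I
DATA

SpecialCase = Struct.new(:lower, :title, :upper)

# Converts a field such as "0053 0073" into the string "Ss".
def hex_to_str(field)
  field.split.map { |cp| cp.to_i(16) }.pack("U*")
end

# Builds a { character => SpecialCase } table from the unconditional entries.
def parse_special_casing(text)
  text.each_line.with_object({}) do |line, table|
    data = line.split("#", 2).first.strip
    next if data.empty?

    code, lower, title, upper, condition = data.split(";").map(&:strip)
    # Conditional entries (Final_Sigma, tr, ...) need context or a locale.
    next unless condition.nil? || condition.empty?

    table[hex_to_str(code)] =
      SpecialCase.new(hex_to_str(lower), hex_to_str(title), hex_to_str(upper))
  end
end

# Uppercases each character, preferring a special (1:n) mapping when present.
def special_upcase(str, table)
  str.each_char.map { |ch| table.key?(ch) ? table[ch].upper : ch.upcase(:ascii) }.join
end

# Titlecases a single word: title mapping for the first character,
# lowercase mappings for the rest.
def special_titlecase(word, table)
  return word if word.empty?

  first = word[0]
  head = table.key?(first) ? table[first].title : first.upcase(:ascii)
  tail = word[1..-1].each_char.map do |ch|
    table.key?(ch) ? table[ch].lower : ch.downcase(:ascii)
  end
  head + tail.join
end
```

Usage would be along the lines of `table = parse_special_casing(SPECIAL_CASING_EXCERPT)` followed by `special_upcase("straße", table)` or `special_titlecase("ß", table)`. Handling the conditional entries would need the surrounding characters and the locale, which this sketch deliberately leaves out.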
