Case conversion test fails on Alpine Linux #127
Comments
Is this our bug or a bug in Alpine Linux? |
Unclear and I'm not sure how to figure it out, so I figured I'd open an issue here to see if anyone had any ideas. |
What character are they disagreeing on, and what answer is utf8proc giving vs |
The error message is saying
|
I'm confused, it looks like the error message is saying that |
Sure enough:
#include <stdio.h>
#include <inttypes.h>
#include <wctype.h>
#include "utf8proc.h"
int main(int argc, char** argv) {
    int32_t u = utf8proc_toupper(0x00df);
    int32_t w = towupper(0x00df);
    printf("utf8proc_toupper: %x\ntowupper: %x\n", u, w);
    return 0;
}
Output on Ubuntu:
Output on Alpine:
And yet,
julia> uppercase(Char(0xdf))
'ß': Unicode U+00df (category Ll: Letter, lowercase)
on both systems. |
So I guess it's Alpine's fault for returning a bogus value from |
Since this isn't a utf8proc issue, I'll close this. Thanks for humoring me, Steven. 😉 |
Unless the uppercase mapping of U+00df changed in Unicode 10 (since utf8proc currently uses the Unicode 9 tables)? This U+00df page says that U+1E9E is a "nonstandard uppercase," though. |
cc @richfelker |
Regarding case mappings, this is intentional, not a bug: For wcwidth, I'd have to see what the mismatches are. |
I agree that there is an argument for supporting the nonstandard uppercase form here. |
Whether "ẞ" or "SS" is preferred is subject to cultural considerations, but the C locale system cannot represent the latter mapping. "ß" is obviously not a correct uppercase form for "ß". It's been a while since I delved into the Unicode stability policy, but my understanding is that they can't (by their own policy) add new case mappings for characters that previously lacked them; even if that's wrong, they may want to avoid adding a nominal case mapping to a single character when the mapping to a sequence "SS" may be preferred in some cultures. I don't think these considerations detract from mapping to "ẞ" being the right thing to do within the limitations of the C locale framework. |
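To illustrate why the C interface cannot express the "SS" mapping: `towupper` takes one `wint_t` and returns one `wint_t`, so a one-to-many mapping has nowhere to go. A minimal sketch of what a sequence-returning helper could look like follows. The name `full_upper_sketch` is hypothetical and is not part of utf8proc, musl, or glibc; it hard-codes only the single ß case and defers to `towupper` for everything else.

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper (not a real libc or utf8proc function).
 * Writes the uppercase form of c into out, which must have room for
 * two code points, and returns how many code points were written. */
size_t full_upper_sketch(wint_t c, wchar_t out[2])
{
    if (c == 0x00DF) {
        /* U+00DF: the full (SpecialCasing) uppercase is the sequence "SS",
         * which a single wint_t return value cannot carry. */
        out[0] = L'S';
        out[1] = L'S';
        return 2;
    }
    /* Everything else: fall back to the one-to-one libc mapping. */
    out[0] = (wchar_t)towupper(c);
    return 1;
}
```

A caller would pass a two-element `wchar_t` buffer and use the returned length; whether `towupper(0x00DF)` itself yields U+00DF or U+1E9E still depends on the libc, as discussed above.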
Regarding the width mismatches, I wonder if that is due to the treatment of East Asian Neutral characters? See #83 |
That thread (#83) is a mess. If there are people who want to solve the problem of whether certain scripts (or some characters from certain scripts) should be treated as wcwidth=2, there needs to be an organized effort, outside of a single software project like this, involving actual users/experts of the affected scripts, not pulling values out of some random font file (unifont), and there should be interest from key implementors in supporting the outcome before the process begins. Until then, musl (and afaik, also glibc) take a simple approach and assign width=1 to everything except to characters that were explicitly wide in legacy CJK charsets. |
You assign width=1 to combining characters? |
Sorry, I was not sufficiently precise. Of course nonspacing combining characters (Mn) and certain other nonspacing (most of Cf) characters are wcwidth=0, and control characters (nonprintable) are wcwidth=-1. |
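For readers following along, the policy described in the last few comments can be sketched roughly as below. This is only an illustration of the decision order (controls, then nonspacing, then legacy-wide, then everything else). The function name `width_sketch` and the handful of ranges are assumptions picked for the example; they are not musl's or glibc's actual tables, which are generated from the Unicode data files.

```c
#include <stdint.h>

/* Illustrative only: a tiny hand-picked sample of ranges, not a real table. */
int width_sketch(uint32_t cp)
{
    if (cp == 0)
        return 0;                       /* the null character has width 0 */
    if (cp < 0x20 || (cp >= 0x7F && cp < 0xA0))
        return -1;                      /* C0/C1 controls: nonprintable */
    if (cp >= 0x0300 && cp <= 0x036F)
        return 0;                       /* combining diacritical marks (Mn) */
    if (cp >= 0x200B && cp <= 0x200F)
        return 0;                       /* zero-width and directional marks (Cf) */
    if (cp >= 0x4E00 && cp <= 0x9FFF)
        return 2;                       /* CJK unified ideographs */
    if (cp >= 0xAC00 && cp <= 0xD7A3)
        return 2;                       /* Hangul syllables */
    return 1;                           /* everything else, assigned or not */
}
```

Note that the final `return 1` is exactly the default under discussion: anything not explicitly listed, including unassigned code points, falls through to width 1 in this sketch.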
I assume you mean "East Asian Wide" characters in UAX#11? These aren't just legacy charsets — emoji were changed to wide in Unicode 9 IIRC. |
musl's definitions are derived programmatically from EastAsianWidth.txt from Unicode 10.0, and I don't see any emoji marked as wide in it. Aside from actual ideographic characters, the only characters I'm aware of which are marked full/wide are the ones present in legacy charsets. |
https://www.unicode.org/reports/tr11/tr11-31.html#ED4 says that characters with the property |
Well it says they're classified as such, and in fact the ones with |
Indeed, the width mismatches on Alpine stem from Alpine treating a lot of things as I don't believe it's the case with glibc though; I get no mismatches running |
@ararslan, can you give an example of a character we assign width 0 where musl gives |
The full list of mismatches, all 340,000+ lines, is here: https://gist.github.com/ararslan/c7dfbfb0f9dff42940a394c79be0afe3 Taking the first few entries from there:
It looks like musl assigns width 1 to unassigned code points. |
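In case it helps anyone reproduce a mismatch count like the one in the gist, here is a rough sketch of a comparison loop. It is not the utf8proc test code. The `width_fn` type, `count_width_mismatches`, and the two toy width functions are all hypothetical names for this example; a real comparison would presumably pass thin wrappers around the system `wcwidth` and `utf8proc_charwidth` instead.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical signature for "a function that maps a code point to a width". */
typedef int (*width_fn)(uint32_t);

/* Counts code points in [lo, hi] on which the two functions disagree.
 * Surrogates (U+D800..U+DFFF) are skipped because they are not scalar values. */
size_t count_width_mismatches(width_fn a, width_fn b, uint32_t lo, uint32_t hi)
{
    size_t n = 0;
    for (uint32_t cp = lo; cp <= hi && cp <= 0x10FFFF; cp++) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            continue;
        if (a(cp) != b(cp))
            n++;
    }
    return n;
}

/* Toy stand-ins so the sketch compiles on its own (assumptions, not real tables). */
int all_ones(uint32_t cp) { (void)cp; return 1; }
int ascii_controls_negative(uint32_t cp) { return cp < 0x20 ? -1 : 1; }
```

With the two toy functions, the only disagreements in the ASCII range are the 32 C0 control characters, which makes the loop easy to sanity-check before plugging in real width functions.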
This is an interesting example where musl gives 0 and we give 2:
|
U+00ad is a soft hyphen, which is an interesting case. In some contexts it is used as a hyphenation hint and is not displayed, and in other cases it is displayed. Many terminal environments do display it, but this is not required. U+0380 and several of the other characters are unassigned code points. It's not at all obvious what width we should use for these. I suppose, from a probabilistic standpoint, an unassigned code point is more likely to be used for a width-1 character (e.g. a private encoding like Conscript) than a width-0 character. Also, the replacement character U+fffd has width 1. U+0601 certainly doesn't look like zero width, and isn't rendered as zero width by any font that I have (e.g. ab). |
I don't know the situation with U+0601. It's probably a case of needing a special-casing. Unicode class Cf is a mess of inconsistency, and while most of them are nonspacing marks or printable formatting controls, apparently some are spacing characters too. U+00AD is already handled specially here (by historical practice it's spacing in charcell terminals, which is what wcwidth is accounting for) and U+0601 probably should be too. I don't know what evidence there is for treating it as wcwidth=2 though. That brings us back to the need for some organized review effort. |
Note that there already is an "organized review effort, outside of any single software project", sponsored by the GNU project, carried out by both language and typography experts, to identify standards-conformant terminal-compatible font metrics and glyphs for the entirety of Unicode. The result is called GNU Unifont, and it is updated every time Unicode is updated. The problem is not a lack of data or a lack of review, but rather it is the xkcd standards problem of getting everyone, up and down the stack, to agree on which data to use. |
I was not aware of any such aspect to the GNU Unifont project, and wasn't even aware that it's still maintained. If it's really trying to act as a standards process for character-cell metrics, that sounds great, but there seems to be a serious lack of publicity around it. I can't even find anything supporting that claim on their website. Last I looked at it, the glyphs for many scripts were not actually usable for writing using them, and many were double-width just because it turned out to be easier to draw something nice looking in 16x16 than 8x16 for certain characters. |
Their release notes make it pretty clear that Unifont is actively maintained (and is promptly updated every time Unicode is updated); I'm not sure why you would think otherwise. It is targeted especially at low-resolution displays, it's true, but that is precisely the situation most appropriate for terminals (which often use the minimum readable font size). As for trying to act as a "standards process", now you're raising the bar. That requires buy-in from libc maintainers, who currently seem to be rejecting out-of-hand any attempt to go beyond EastAsianWidth.txt for glyph metrics. |
The above comment that started this: was specifically about acting as a "standards process" with buy-in from implementors (wcwidth implementors being libc, also terminals, screen/tmux, etc.). I am not rejecting attempts to define a better wcwidth out-of-hand. I'm rejecting attempts to claim that unilateral decisions by a party with almost no stake and almost no input from users of the affected scripts should be a basis for our decisions. |
I'm not sure this is an accurate description of the Unifont developers. Anyway, though we don't use it as the sole basis of our decisions, it seems like a more reasonable starting point than "all characters have width 1", which employs no input at all from users of the affected scripts. Reasonable people can disagree about this, of course, but I don't think it's completely crazy for us to incorporate data from Unifont in determining charwidth for cases that are ambiguous in the Unicode standard. |
By the way, regarding emoji being width 2, the lack of recognition of this prior to Unicode 9 led to this amusing issue in Julia that directly led to our attempt to get better charwidth tables: JuliaLang/julia#3721 |
Output from make check:
There are also an extraordinary number of mismatches with the system wcwidth, though the width tests still pass.