set east asian neutral width to 1 #83
cc @jiahao … I thought we discussed the neutral-width case at some point?
I made a small change so that it only forces the width to 1 if it already had a width. As a result, all tests pass again, and the characters I had issues with do now return width 1 as expected.
@joshuarubin, I don't see where UAX#11 says that neutral characters have width 1. (It only says they map to halfwidth for legacy encodings. For rendering, it says "An implementation might therefore elect to treat them as ambiguous even though they are classified as neutral here.") In #27, we elected to use the Unifont width for "neutral" characters, since this seems to be font dependent. Apparently, the characters you mention are width 2 in Unifont?
Yes, it seems that Unifont treats them as width 2. However, this decision is causing severe rendering problems on my Mac, where Terminal.app and iTerm2.app display them as width 1. If there were some way for me to detect these characters externally and correct them, that would be sufficient too, but as far as I can tell there is nothing to distinguish them.
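For reference, the East Asian Width class of the characters in question can be checked independently of utf8proc. The following is a minimal sketch using Python's standard `unicodedata` module; it is not part of this patch, the helper name `east_asian_classes` is made up for illustration, and the result depends on the Unicode version bundled with the interpreter.

```python
import unicodedata

# Kannada letters named in this pull request's description.
CODEPOINTS = [0x0CA8, 0x0CB5, 0x0CB9, 0x0C97, 0x0CA6, 0x0CB0]


def east_asian_classes(codepoints):
    """Map each code point to its UAX #11 East_Asian_Width class."""
    return {cp: unicodedata.east_asian_width(chr(cp)) for cp in codepoints}


if __name__ == "__main__":
    for cp, cls in east_asian_classes(CODEPOINTS).items():
        print(f"U+{cp:04X} {chr(cp)} -> {cls}")
```

The class alone says nothing about how a particular font draws the glyph, which is the disagreement discussed in this thread.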
Is Unifont in the clear minority among fonts here?
See also the discussion in JuliaLang/julia#3721: if the font doesn't match what the terminal thinks the charwidth is, you are going to see problems regardless of what width we return.
elseif width=="Na"|| width=="H" # narrow or half
elseif width=="Na" || width=="H" # narrow or half
CharWidths[c]=1
elseif width=="N" && haskey(CharWidths, c)
Maybe && get(CharWidths, c, 0) > 0
Yeah, I had that, but it didn't actually make a difference to the output. I can add it back, if you'd like.
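The two conditions in this review exchange can be compared in a small sketch. This is a Python approximation of the branch under review, not the actual generator script: `apply_east_asian_width`, its `require_positive` flag, and the sample table values are all hypothetical. The two variants differ only when the table holds an explicit zero-width entry for a neutral character, which would be consistent with the reply above that the generated output did not change.

```python
def apply_east_asian_width(char_widths, codepoint, eaw, require_positive=False):
    """Update char_widths in place, mimicking the branch under review.

    char_widths is a dict of code point -> width, a stand-in for a table
    pre-filled from a font. eaw is the East_Asian_Width class string.
    """
    if eaw in ("Na", "H"):  # narrow or halfwidth
        char_widths[codepoint] = 1
    elif eaw == "N":  # neutral
        if require_positive:
            # Suggested variant: only touch entries with a positive width.
            has_width = char_widths.get(codepoint, 0) > 0
        else:
            # Variant in the diff: any existing entry counts.
            has_width = codepoint in char_widths
        if has_width:
            char_widths[codepoint] = 1
    return char_widths


if __name__ == "__main__":
    # Illustrative values only: a neutral letter at width 2, and a
    # neutral character stored with an explicit width of 0.
    table = {0x0CA8: 2, 0x200B: 0}
    print(apply_east_asian_width(dict(table), 0x200B, "N"))
    print(apply_east_asian_width(dict(table), 0x200B, "N", require_positive=True))
```

With the membership check, the zero-width entry is promoted to 1; with the positive-width check it stays at 0.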
A good question, I don't know how to answer that though...
As @stevengj writes, Unicode explicitly avoids making recommendations for character widths. See JuliaLang/julia#3721 (comment) for a discussion, specifically the quote from UAX 11:
Most people unfortunately skip over this very important wording in the beginning of UAX 11.
Could you explain why you say they should be width 1? Nevertheless, the reference glyphs are not meant to be authoritative in specifying character widths and individual fonts can do whatever they want. That's the main problem - it's not going to be possible to solve this problem in general for all fonts.
(@jiahao, "rune" is Go terminology for a codepoint.)
Frankly, I agree that It also renders symbols (flags, emoji) as single width despite clearly overflowing into the next column. It is easy to override the output of If I had a way to identify the neutral width characters in utf8proc, I could just handle it on my own.
Anything I can do to help move this closer to a resolution?
I'm not sure if there is any satisfactory resolution here. No matter what we do, there will always be buggy terminals that rely on out-of-date operating-system
For example, IIRC the Unicode 9 standard reclassified emoji as fullwidth, but terminals haven't caught up.
If you want to match the OS's (probably out-of-date)
@jiahao, since neither the fonts nor UAX11 are authoritative here, maybe we should err on the side of the "informative" UAX11 suggestions rather than Unifont, on the theory that UAX11 seems more likely to match what terminals do?
iTerm has Unicode standard switching by proprietary escape code now.
I can't find that documented, do you have a link?
My concern here is that most terminals simply provide character widths that don't provide for a reasonable display of the characters in question. Even U+0ca6 above has a reference glyph that would suggest to me a fullwidth rather than halfwidth character. Since characters like these don't exist in the legacy East Asian character sets, I would think that the "default to narrow/halfwidth" behavior is a result more of neglect than an actual attempt to display these characters correctly in fixed width.
Hi, this is still a problem. After considering this further, I think it may be more palatable, rather than merging this change, to add a new function that simply returns whether a character's width is ambiguous. It should return true for characters in the private use category and for the relevant East Asian characters, otherwise false.
@joshuarubin, that seems reasonable.
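The proposed predicate can be sketched outside utf8proc to make the intent concrete. This is a hedged Python illustration, not a utf8proc API: the name `has_ambiguous_width` is hypothetical, and "the relevant East Asian characters" is assumed here to mean the East Asian Ambiguous class ("A") from UAX #11.

```python
import unicodedata


def has_ambiguous_width(ch):
    """Return True if ch is private-use or East Asian Ambiguous ("A")."""
    if unicodedata.category(ch) == "Co":  # Co = private use
        return True
    return unicodedata.east_asian_width(ch) == "A"


if __name__ == "__main__":
    for ch in ["\ue000", "\u00a1", "a", "\u0ca8"]:
        print(f"U+{ord(ch):04X}: {has_ambiguous_width(ch)}")
```

Note that under this assumption the neutral Kannada letters from the pull request description are not reported as ambiguous, so such a predicate alone would not identify them.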
Well, after taking a look at what this would require, I certainly think it would be useful to have a way to know if characters have ambiguous width. However, that would still not help the situation that this issue describes. I would like to reiterate my support for overriding Unifont in the case where the Unicode standard says a character has neutral width but Unifont renders it wider than 1 column. The primary reason for this is that other systems implementing Unicode character widths (e.g. terminals) will adhere to the spec rather than defer to any particular font choice.
Supporting documentation:
http://unicode.org/reports/tr11/#ED7
http://unicode.org/reports/tr11/#ED5
iTerm2 character width tables: https://github.com/gnachman/iTerm2/blob/master/sources/NSCharacterSet%2BiTerm.m
The fundamental problem here is that everyone along the stack needs to agree on the character width tables. There is no agreement on what those are, which is why iTerm added escape codes to switch the tables. It would have been nice not to be in this mess in the first place, but since we are, I feel like iTerm's approach is the sanest choice. Whatever you do, you're bound to break something.
I was under the impression utf8proc tried to implement Unicode 9 only? I'm not asking for support for other versions.
If your objective is "look good on terminals", I think @Keno's point is it's still going to break because most terminal software doesn't use up-to-date widths, unless you have something like iTerm.
The objective is to support Unicode 9 terminals including iTerm2 and rxvt. As more terminals will support the widths, the software will need to use them as well. There's a chicken and egg problem here. I'm trying to find ways to get support working across the board in projects like terminals, tmux, vim/neovim, etc. I've now gone and created a new project to help this endeavor, wcwidth9. While I really like
The problem is that though e.g. I do appreciate you working on this though. It's an important and non-trivial problem. Also cc @gnachman who may be interested in this discussion.
Isn't the operating system's (probably out-of-date)
Depends on whether the terminal emulator asks the OS's (or rather libc's) wcwidth or has its own character tables.
That's a very interesting idea, but it will also add to the confusion. I think some confusion while everything transitions to unicode9 is unavoidable at this point, but that the spec should be the guideline, not the font. If the terminal thinks it's width 1 and the software thinks it's width 1 but the font renders it as width 2, then that character will overflow into the next column on the display, but will not cause cascading rendering problems.
I agree that is a very useful datapoint, it's just not what I need now when implementing interfaces.
Again, I'll take that over nothing, but fear it might add long term confusion. As it is, I already need to know if a width is ambiguous (according to east asian context). I also need to know if a character is in the private use area which is ambiguous, but should not vary by east asian context. What's one more option...
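The "cascading rendering problems" mentioned above arise when the application and the terminal disagree on a width, not when the font merely draws a glyph wider than its cell. A minimal sketch of that drift follows; both width functions are hypothetical stand-ins (one mimicking a font-derived table, one mimicking a terminal that treats everything as width 1), not real utf8proc or terminal tables.

```python
def start_columns(text, width_of):
    """Return the starting column of each character under a width function."""
    columns, col = [], 0
    for ch in text:
        columns.append(col)
        col += width_of(ch)
    return columns


# Hypothetical width tables, for illustration only.
def app_width_font_based(ch):
    """Application table that reports U+0CA8 as two columns wide."""
    return 2 if ch == "\u0ca8" else 1


def terminal_width(ch):
    """Terminal that advances one column for every character."""
    return 1


if __name__ == "__main__":
    text = "a\u0ca8bc"
    print(start_columns(text, app_width_font_based))  # application's view
    print(start_columns(text, terminal_width))        # terminal's view
```

Every character after the disputed one sits at a different column in the two views, so cursor positioning and line wrapping drift for the rest of the line. When both sides agree on width 1, the columns match even if the glyph visually overflows its cell.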
The character width situation is quite a mess, isn't it? Fonts render glyphs at different sizes so forget about trying to lay out based on the visible size of the glyph. The East Asian width is your best bet even though it's imperfect. I'm happy to help improve the situation for iTerm2 users but there's not much left I can do. I could help you determine the availability of the proprietary table switching escape code at runtime.
Of course
The codebases I've looked at, including iTerm2, zsh, neovim and tmux have all replaced (either optionally or automatically through build-time platform identification or build-time tests)
I think the box drawing characters should have width 1 by default, which a lot of CLI programs assume. Many CJK fonts set their widths to 2, but I think this can be resolved by introducing new terminal settings.
According to http://unicode.org/reports/tr11/#Recommendations, East Asian neutral-width characters should always map to either halfwidth or regular (narrow) characters.
Runes like U+0CA8 (ನ), U+0CB5 (ವ), U+0CB9 (ಹ), U+0C97 (ಗ), U+0CA6 (ದ), and U+0CB0 (ರ) are being reported as width 2, when they should be width 1.
This patch attempts to fix the issue, but many tests fail as a result (e.g. "non-printing dfff had width 1").
I am happy to help resolve the broken tests, but am not sure of the best approach. Please advise.