CJK characters are not treated as double width chars #62

Closed
Tyriar opened this issue Jun 2, 2016 · 9 comments
Labels
type/bug Something is misbehaving

Comments

@Tyriar
Member

Tyriar commented Jun 2, 2016

See chjj/term.js#96

image

vs gnome-terminal:

image

Notice that in gnome-terminal each CJK character is exactly two ASCII characters wide.

@parisk parisk added the type/bug Something is misbehaving label Jun 2, 2016
@parisk
Contributor

parisk commented Jun 3, 2016

Thanks for reporting. I will take a look at this along with the rest of the international character issues.

@parisk
Contributor

parisk commented Jun 7, 2016

It seems the weird line break is actually the div overflowing. This happens because these are double-width characters, but the terminal ignores that and counts each of them as a single-width one.

Working on a fix.

Screenshot

image

@jerch
Member

jerch commented Jun 13, 2016

@parisk To support this you will have to include a wcwidth calculation for every Unicode code point (see man wcwidth) and make fullwidth characters occupy 2 terminal cells (terminals are built on the idea that a single cell holds one halfwidth character, since all Western characters are halfwidth).
The problem with JavaScript and Unicode is that it is a total mess because of the UTF-16 encoding (surrogates), and the width is very expensive to calculate. It gets even more complicated if you plan to support stackable combining characters (wcwidth reports a width of zero for those). This might be easier to implement once the Unicode functions that work with real code points are widely adopted by the JS engines.
I tried to implement this according to the Unicode spec, but it ended up as a total code mess in my own terminal emulator.
Original source of wcwidth: https://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
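
To make the wcwidth idea above concrete, here is a minimal sketch of a per-code-point width function. It is not xterm.js code: the name `codePointWidth` is hypothetical, the wide ranges are only a subset of those in the wcwidth.c linked above, and only one combining block (U+0300 to U+036F) is treated as zero width. A real implementation would need the full tables.

```typescript
// Sketch only: reduced width lookup in the spirit of wcwidth().
// Assumption: WIDE_RANGES is a partial list modelled on Markus Kuhn's wcwidth.c.
type Range = readonly [number, number];

const WIDE_RANGES: readonly Range[] = [
  [0x1100, 0x115f], // Hangul Jamo initial consonants
  [0x2e80, 0x303e], // CJK radicals, symbols and punctuation (U+303F excluded)
  [0x3040, 0xa4cf], // Hiragana through Yi
  [0xac00, 0xd7a3], // Hangul syllables
  [0xf900, 0xfaff], // CJK compatibility ideographs
  [0xfe30, 0xfe6f], // CJK compatibility forms
  [0xff00, 0xff60], // Fullwidth forms
  [0xffe0, 0xffe6], // Fullwidth signs
  [0x20000, 0x2fffd], // Supplementary ideographic plane
  [0x30000, 0x3fffd], // Tertiary ideographic plane
];

// Returns the number of terminal cells a code point should occupy:
// -1 for control characters, 0 for NUL and combining marks, 2 for wide, else 1.
function codePointWidth(cp: number): number {
  if (cp === 0) return 0;
  if (cp < 0x20 || (cp >= 0x7f && cp < 0xa0)) return -1;
  // Simplification: only the basic Combining Diacritical Marks block.
  if (cp >= 0x0300 && cp <= 0x036f) return 0;
  for (const [lo, hi] of WIDE_RANGES) {
    if (cp >= lo && cp <= hi) return 2;
  }
  return 1;
}
```

With this sketch, `codePointWidth(0x4e2d)` (the character 中) yields 2 and `codePointWidth(0x41)` yields 1, so a row of CJK text consumes twice as many cells as its character count suggests.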

@jerch
Member

jerch commented Jun 16, 2016

I had a look at your line at cell abstraction; it is pretty much the same as what I did in my emulator. If you like, I can extend your isWide and write functions with the wcwidth stuff in a PR.
Another problem is rendering with Unicode. For higher BMP or non-BMP code points, a font's glyph width might differ from what wcwidth reports (if the font has that glyph at all). You will likely need a CSS class that fixes misaligned glyphs into place to keep the output a well-formed cell grid.
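
One way to picture the CSS fix described above is to wrap every cell in a fixed-width box, so a glyph that is narrower or wider than expected cannot shift its neighbours. The sketch below is an assumption, not the xterm.js renderer: the `Cell` type, `renderRow`, and the `cell-w1`/`cell-w2` class names are all hypothetical, and it relies on the CSS `ch` unit matching one terminal column in the chosen monospace font.

```typescript
// Sketch only: emit one fixed-width <span> per terminal cell.
interface Cell {
  text: string; // the character(s) stored in this cell
  width: 1 | 2; // columns the cell should occupy, e.g. from a wcwidth-style lookup
}

// Hypothetical stylesheet: each span is pinned to 1 or 2 columns.
const CELL_CSS = `
.cell-w1, .cell-w2 { display: inline-block; overflow: hidden; text-align: center; }
.cell-w1 { width: 1ch; }
.cell-w2 { width: 2ch; }
`;

// Escape the characters that are significant in HTML text and attributes.
function escapeHtml(s: string): string {
  return s
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Build the markup for one row of the grid.
function renderRow(cells: Cell[]): string {
  return cells
    .map((c) => `<span class="cell-w${c.width}">${escapeHtml(c.text)}</span>`)
    .join("");
}
```

The trade-off is one DOM element per cell, which may be too heavy for large buffers; applying the class only to wide or non-ASCII cells would be a cheaper variant.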

@parisk
Contributor

parisk commented Jun 16, 2016

Interesting. Do you believe that this could be handy? http://code.woong.org/wcwidth.js

@jerch
Member

jerch commented Jun 16, 2016

Yes, it should do the trick, though it is not 100% xterm compatible (xterm uses a slightly different version of it). If xterm compatibility is a major concern for xterm.js, we would have to extract the lookup tables from the xterm sources.
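
For context on what such lookup tables look like in use: the wcwidth.c linked earlier in the thread keeps sorted, non-overlapping ranges and searches them by bisection. The sketch below shows that technique only. The `inTable` helper and the `COMBINING_BLOCKS` table are hypothetical, and the table holds whole Unicode combining blocks as an approximation; it is not the table from xterm or from wcwidth.c.

```typescript
// Sketch only: binary search over a sorted table of inclusive ranges.
type Interval = readonly [number, number];

// Assumption: whole Unicode blocks, which may include unassigned code points.
const COMBINING_BLOCKS: readonly Interval[] = [
  [0x0300, 0x036f], // Combining Diacritical Marks
  [0x1ab0, 0x1aff], // Combining Diacritical Marks Extended
  [0x1dc0, 0x1dff], // Combining Diacritical Marks Supplement
  [0x20d0, 0x20ff], // Combining Diacritical Marks for Symbols
  [0xfe20, 0xfe2f], // Combining Half Marks
];

// Returns true if cp falls inside any interval; the table must be sorted
// by start value and must not contain overlapping intervals.
function inTable(cp: number, table: readonly Interval[]): boolean {
  let lo = 0;
  let hi = table.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const [start, end] = table[mid];
    if (cp < start) {
      hi = mid - 1;
    } else if (cp > end) {
      lo = mid + 1;
    } else {
      return true;
    }
  }
  return false;
}
```

Swapping in a different table, for example one derived from the xterm sources, would change the results without touching the search code, which is why compatibility mostly comes down to the table contents.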

@Tyriar
Member Author

Tyriar commented Jun 16, 2016

You can look at a solution to this here: chjj/term.js#97

@jerch
Member

jerch commented Jun 17, 2016

That seems to work for BMP fullwidth characters. Surrogates still fail in width and cursor positioning, as you can test with these characters: 𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡
No clue whether any character set in the other planes is important enough to justify implementing the dirty surrogate handling (see the polyfill here: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/codePointAt). At least Apple uses some code points of the private planes for their emoji.
Also, combining characters are not handled by this, as you can test with 'cafe\u0301'. This would be a small fix to the existing code, since the combining char has no width and would simply end up in the last active terminal cell --> ['c', 'a', 'f', 'e\u0301']. It is still a mess at the last cell in a row, though.
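
The two cases above, surrogate pairs and combining characters, can be sketched together as a function that splits a string into per-cell strings. This is an illustration under assumptions, not the fix from the linked PR: `toCells` and `isCombining` are hypothetical names, surrogates are decoded by hand with `charCodeAt` in the manner of the codePointAt polyfill, and only U+0300 to U+036F counts as combining.

```typescript
// Simplification: treat only the basic Combining Diacritical Marks as combining.
function isCombining(cp: number): boolean {
  return cp >= 0x0300 && cp <= 0x036f;
}

// Split a string into the text of each terminal cell. A valid surrogate pair
// stays together in one cell, and a combining mark is appended to the
// previous cell instead of opening a new one.
function toCells(s: string): string[] {
  const cells: string[] = [];
  let i = 0;
  while (i < s.length) {
    const hi = s.charCodeAt(i);
    let cp = hi;
    let len = 1;
    // Decode a high surrogate followed by a low surrogate into one code point.
    if (hi >= 0xd800 && hi <= 0xdbff && i + 1 < s.length) {
      const lo = s.charCodeAt(i + 1);
      if (lo >= 0xdc00 && lo <= 0xdfff) {
        cp = (hi - 0xd800) * 0x400 + (lo - 0xdc00) + 0x10000;
        len = 2;
      }
    }
    const text = s.slice(i, i + len);
    if (isCombining(cp) && cells.length > 0) {
      cells[cells.length - 1] += text; // zero width: attach to the previous cell
    } else {
      cells.push(text); // a leading combining mark gets its own cell
    }
    i += len;
  }
  return cells;
}
```

Calling `toCells('cafe\u0301')` gives the four cells listed above, and each of the double-struck digits occupies a single cell even though it is two UTF-16 code units long. The sketch does not address the end-of-row case mentioned above.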

@parisk
Contributor

parisk commented Jun 27, 2016

Closed in #144.

@parisk parisk closed this as completed Jun 27, 2016

3 participants