Strange characters show up when moving a cursor over hindi #72
Comments
This is directly related to the missing handling of combining characters mentioned in #62.
This is what I get under Linux and FF with the default monospace font (Liberation Sans Mono) at 30px: @parisk - as you stated here #144 (comment), the chars under the cursor change to something different. I can't reproduce this; for me they are the exact inverse of the normal chars. Maybe it is font related? Could you try it with different fonts? Also, some foreign character sets show weird ligature or stacking behavior (like "half combining", with different glyph output when written next to each other). That's the ugliest part of the Unicode specs and will never really fit into a monospaced environment. Another problem shown in the pics: the glyph widths are neither half nor full width. I don't know what is going on. A wild guess is that the font renderer is falling back to some other, non-monospaced font to get something shown, or the font maker just didn't make those glyphs monospaced at all. Either way it will break the cell grid. (I added the 'm' to see the m-width, which should be halfwidth in a monospace font.)
I got the same issue with Courier New and Source Code Pro. I do not think that this is a font issue. I guess that it is an issue with wcwidth, because I also compared a Hindi string with the cursor on top of it against the same string without the cursor, and they were the same.
Maybe this is caused by the ugly construction rules in Unicode --> http://unicode.org/faq/indic.html#17 (see Q: I cannot find the "half forms" of Devanagari letters (or any other Indic script) in the Unicode code charts. These characters are needed to form words such as "patni".) For comparison I've added xterm's wcwidth implementation below. It differs in some combining characters but also lacks handling of those higher-order rules of the Unicode specs. I fear they never got adapted to a terminal env at all.
int mk_wcwidth(wchar_t ucs)
{
unsigned long cmp = (unsigned long) ucs;
/* sorted list of non-overlapping intervals of non-spacing characters */
/* generated by
* uniset +cat=Me +cat=Mn +cat=Cf -00AD +1160-11FF +200B c
*/
static const struct interval combining[] = {
{ 0x0300, 0x036F }, { 0x0483, 0x0489 }, { 0x0591, 0x05BD },
{ 0x05BF, 0x05BF }, { 0x05C1, 0x05C2 }, { 0x05C4, 0x05C5 },
{ 0x05C7, 0x05C7 }, { 0x0600, 0x0604 }, { 0x0610, 0x061A },
{ 0x064B, 0x065F }, { 0x0670, 0x0670 }, { 0x06D6, 0x06DD },
{ 0x06DF, 0x06E4 }, { 0x06E7, 0x06E8 }, { 0x06EA, 0x06ED },
{ 0x070F, 0x070F }, { 0x0711, 0x0711 }, { 0x0730, 0x074A },
{ 0x07A6, 0x07B0 }, { 0x07EB, 0x07F3 }, { 0x0816, 0x0819 },
{ 0x081B, 0x0823 }, { 0x0825, 0x0827 }, { 0x0829, 0x082D },
{ 0x0859, 0x085B }, { 0x08E4, 0x08FE }, { 0x0900, 0x0902 },
{ 0x093A, 0x093A }, { 0x093C, 0x093C }, { 0x0941, 0x0948 },
{ 0x094D, 0x094D }, { 0x0951, 0x0957 }, { 0x0962, 0x0963 },
{ 0x0981, 0x0981 }, { 0x09BC, 0x09BC }, { 0x09C1, 0x09C4 },
{ 0x09CD, 0x09CD }, { 0x09E2, 0x09E3 }, { 0x0A01, 0x0A02 },
{ 0x0A3C, 0x0A3C }, { 0x0A41, 0x0A42 }, { 0x0A47, 0x0A48 },
{ 0x0A4B, 0x0A4D }, { 0x0A51, 0x0A51 }, { 0x0A70, 0x0A71 },
{ 0x0A75, 0x0A75 }, { 0x0A81, 0x0A82 }, { 0x0ABC, 0x0ABC },
{ 0x0AC1, 0x0AC5 }, { 0x0AC7, 0x0AC8 }, { 0x0ACD, 0x0ACD },
{ 0x0AE2, 0x0AE3 }, { 0x0B01, 0x0B01 }, { 0x0B3C, 0x0B3C },
{ 0x0B3F, 0x0B3F }, { 0x0B41, 0x0B44 }, { 0x0B4D, 0x0B4D },
{ 0x0B56, 0x0B56 }, { 0x0B62, 0x0B63 }, { 0x0B82, 0x0B82 },
{ 0x0BC0, 0x0BC0 }, { 0x0BCD, 0x0BCD }, { 0x0C3E, 0x0C40 },
{ 0x0C46, 0x0C48 }, { 0x0C4A, 0x0C4D }, { 0x0C55, 0x0C56 },
{ 0x0C62, 0x0C63 }, { 0x0CBC, 0x0CBC }, { 0x0CBF, 0x0CBF },
{ 0x0CC6, 0x0CC6 }, { 0x0CCC, 0x0CCD }, { 0x0CE2, 0x0CE3 },
{ 0x0D41, 0x0D44 }, { 0x0D4D, 0x0D4D }, { 0x0D62, 0x0D63 },
{ 0x0DCA, 0x0DCA }, { 0x0DD2, 0x0DD4 }, { 0x0DD6, 0x0DD6 },
{ 0x0E31, 0x0E31 }, { 0x0E34, 0x0E3A }, { 0x0E47, 0x0E4E },
{ 0x0EB1, 0x0EB1 }, { 0x0EB4, 0x0EB9 }, { 0x0EBB, 0x0EBC },
{ 0x0EC8, 0x0ECD }, { 0x0F18, 0x0F19 }, { 0x0F35, 0x0F35 },
{ 0x0F37, 0x0F37 }, { 0x0F39, 0x0F39 }, { 0x0F71, 0x0F7E },
{ 0x0F80, 0x0F84 }, { 0x0F86, 0x0F87 }, { 0x0F8D, 0x0F97 },
{ 0x0F99, 0x0FBC }, { 0x0FC6, 0x0FC6 }, { 0x102D, 0x1030 },
{ 0x1032, 0x1037 }, { 0x1039, 0x103A }, { 0x103D, 0x103E },
{ 0x1058, 0x1059 }, { 0x105E, 0x1060 }, { 0x1071, 0x1074 },
{ 0x1082, 0x1082 }, { 0x1085, 0x1086 }, { 0x108D, 0x108D },
{ 0x109D, 0x109D }, { 0x1160, 0x11FF }, { 0x135D, 0x135F },
{ 0x1712, 0x1714 }, { 0x1732, 0x1734 }, { 0x1752, 0x1753 },
{ 0x1772, 0x1773 }, { 0x17B4, 0x17B5 }, { 0x17B7, 0x17BD },
{ 0x17C6, 0x17C6 }, { 0x17C9, 0x17D3 }, { 0x17DD, 0x17DD },
{ 0x180B, 0x180D }, { 0x18A9, 0x18A9 }, { 0x1920, 0x1922 },
{ 0x1927, 0x1928 }, { 0x1932, 0x1932 }, { 0x1939, 0x193B },
{ 0x1A17, 0x1A18 }, { 0x1A56, 0x1A56 }, { 0x1A58, 0x1A5E },
{ 0x1A60, 0x1A60 }, { 0x1A62, 0x1A62 }, { 0x1A65, 0x1A6C },
{ 0x1A73, 0x1A7C }, { 0x1A7F, 0x1A7F }, { 0x1B00, 0x1B03 },
{ 0x1B34, 0x1B34 }, { 0x1B36, 0x1B3A }, { 0x1B3C, 0x1B3C },
{ 0x1B42, 0x1B42 }, { 0x1B6B, 0x1B73 }, { 0x1B80, 0x1B81 },
{ 0x1BA2, 0x1BA5 }, { 0x1BA8, 0x1BA9 }, { 0x1BAB, 0x1BAB },
{ 0x1BE6, 0x1BE6 }, { 0x1BE8, 0x1BE9 }, { 0x1BED, 0x1BED },
{ 0x1BEF, 0x1BF1 }, { 0x1C2C, 0x1C33 }, { 0x1C36, 0x1C37 },
{ 0x1CD0, 0x1CD2 }, { 0x1CD4, 0x1CE0 }, { 0x1CE2, 0x1CE8 },
{ 0x1CED, 0x1CED }, { 0x1CF4, 0x1CF4 }, { 0x1DC0, 0x1DE6 },
{ 0x1DFC, 0x1DFF }, { 0x200B, 0x200F }, { 0x202A, 0x202E },
{ 0x2060, 0x2064 }, { 0x206A, 0x206F }, { 0x20D0, 0x20F0 },
{ 0x2CEF, 0x2CF1 }, { 0x2D7F, 0x2D7F }, { 0x2DE0, 0x2DFF },
{ 0x302A, 0x302D }, { 0x3099, 0x309A }, { 0xA66F, 0xA672 },
{ 0xA674, 0xA67D }, { 0xA69F, 0xA69F }, { 0xA6F0, 0xA6F1 },
{ 0xA802, 0xA802 }, { 0xA806, 0xA806 }, { 0xA80B, 0xA80B },
{ 0xA825, 0xA826 }, { 0xA8C4, 0xA8C4 }, { 0xA8E0, 0xA8F1 },
{ 0xA926, 0xA92D }, { 0xA947, 0xA951 }, { 0xA980, 0xA982 },
{ 0xA9B3, 0xA9B3 }, { 0xA9B6, 0xA9B9 }, { 0xA9BC, 0xA9BC },
{ 0xAA29, 0xAA2E }, { 0xAA31, 0xAA32 }, { 0xAA35, 0xAA36 },
{ 0xAA43, 0xAA43 }, { 0xAA4C, 0xAA4C }, { 0xAAB0, 0xAAB0 },
{ 0xAAB2, 0xAAB4 }, { 0xAAB7, 0xAAB8 }, { 0xAABE, 0xAABF },
{ 0xAAC1, 0xAAC1 }, { 0xAAEC, 0xAAED }, { 0xAAF6, 0xAAF6 },
{ 0xABE5, 0xABE5 }, { 0xABE8, 0xABE8 }, { 0xABED, 0xABED },
{ 0xFB1E, 0xFB1E }, { 0xFE00, 0xFE0F }, { 0xFE20, 0xFE26 },
{ 0xFEFF, 0xFEFF }, { 0xFFF9, 0xFFFB }, { 0x101FD, 0x101FD },
{ 0x10A01, 0x10A03 }, { 0x10A05, 0x10A06 }, { 0x10A0C, 0x10A0F },
{ 0x10A38, 0x10A3A }, { 0x10A3F, 0x10A3F }, { 0x11001, 0x11001 },
{ 0x11038, 0x11046 }, { 0x11080, 0x11081 }, { 0x110B3, 0x110B6 },
{ 0x110B9, 0x110BA }, { 0x110BD, 0x110BD }, { 0x11100, 0x11102 },
{ 0x11127, 0x1112B }, { 0x1112D, 0x11134 }, { 0x11180, 0x11181 },
{ 0x111B6, 0x111BE }, { 0x116AB, 0x116AB }, { 0x116AD, 0x116AD },
{ 0x116B0, 0x116B5 }, { 0x116B7, 0x116B7 }, { 0x16F8F, 0x16F92 },
{ 0x1D167, 0x1D169 }, { 0x1D173, 0x1D182 }, { 0x1D185, 0x1D18B },
{ 0x1D1AA, 0x1D1AD }, { 0x1D242, 0x1D244 }, { 0xE0001, 0xE0001 },
{ 0xE0020, 0xE007F }, { 0xE0100, 0xE01EF }
};
    /* test for 8-bit control characters */
    if (cmp == 0)
        return 0;
    if (cmp < 32 || (cmp >= 0x7f && cmp < 0xa0))
        return -1;
    /* binary search in table of non-spacing characters */
    if (bisearch(cmp, combining,
                 (int) (sizeof(combining) / sizeof(struct interval) - 1)))
        return 0;
    /* if we arrive here, cmp is not a combining or C0/C1 control character */
    return 1 +
        (cmp >= 0x1100 &&
         (cmp <= 0x115f ||                    /* Hangul Jamo init. consonants */
          cmp == 0x2329 || cmp == 0x232a ||
          (cmp >= 0x2e80 && cmp <= 0xa4cf &&
           cmp != 0x303f) ||                  /* CJK ... Yi */
          (cmp >= 0xac00 && cmp <= 0xd7a3) || /* Hangul Syllables */
          (cmp >= 0xf900 && cmp <= 0xfaff) || /* CJK Compatibility Ideographs */
          (cmp >= 0xfe10 && cmp <= 0xfe19) || /* Vertical forms */
          (cmp >= 0xfe30 && cmp <= 0xfe6f) || /* CJK Compatibility Forms */
          (cmp >= 0xff00 && cmp <= 0xff60) || /* Fullwidth Forms */
          (cmp >= 0xffe0 && cmp <= 0xffe6) ||
          (cmp >= 0x20000 && cmp <= 0x2fffd) ||
          (cmp >= 0x30000 && cmp <= 0x3fffd)));
}
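The function above uses `struct interval` and `bisearch`, neither of which is shown in the paste. Below is a minimal sketch of what they could look like, assuming inclusive `{ first, last }` ranges and a `max` argument that is the index of the last table entry (which matches the `sizeof(...) - 1` call above). The field types, the table excerpt `devanagari_marks` and the helper `excerpt_width` are assumptions made for illustration; they are not taken from xterm.

```c
#include <stddef.h>

/* Inclusive range of code points, matching the { first, last } pairs above. */
struct interval {
    unsigned long first;
    unsigned long last;
};

/* Binary search over a sorted, non-overlapping interval table.
 * max is the index of the last entry. Returns 1 on a hit, 0 otherwise. */
static int bisearch(unsigned long ucs, const struct interval *table, int max)
{
    int min = 0;

    if (ucs < table[0].first || ucs > table[max].last)
        return 0;
    while (max >= min) {
        int mid = min + (max - min) / 2;
        if (ucs > table[mid].last)
            min = mid + 1;
        else if (ucs < table[mid].first)
            max = mid - 1;
        else
            return 1;
    }
    return 0;
}

/* Devanagari rows copied from the table above (illustrative excerpt only). */
static const struct interval devanagari_marks[] = {
    { 0x0900, 0x0902 }, { 0x093A, 0x093A }, { 0x093C, 0x093C },
    { 0x0941, 0x0948 }, { 0x094D, 0x094D }, { 0x0951, 0x0957 },
    { 0x0962, 0x0963 }
};

/* Hypothetical helper: sum of per-code-point widths, counting 0 for a
 * table hit and 1 for everything else. */
static int excerpt_width(const unsigned long *cps, size_t n)
{
    int max = (int) (sizeof(devanagari_marks) / sizeof(devanagari_marks[0]) - 1);
    int width = 0;

    for (size_t i = 0; i < n; i++)
        width += bisearch(cps[i], devanagari_marks, max) ? 0 : 1;
    return width;
}
```

For the text from this issue, अर्धतत्सम (U+0905 U+0930 U+094D U+0927 U+0924 U+0924 U+094D U+0938 U+092E), only the two U+094D code points hit the table, so a per-code-point sum like this assigns 7 cells no matter how the font shapes the conjuncts.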
Thanks a lot for providing this information @jerch. I do not have the time to deal with this at the moment, but it will definitely be helpful when someone gets on top of this issue.
Maybe real grapheme support will solve this: http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
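To make the idea concrete, here is a rough sketch of grouping code points into clusters before assigning cells. It is a crude approximation and does not implement the boundary rules from the report linked above: `is_zero_width_mark` only covers the Devanagari ranges that the wcwidth table in this thread treats as zero width, and the `join_after_virama` switch is a hypothetical simplification of conjunct formation.

```c
#include <stddef.h>

#define VIRAMA 0x094DUL /* DEVANAGARI SIGN VIRAMA */

/* Hypothetical predicate: Devanagari code points that the wcwidth table
 * in this thread lists as non-spacing. */
static int is_zero_width_mark(unsigned long cp)
{
    return (cp >= 0x0900 && cp <= 0x0902) || cp == 0x093A || cp == 0x093C ||
           (cp >= 0x0941 && cp <= 0x0948) || cp == VIRAMA ||
           (cp >= 0x0951 && cp <= 0x0957) || (cp >= 0x0962 && cp <= 0x0963);
}

/* Count clusters: a mark attaches to the cluster before it. If
 * join_after_virama is non-zero, a code point that directly follows a
 * virama is also kept in the preceding cluster (simplified conjunct rule).
 * A mark at the very start of the input is not counted on its own. */
static size_t count_clusters(const unsigned long *cps, size_t n,
                             int join_after_virama)
{
    size_t clusters = 0;

    for (size_t i = 0; i < n; i++) {
        if (is_zero_width_mark(cps[i]))
            continue; /* mark extends the current cluster */
        if (join_after_virama && i > 0 && cps[i - 1] == VIRAMA)
            continue; /* joins the preceding conjunct */
        clusters++;
    }
    return clusters;
}
```

For अर्धतत्सम this yields 7 clusters with mark attachment only and 5 once the virama joins consonants (अ, र्ध, त, त्स, म). How many cells each of those clusters should occupy is still left open, which is the part a terminal has to decide.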
This should probably be left for someone familiar with the IME; perhaps it's behaving as expected?
@jerch should this be handled by the renderer or by the font? I don't have much experience with international characters, so my questions might be quite basic.
@parisk Kinda both. A font will have some of those rules implemented to do the combined drawing and ligature stuff, which results in so-called user-perceived characters. This is the final composed "sign" a user would recognize as one character. A terminal with its monospace env makes this even more complicated, since it all has to fit somehow into a cell grid. Example - this is perfectly legal in Unicode (example from here):
It will break all terminal emulators out there: cursor in the wrong position, overdrawn rubbish, wrong cell widths and heights. It is also a good example to play with in a word processor using different fonts; the output will be quite different.
I am certain that there is no solution that fits them all. I believe the most important part is to decide which Unicode features xterm.js supports and not put any effort into anything beyond these. I am not sure at the moment how big each segment is (CJK, Hindi, ancient Indo-Germanic, etc.), so it is hard to decide which ones have the most impact.
Closing since I did the IME work recently and none of us know enough about Hindi to verify correct behavior 😄
Text: अर्धतत्सम
Characters change:
Can't move further back than this: