Strange characters show up when moving a cursor over hindi #72

Tyriar · 2016-06-02T09:02:13Z

Text: अर्धतत्सम

Characters change:

Can't move further back than this:

jerch · 2016-06-20T10:08:03Z

This is directly related to the missing handling of combining characters mentioned in #62

jerch · 2016-06-27T17:14:47Z

This is what I get on Linux in Firefox with the default monospace font (Liberation Sans Mono) at 30px:

@parisk - as you stated in #144 (comment), the chars under the cursor change to something different. I can't reproduce this; for me they are the exact inverse of the normal chars. Maybe it is font related? Could you try it with different fonts? Also, some foreign character sets show weird ligature or stacking behavior (like "half combining", with different glyph output when written next to each other). That's the ugliest part of the Unicode specs and will never really fit into a monospaced environment.

Another problem shown in the pics: the glyph widths are neither half nor full width. I don't know what is going on. A wild guess is that the font renderer is falling back to some other, non-monospaced font to get something shown, or the font maker just didn't make those glyphs monospaced at all. Either way it will break the cell grid. (I added the 'm' to see the m-width, which should be halfwidth in a monospace font.)

parisk · 2016-06-28T09:04:54Z

I got the same issue with Courier New and Source Code Pro. I do not think that this is a font issue.

I guess that it is an issue with wcwidth, because I also compared a Hindi string with the cursor on top of it against the same string without the cursor, and they were the same.

jerch · 2016-06-28T18:20:07Z

Maybe this is caused by the ugly construction rules in Unicode --> http://unicode.org/faq/indic.html#17 (see Q: I cannot find the "half forms" of Devanagari letters (or any other Indic script) in the Unicode code charts. These characters are needed to form words such as "patni".)
Neither wcwidth nor the actual cursor handling (marked with a span in xterm.js) can handle those special cases at the moment. I am not even sure whether there is any monospaced font that prints those characters correctly; xterm gives me just a blank line. Maybe a typographer from India with some terminal addiction is needed for clarification.
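
To make the FAQ's point concrete for the word from the report: half forms are not separate code points but are requested with consonant + virama (U+094D) + consonant. The sketch below counts such sequences. It is only an illustration under stated assumptions: the function names and the `ARDHATATSAMA` array are made up here, and only the core consonant range U+0915..U+0939 is recognized.

```c
#include <assert.h>
#include <stddef.h>

#define VIRAMA 0x094DUL  /* DEVANAGARI SIGN VIRAMA */

/* Assumption: only the core consonant range KA..HA (U+0915..U+0939)
 * is treated as a consonant; nukta forms and other scripts are ignored. */
int is_devanagari_consonant(unsigned long cp)
{
  return cp >= 0x0915UL && cp <= 0x0939UL;
}

/* Count consonant + virama + consonant sequences, i.e. the places where
 * the encoding asks for a half form or conjunct. */
size_t count_conjunct_joins(const unsigned long *cps, size_t n)
{
  size_t joins = 0;

  assert(cps != NULL || n == 0);
  for (size_t i = 0; i + 2 < n; i++) {
    if (is_devanagari_consonant(cps[i]) && cps[i + 1] == VIRAMA &&
        is_devanagari_consonant(cps[i + 2]))
      joins++;
  }
  return joins;
}

/* The word from the report, अर्धतत्सम, as code points. */
const unsigned long ARDHATATSAMA[] = {
  0x0905, 0x0930, 0x094D, 0x0927, 0x0924, 0x0924, 0x094D, 0x0938, 0x092E
};
```

For अर्धतत्सम the count is 2 (र्ध and त्स), so a shaping font is likely to draw fewer visual units than a per-code-point width count suggests.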

For comparison I've added xterm's wcwidth implementation below. It differs in some combining characters but also lacks handling of those higher-order rules of the Unicode specs. I fear they never got adapted to a terminal environment at all.

int mk_wcwidth(wchar_t ucs)
{
  unsigned long cmp = (unsigned long) ucs;

  /* sorted list of non-overlapping intervals of non-spacing characters */
  /* generated by
   *    uniset +cat=Me +cat=Mn +cat=Cf -00AD +1160-11FF +200B c
   */
  static const struct interval combining[] = {
    { 0x0300, 0x036F }, { 0x0483, 0x0489 }, { 0x0591, 0x05BD },
    { 0x05BF, 0x05BF }, { 0x05C1, 0x05C2 }, { 0x05C4, 0x05C5 },
    { 0x05C7, 0x05C7 }, { 0x0600, 0x0604 }, { 0x0610, 0x061A },
    { 0x064B, 0x065F }, { 0x0670, 0x0670 }, { 0x06D6, 0x06DD },
    { 0x06DF, 0x06E4 }, { 0x06E7, 0x06E8 }, { 0x06EA, 0x06ED },
    { 0x070F, 0x070F }, { 0x0711, 0x0711 }, { 0x0730, 0x074A },
    { 0x07A6, 0x07B0 }, { 0x07EB, 0x07F3 }, { 0x0816, 0x0819 },
    { 0x081B, 0x0823 }, { 0x0825, 0x0827 }, { 0x0829, 0x082D },
    { 0x0859, 0x085B }, { 0x08E4, 0x08FE }, { 0x0900, 0x0902 },
    { 0x093A, 0x093A }, { 0x093C, 0x093C }, { 0x0941, 0x0948 },
    { 0x094D, 0x094D }, { 0x0951, 0x0957 }, { 0x0962, 0x0963 },
    { 0x0981, 0x0981 }, { 0x09BC, 0x09BC }, { 0x09C1, 0x09C4 },
    { 0x09CD, 0x09CD }, { 0x09E2, 0x09E3 }, { 0x0A01, 0x0A02 },
    { 0x0A3C, 0x0A3C }, { 0x0A41, 0x0A42 }, { 0x0A47, 0x0A48 },
    { 0x0A4B, 0x0A4D }, { 0x0A51, 0x0A51 }, { 0x0A70, 0x0A71 },
    { 0x0A75, 0x0A75 }, { 0x0A81, 0x0A82 }, { 0x0ABC, 0x0ABC },
    { 0x0AC1, 0x0AC5 }, { 0x0AC7, 0x0AC8 }, { 0x0ACD, 0x0ACD },
    { 0x0AE2, 0x0AE3 }, { 0x0B01, 0x0B01 }, { 0x0B3C, 0x0B3C },
    { 0x0B3F, 0x0B3F }, { 0x0B41, 0x0B44 }, { 0x0B4D, 0x0B4D },
    { 0x0B56, 0x0B56 }, { 0x0B62, 0x0B63 }, { 0x0B82, 0x0B82 },
    { 0x0BC0, 0x0BC0 }, { 0x0BCD, 0x0BCD }, { 0x0C3E, 0x0C40 },
    { 0x0C46, 0x0C48 }, { 0x0C4A, 0x0C4D }, { 0x0C55, 0x0C56 },
    { 0x0C62, 0x0C63 }, { 0x0CBC, 0x0CBC }, { 0x0CBF, 0x0CBF },
    { 0x0CC6, 0x0CC6 }, { 0x0CCC, 0x0CCD }, { 0x0CE2, 0x0CE3 },
    { 0x0D41, 0x0D44 }, { 0x0D4D, 0x0D4D }, { 0x0D62, 0x0D63 },
    { 0x0DCA, 0x0DCA }, { 0x0DD2, 0x0DD4 }, { 0x0DD6, 0x0DD6 },
    { 0x0E31, 0x0E31 }, { 0x0E34, 0x0E3A }, { 0x0E47, 0x0E4E },
    { 0x0EB1, 0x0EB1 }, { 0x0EB4, 0x0EB9 }, { 0x0EBB, 0x0EBC },
    { 0x0EC8, 0x0ECD }, { 0x0F18, 0x0F19 }, { 0x0F35, 0x0F35 },
    { 0x0F37, 0x0F37 }, { 0x0F39, 0x0F39 }, { 0x0F71, 0x0F7E },
    { 0x0F80, 0x0F84 }, { 0x0F86, 0x0F87 }, { 0x0F8D, 0x0F97 },
    { 0x0F99, 0x0FBC }, { 0x0FC6, 0x0FC6 }, { 0x102D, 0x1030 },
    { 0x1032, 0x1037 }, { 0x1039, 0x103A }, { 0x103D, 0x103E },
    { 0x1058, 0x1059 }, { 0x105E, 0x1060 }, { 0x1071, 0x1074 },
    { 0x1082, 0x1082 }, { 0x1085, 0x1086 }, { 0x108D, 0x108D },
    { 0x109D, 0x109D }, { 0x1160, 0x11FF }, { 0x135D, 0x135F },
    { 0x1712, 0x1714 }, { 0x1732, 0x1734 }, { 0x1752, 0x1753 },
    { 0x1772, 0x1773 }, { 0x17B4, 0x17B5 }, { 0x17B7, 0x17BD },
    { 0x17C6, 0x17C6 }, { 0x17C9, 0x17D3 }, { 0x17DD, 0x17DD },
    { 0x180B, 0x180D }, { 0x18A9, 0x18A9 }, { 0x1920, 0x1922 },
    { 0x1927, 0x1928 }, { 0x1932, 0x1932 }, { 0x1939, 0x193B },
    { 0x1A17, 0x1A18 }, { 0x1A56, 0x1A56 }, { 0x1A58, 0x1A5E },
    { 0x1A60, 0x1A60 }, { 0x1A62, 0x1A62 }, { 0x1A65, 0x1A6C },
    { 0x1A73, 0x1A7C }, { 0x1A7F, 0x1A7F }, { 0x1B00, 0x1B03 },
    { 0x1B34, 0x1B34 }, { 0x1B36, 0x1B3A }, { 0x1B3C, 0x1B3C },
    { 0x1B42, 0x1B42 }, { 0x1B6B, 0x1B73 }, { 0x1B80, 0x1B81 },
    { 0x1BA2, 0x1BA5 }, { 0x1BA8, 0x1BA9 }, { 0x1BAB, 0x1BAB },
    { 0x1BE6, 0x1BE6 }, { 0x1BE8, 0x1BE9 }, { 0x1BED, 0x1BED },
    { 0x1BEF, 0x1BF1 }, { 0x1C2C, 0x1C33 }, { 0x1C36, 0x1C37 },
    { 0x1CD0, 0x1CD2 }, { 0x1CD4, 0x1CE0 }, { 0x1CE2, 0x1CE8 },
    { 0x1CED, 0x1CED }, { 0x1CF4, 0x1CF4 }, { 0x1DC0, 0x1DE6 },
    { 0x1DFC, 0x1DFF }, { 0x200B, 0x200F }, { 0x202A, 0x202E },
    { 0x2060, 0x2064 }, { 0x206A, 0x206F }, { 0x20D0, 0x20F0 },
    { 0x2CEF, 0x2CF1 }, { 0x2D7F, 0x2D7F }, { 0x2DE0, 0x2DFF },
    { 0x302A, 0x302D }, { 0x3099, 0x309A }, { 0xA66F, 0xA672 },
    { 0xA674, 0xA67D }, { 0xA69F, 0xA69F }, { 0xA6F0, 0xA6F1 },
    { 0xA802, 0xA802 }, { 0xA806, 0xA806 }, { 0xA80B, 0xA80B },
    { 0xA825, 0xA826 }, { 0xA8C4, 0xA8C4 }, { 0xA8E0, 0xA8F1 },
    { 0xA926, 0xA92D }, { 0xA947, 0xA951 }, { 0xA980, 0xA982 },
    { 0xA9B3, 0xA9B3 }, { 0xA9B6, 0xA9B9 }, { 0xA9BC, 0xA9BC },
    { 0xAA29, 0xAA2E }, { 0xAA31, 0xAA32 }, { 0xAA35, 0xAA36 },
    { 0xAA43, 0xAA43 }, { 0xAA4C, 0xAA4C }, { 0xAAB0, 0xAAB0 },
    { 0xAAB2, 0xAAB4 }, { 0xAAB7, 0xAAB8 }, { 0xAABE, 0xAABF },
    { 0xAAC1, 0xAAC1 }, { 0xAAEC, 0xAAED }, { 0xAAF6, 0xAAF6 },
    { 0xABE5, 0xABE5 }, { 0xABE8, 0xABE8 }, { 0xABED, 0xABED },
    { 0xFB1E, 0xFB1E }, { 0xFE00, 0xFE0F }, { 0xFE20, 0xFE26 },
    { 0xFEFF, 0xFEFF }, { 0xFFF9, 0xFFFB }, { 0x101FD, 0x101FD },
    { 0x10A01, 0x10A03 }, { 0x10A05, 0x10A06 }, { 0x10A0C, 0x10A0F },
    { 0x10A38, 0x10A3A }, { 0x10A3F, 0x10A3F }, { 0x11001, 0x11001 },
    { 0x11038, 0x11046 }, { 0x11080, 0x11081 }, { 0x110B3, 0x110B6 },
    { 0x110B9, 0x110BA }, { 0x110BD, 0x110BD }, { 0x11100, 0x11102 },
    { 0x11127, 0x1112B }, { 0x1112D, 0x11134 }, { 0x11180, 0x11181 },
    { 0x111B6, 0x111BE }, { 0x116AB, 0x116AB }, { 0x116AD, 0x116AD },
    { 0x116B0, 0x116B5 }, { 0x116B7, 0x116B7 }, { 0x16F8F, 0x16F92 },
    { 0x1D167, 0x1D169 }, { 0x1D173, 0x1D182 }, { 0x1D185, 0x1D18B },
    { 0x1D1AA, 0x1D1AD }, { 0x1D242, 0x1D244 }, { 0xE0001, 0xE0001 },
    { 0xE0020, 0xE007F }, { 0xE0100, 0xE01EF }
  };

  /* test for 8-bit control characters */
  if (cmp == 0)
    return 0;
  if (cmp < 32 || (cmp >= 0x7f && cmp < 0xa0))
    return -1;

  /* binary search in table of non-spacing characters */
  if (bisearch(cmp, combining,
               (int) (sizeof(combining) / sizeof(struct interval) - 1)))
    return 0;

  /* if we arrive here, cmp is not a combining or C0/C1 control character */

  return 1 +
    (cmp >= 0x1100 &&
     (cmp <= 0x115f ||                    /* Hangul Jamo init. consonants */
      cmp == 0x2329 || cmp == 0x232a ||
      (cmp >= 0x2e80 && cmp <= 0xa4cf &&
       cmp != 0x303f) ||                  /* CJK ... Yi */
      (cmp >= 0xac00 && cmp <= 0xd7a3) || /* Hangul Syllables */
      (cmp >= 0xf900 && cmp <= 0xfaff) || /* CJK Compatibility Ideographs */
      (cmp >= 0xfe10 && cmp <= 0xfe19) || /* Vertical forms */
      (cmp >= 0xfe30 && cmp <= 0xfe6f) || /* CJK Compatibility Forms */
      (cmp >= 0xff00 && cmp <= 0xff60) || /* Fullwidth Forms */
      (cmp >= 0xffe0 && cmp <= 0xffe6) ||
      (cmp >= 0x20000 && cmp <= 0x2fffd) ||
      (cmp >= 0x30000 && cmp <= 0x3fffd)));
}
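
The listing above uses `struct interval` and calls `bisearch`, neither of which is shown. A minimal sketch of both follows, assuming the calling convention implied by `sizeof(combining) / sizeof(struct interval) - 1`, i.e. that the third argument is the index of the last entry. The excerpt table `devanagari_marks` and the helper `is_zero_width` are hypothetical names added for illustration, not part of xterm.

```c
#include <assert.h>
#include <stddef.h>

/* Closed range [first, last] of code points, as in the combining[] table. */
struct interval {
  unsigned long first;
  unsigned long last;
};

/* Binary search over a sorted, non-overlapping interval table.
 * `max` is the index of the LAST entry, not the entry count.
 * Returns 1 if ucs lies inside any interval, 0 otherwise. */
int bisearch(unsigned long ucs, const struct interval *table, int max)
{
  int min = 0;

  assert(table != NULL);
  if (max < 0 || ucs < table[0].first || ucs > table[max].last)
    return 0;
  while (max >= min) {
    int mid = min + (max - min) / 2;
    if (ucs > table[mid].last)
      min = mid + 1;
    else if (ucs < table[mid].first)
      max = mid - 1;
    else
      return 1;
  }
  return 0;
}

/* Excerpt: only the Devanagari rows copied from the table above. */
static const struct interval devanagari_marks[] = {
  { 0x0900, 0x0902 }, { 0x093A, 0x093A }, { 0x093C, 0x093C },
  { 0x0941, 0x0948 }, { 0x094D, 0x094D }, { 0x0951, 0x0957 },
  { 0x0962, 0x0963 }
};

/* Hypothetical helper: 1 if ucs is listed as non-spacing in the excerpt. */
int is_zero_width(unsigned long ucs)
{
  int last = (int) (sizeof(devanagari_marks) / sizeof(devanagari_marks[0])) - 1;
  return bisearch(ucs, devanagari_marks, last);
}
```

With this excerpt, the two viramas (U+094D) in अर्धतत्सम come out as zero width, while the other seven code points fall through to width 1 in `mk_wcwidth`.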

parisk · 2016-06-29T22:36:01Z

Thanks a lot for providing this information @jerch. I do not have the time to deal with this at the moment, but it will definitely be helpful when someone gets on top of this issue.

jerch · 2016-07-08T12:20:52Z

Maybe real grapheme support will solve this: http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
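
As a rough idea of what cluster-aware cursor movement could look like, here is a heavily simplified sketch: a cluster is one code point plus any directly following "extend" marks. This is not the full rule set from the report linked above. `is_extend` is a hand-picked stand-in for the real Grapheme_Extend property, and all function names are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the Grapheme_Extend property: only a few
 * ranges relevant to this thread are covered. */
int is_extend(unsigned long cp)
{
  return (cp >= 0x0300UL && cp <= 0x036FUL) ||  /* Combining Diacritical Marks */
         (cp >= 0x0941UL && cp <= 0x0948UL) ||  /* Devanagari vowel signs U..AI */
         cp == 0x094DUL;                         /* Devanagari virama */
}

/* Index of the next cluster boundary after pos, so a cursor moving right
 * would jump from pos straight to the returned index. */
size_t next_boundary(const unsigned long *cps, size_t n, size_t pos)
{
  assert(cps != NULL || n == 0);
  if (pos >= n)
    return n;
  pos++;                                  /* consume the leading code point */
  while (pos < n && is_extend(cps[pos]))
    pos++;                                /* absorb the marks that follow it */
  return pos;
}

/* Number of clusters, i.e. cursor stops, in the sequence. */
size_t count_clusters(const unsigned long *cps, size_t n)
{
  size_t count = 0;

  for (size_t pos = 0; pos < n; pos = next_boundary(cps, n, pos))
    count++;
  return count;
}
```

Under these simplified rules the nine code points of अर्धतत्सम form seven cursor stops, because each virama stays attached to the consonant before it. Whether a conjunct such as त्स should count as a single stop is exactly the kind of question the full specification has to answer.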

Tyriar · 2016-07-08T18:29:32Z

This should probably be left for someone familiar with the IME. Perhaps it's behaving as expected?

parisk · 2016-07-08T22:29:11Z

@jerch should this be part of the renderer or the font? I don't have much experience with international characters, so my questions might be quite basic.

jerch · 2016-07-08T23:30:33Z

@parisk Kinda both. A font will have some of those rules implemented to do the combined drawing and ligature stuff, which results in so-called user-perceived characters. This is the final composed "sign" a user would recognize as one character.
On the other hand, a renderer should handle this accordingly; for example, the cursor should "see" it as one character by default (no jumps between the code points). This raises the question of how to actually insert, edit, or break those characters into pieces, hence the different IME approaches. This is a big problem for nearly all editors and word processors and is only ever partly solved; Unicode is just broken at this point. (We had a customer dealing with ancient Indo-European languages: the components for the weird chars are all in the Unicode spec, but there is basically no tool to input them correctly and only one font capable of drawing 70% of them.)

A terminal with its monospace environment makes this even more complicated, since it all has to fit somehow into a cell grid. Example: this is perfectly legal in Unicode (example from here):

'Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞'

It will break all terminal emulators out there: cursor in the wrong position, overdrawn rubbish, wrong cell widths and heights. It is also a good example to play with different fonts in a word processor; the output will be quite different.
Here are some thoughts about JavaScript and Unicode. Somewhat disheartening.
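
To get a feel for why the stacked-marks example strains a cell grid, the sketch below measures how many combining marks pile onto a single base character. It is an illustration only: the function names are invented here, and only the Combining Diacritical Marks block (U+0300..U+036F) is treated as combining.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: only U+0300..U+036F counts as a combining mark here. */
int is_combining_diacritic(unsigned long cp)
{
  return cp >= 0x0300UL && cp <= 0x036FUL;
}

/* Length of the longest run of consecutive combining marks, which is the
 * tallest stack a single cell would have to store and draw. */
size_t max_marks_per_base(const unsigned long *cps, size_t n)
{
  size_t longest = 0, run = 0;

  assert(cps != NULL || n == 0);
  for (size_t i = 0; i < n; i++) {
    if (is_combining_diacritic(cps[i])) {
      run++;
      if (run > longest)
        longest = run;
    } else {
      run = 0;  /* a non-combining code point starts a new stack */
    }
  }
  return longest;
}
```

Every mark in such a run has zero width, so the whole stack occupies one cell horizontally while growing without bound vertically, which fits the overdrawn output described above.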

parisk · 2016-07-11T18:21:56Z

I am certain that there is no solution that fits them all. I believe that the most important part is to decide which Unicode features xterm.js supports and not put any effort into anything beyond these.

Not sure how big each segment is at the moment though (CJK, Hindi, ancient Indo-European, etc.), which we would need to know to decide which ones have the most impact.

Tyriar · 2016-09-17T11:58:17Z

Closing since I did the IME work recently and none of us know enough about Hindi to verify correct behavior 😄

jerch mentioned this issue Jun 24, 2016

wcwidth calculation #144

Merged

Tyriar closed this as completed Sep 17, 2016

jerch mentioned this issue May 20, 2018

Support options to treat ambiguous width characters as double width #1453

Closed

shreevatsa mentioned this issue Nov 27, 2018

Wrong width for Hindi on macOS, but correct width on Linux jquast/wcwidth#25

Closed
