Strange characters show up when moving a cursor over hindi #72

Closed
Tyriar opened this issue Jun 2, 2016 · 11 comments
Comments

@Tyriar
Member

Tyriar commented Jun 2, 2016

Text: अर्धतत्सम

image

image

Characters change:

image

Can't move further back than this:

image

@jerch
Member

jerch commented Jun 20, 2016

This is directly related to the missing handling of combining characters mentioned in #62

@jerch
Member

jerch commented Jun 27, 2016

This is what I get under Linux and Firefox with the default monospace font (Liberation Sans Mono) at 30px:

image (seven screenshots, numbered 1 to 7)

@parisk - as you stated here #144 (comment), the chars under the cursor change to something different. I can't reproduce this; for me they are the exact inverse of the normal chars. Maybe it is font related? Could you try it with different fonts? Also, some scripts show odd ligature or stacking behavior (like "half combining" forms whose glyph output changes when they are written next to each other). That's the ugliest part of the Unicode spec and will never really fit into a monospaced environment.

Another problem shown in the pics: the glyph widths are neither half nor full width. I don't know what is going on. A wild guess is that the font renderer is falling back to some other, non-monospaced font to get something shown, or the font maker just didn't make those glyphs monospaced at all. Either way it will break the cell grid. (I added the 'm' to see the m-width, which should be halfwidth in a monospace font.)

@parisk
Contributor

parisk commented Jun 28, 2016

I got the same issue with Courier New and Source Code Pro. I do not think that this is a font issue.

I guess that it is an issue with wcwidth, because I also compared a Hindi string with the cursor on top of it against the same string without the cursor, and they were the same.

@jerch
Member

jerch commented Jun 28, 2016

Maybe this is caused by the ugly construction rules in Unicode --> http://unicode.org/faq/indic.html#17 (see Q: I cannot find the "half forms" of Devanagari letters (or any other Indic script) in the Unicode code charts. These characters are needed to form words such as "patni".)
Neither wcwidth nor the actual cursor handling (the cursor is marked with a span in xterm.js) can handle those special cases at the moment. I am not even sure if there is any monospaced font that prints those characters correctly; xterm gives me just a blank line. Maybe a typographer from India with some terminal addiction is needed for clarification.

For comparison I've added xterm's wcwidth implementation below. It differs in some combining characters but also lacks handling of those higher-order rules of the Unicode spec. I fear they never got adapted to a terminal environment at all.

int mk_wcwidth(wchar_t ucs)
{
  unsigned long cmp = (unsigned long) ucs;

  /* sorted list of non-overlapping intervals of non-spacing characters */
  /* generated by
   *    uniset +cat=Me +cat=Mn +cat=Cf -00AD +1160-11FF +200B c
   */
  static const struct interval combining[] = {
    { 0x0300, 0x036F }, { 0x0483, 0x0489 }, { 0x0591, 0x05BD },
    { 0x05BF, 0x05BF }, { 0x05C1, 0x05C2 }, { 0x05C4, 0x05C5 },
    { 0x05C7, 0x05C7 }, { 0x0600, 0x0604 }, { 0x0610, 0x061A },
    { 0x064B, 0x065F }, { 0x0670, 0x0670 }, { 0x06D6, 0x06DD },
    { 0x06DF, 0x06E4 }, { 0x06E7, 0x06E8 }, { 0x06EA, 0x06ED },
    { 0x070F, 0x070F }, { 0x0711, 0x0711 }, { 0x0730, 0x074A },
    { 0x07A6, 0x07B0 }, { 0x07EB, 0x07F3 }, { 0x0816, 0x0819 },
    { 0x081B, 0x0823 }, { 0x0825, 0x0827 }, { 0x0829, 0x082D },
    { 0x0859, 0x085B }, { 0x08E4, 0x08FE }, { 0x0900, 0x0902 },
    { 0x093A, 0x093A }, { 0x093C, 0x093C }, { 0x0941, 0x0948 },
    { 0x094D, 0x094D }, { 0x0951, 0x0957 }, { 0x0962, 0x0963 },
    { 0x0981, 0x0981 }, { 0x09BC, 0x09BC }, { 0x09C1, 0x09C4 },
    { 0x09CD, 0x09CD }, { 0x09E2, 0x09E3 }, { 0x0A01, 0x0A02 },
    { 0x0A3C, 0x0A3C }, { 0x0A41, 0x0A42 }, { 0x0A47, 0x0A48 },
    { 0x0A4B, 0x0A4D }, { 0x0A51, 0x0A51 }, { 0x0A70, 0x0A71 },
    { 0x0A75, 0x0A75 }, { 0x0A81, 0x0A82 }, { 0x0ABC, 0x0ABC },
    { 0x0AC1, 0x0AC5 }, { 0x0AC7, 0x0AC8 }, { 0x0ACD, 0x0ACD },
    { 0x0AE2, 0x0AE3 }, { 0x0B01, 0x0B01 }, { 0x0B3C, 0x0B3C },
    { 0x0B3F, 0x0B3F }, { 0x0B41, 0x0B44 }, { 0x0B4D, 0x0B4D },
    { 0x0B56, 0x0B56 }, { 0x0B62, 0x0B63 }, { 0x0B82, 0x0B82 },
    { 0x0BC0, 0x0BC0 }, { 0x0BCD, 0x0BCD }, { 0x0C3E, 0x0C40 },
    { 0x0C46, 0x0C48 }, { 0x0C4A, 0x0C4D }, { 0x0C55, 0x0C56 },
    { 0x0C62, 0x0C63 }, { 0x0CBC, 0x0CBC }, { 0x0CBF, 0x0CBF },
    { 0x0CC6, 0x0CC6 }, { 0x0CCC, 0x0CCD }, { 0x0CE2, 0x0CE3 },
    { 0x0D41, 0x0D44 }, { 0x0D4D, 0x0D4D }, { 0x0D62, 0x0D63 },
    { 0x0DCA, 0x0DCA }, { 0x0DD2, 0x0DD4 }, { 0x0DD6, 0x0DD6 },
    { 0x0E31, 0x0E31 }, { 0x0E34, 0x0E3A }, { 0x0E47, 0x0E4E },
    { 0x0EB1, 0x0EB1 }, { 0x0EB4, 0x0EB9 }, { 0x0EBB, 0x0EBC },
    { 0x0EC8, 0x0ECD }, { 0x0F18, 0x0F19 }, { 0x0F35, 0x0F35 },
    { 0x0F37, 0x0F37 }, { 0x0F39, 0x0F39 }, { 0x0F71, 0x0F7E },
    { 0x0F80, 0x0F84 }, { 0x0F86, 0x0F87 }, { 0x0F8D, 0x0F97 },
    { 0x0F99, 0x0FBC }, { 0x0FC6, 0x0FC6 }, { 0x102D, 0x1030 },
    { 0x1032, 0x1037 }, { 0x1039, 0x103A }, { 0x103D, 0x103E },
    { 0x1058, 0x1059 }, { 0x105E, 0x1060 }, { 0x1071, 0x1074 },
    { 0x1082, 0x1082 }, { 0x1085, 0x1086 }, { 0x108D, 0x108D },
    { 0x109D, 0x109D }, { 0x1160, 0x11FF }, { 0x135D, 0x135F },
    { 0x1712, 0x1714 }, { 0x1732, 0x1734 }, { 0x1752, 0x1753 },
    { 0x1772, 0x1773 }, { 0x17B4, 0x17B5 }, { 0x17B7, 0x17BD },
    { 0x17C6, 0x17C6 }, { 0x17C9, 0x17D3 }, { 0x17DD, 0x17DD },
    { 0x180B, 0x180D }, { 0x18A9, 0x18A9 }, { 0x1920, 0x1922 },
    { 0x1927, 0x1928 }, { 0x1932, 0x1932 }, { 0x1939, 0x193B },
    { 0x1A17, 0x1A18 }, { 0x1A56, 0x1A56 }, { 0x1A58, 0x1A5E },
    { 0x1A60, 0x1A60 }, { 0x1A62, 0x1A62 }, { 0x1A65, 0x1A6C },
    { 0x1A73, 0x1A7C }, { 0x1A7F, 0x1A7F }, { 0x1B00, 0x1B03 },
    { 0x1B34, 0x1B34 }, { 0x1B36, 0x1B3A }, { 0x1B3C, 0x1B3C },
    { 0x1B42, 0x1B42 }, { 0x1B6B, 0x1B73 }, { 0x1B80, 0x1B81 },
    { 0x1BA2, 0x1BA5 }, { 0x1BA8, 0x1BA9 }, { 0x1BAB, 0x1BAB },
    { 0x1BE6, 0x1BE6 }, { 0x1BE8, 0x1BE9 }, { 0x1BED, 0x1BED },
    { 0x1BEF, 0x1BF1 }, { 0x1C2C, 0x1C33 }, { 0x1C36, 0x1C37 },
    { 0x1CD0, 0x1CD2 }, { 0x1CD4, 0x1CE0 }, { 0x1CE2, 0x1CE8 },
    { 0x1CED, 0x1CED }, { 0x1CF4, 0x1CF4 }, { 0x1DC0, 0x1DE6 },
    { 0x1DFC, 0x1DFF }, { 0x200B, 0x200F }, { 0x202A, 0x202E },
    { 0x2060, 0x2064 }, { 0x206A, 0x206F }, { 0x20D0, 0x20F0 },
    { 0x2CEF, 0x2CF1 }, { 0x2D7F, 0x2D7F }, { 0x2DE0, 0x2DFF },
    { 0x302A, 0x302D }, { 0x3099, 0x309A }, { 0xA66F, 0xA672 },
    { 0xA674, 0xA67D }, { 0xA69F, 0xA69F }, { 0xA6F0, 0xA6F1 },
    { 0xA802, 0xA802 }, { 0xA806, 0xA806 }, { 0xA80B, 0xA80B },
    { 0xA825, 0xA826 }, { 0xA8C4, 0xA8C4 }, { 0xA8E0, 0xA8F1 },
    { 0xA926, 0xA92D }, { 0xA947, 0xA951 }, { 0xA980, 0xA982 },
    { 0xA9B3, 0xA9B3 }, { 0xA9B6, 0xA9B9 }, { 0xA9BC, 0xA9BC },
    { 0xAA29, 0xAA2E }, { 0xAA31, 0xAA32 }, { 0xAA35, 0xAA36 },
    { 0xAA43, 0xAA43 }, { 0xAA4C, 0xAA4C }, { 0xAAB0, 0xAAB0 },
    { 0xAAB2, 0xAAB4 }, { 0xAAB7, 0xAAB8 }, { 0xAABE, 0xAABF },
    { 0xAAC1, 0xAAC1 }, { 0xAAEC, 0xAAED }, { 0xAAF6, 0xAAF6 },
    { 0xABE5, 0xABE5 }, { 0xABE8, 0xABE8 }, { 0xABED, 0xABED },
    { 0xFB1E, 0xFB1E }, { 0xFE00, 0xFE0F }, { 0xFE20, 0xFE26 },
    { 0xFEFF, 0xFEFF }, { 0xFFF9, 0xFFFB }, { 0x101FD, 0x101FD },
    { 0x10A01, 0x10A03 }, { 0x10A05, 0x10A06 }, { 0x10A0C, 0x10A0F },
    { 0x10A38, 0x10A3A }, { 0x10A3F, 0x10A3F }, { 0x11001, 0x11001 },
    { 0x11038, 0x11046 }, { 0x11080, 0x11081 }, { 0x110B3, 0x110B6 },
    { 0x110B9, 0x110BA }, { 0x110BD, 0x110BD }, { 0x11100, 0x11102 },
    { 0x11127, 0x1112B }, { 0x1112D, 0x11134 }, { 0x11180, 0x11181 },
    { 0x111B6, 0x111BE }, { 0x116AB, 0x116AB }, { 0x116AD, 0x116AD },
    { 0x116B0, 0x116B5 }, { 0x116B7, 0x116B7 }, { 0x16F8F, 0x16F92 },
    { 0x1D167, 0x1D169 }, { 0x1D173, 0x1D182 }, { 0x1D185, 0x1D18B },
    { 0x1D1AA, 0x1D1AD }, { 0x1D242, 0x1D244 }, { 0xE0001, 0xE0001 },
    { 0xE0020, 0xE007F }, { 0xE0100, 0xE01EF }
  };

  /* test for 8-bit control characters */
  if (cmp == 0)
    return 0;
  if (cmp < 32 || (cmp >= 0x7f && cmp < 0xa0))
    return -1;

  /* binary search in table of non-spacing characters */
  if (bisearch(cmp, combining,
               (int) (sizeof(combining) / sizeof(struct interval) - 1)))
    return 0;

  /* if we arrive here, cmp is not a combining or C0/C1 control character */

  return 1 +
    (cmp >= 0x1100 &&
     (cmp <= 0x115f ||                    /* Hangul Jamo init. consonants */
      cmp == 0x2329 || cmp == 0x232a ||
      (cmp >= 0x2e80 && cmp <= 0xa4cf &&
       cmp != 0x303f) ||                  /* CJK ... Yi */
      (cmp >= 0xac00 && cmp <= 0xd7a3) || /* Hangul Syllables */
      (cmp >= 0xf900 && cmp <= 0xfaff) || /* CJK Compatibility Ideographs */
      (cmp >= 0xfe10 && cmp <= 0xfe19) || /* Vertical forms */
      (cmp >= 0xfe30 && cmp <= 0xfe6f) || /* CJK Compatibility Forms */
      (cmp >= 0xff00 && cmp <= 0xff60) || /* Fullwidth Forms */
      (cmp >= 0xffe0 && cmp <= 0xffe6) ||
      (cmp >= 0x20000 && cmp <= 0x2fffd) ||
      (cmp >= 0x30000 && cmp <= 0x3fffd)));
}
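
The listing above calls `bisearch` and uses `struct interval`, neither of which is part of the paste. The following is a minimal sketch of what such helpers could look like, assuming a plain binary search over sorted closed intervals. The struct layout, the reduced Devanagari-only table (copied from the ranges in the listing), and the helper names `cell_width`, `total_cells` and `ardhatatsama` are assumptions made for illustration, not xterm's actual code. It applies the table to the nine code points of the word from the report.

```c
#include <assert.h>
#include <stddef.h>

/* Closed interval [first, last] of code points (layout is an assumption). */
struct interval {
  unsigned long first;
  unsigned long last;
};

/* Binary search: returns 1 if ucs lies inside one of the sorted,
 * non-overlapping intervals table[0..max], otherwise 0. */
static int bisearch(unsigned long ucs, const struct interval *table, int max)
{
  int min = 0;

  if (ucs < table[0].first || ucs > table[max].last)
    return 0;
  while (max >= min) {
    int mid = min + (max - min) / 2;
    if (ucs > table[mid].last)
      min = mid + 1;
    else if (ucs < table[mid].first)
      max = mid - 1;
    else
      return 1;
  }
  return 0;
}

/* Devanagari subset of the zero-width table from the listing above. */
static const struct interval devanagari_zero_width[] = {
  { 0x0900, 0x0902 }, { 0x093A, 0x093A }, { 0x093C, 0x093C },
  { 0x0941, 0x0948 }, { 0x094D, 0x094D }, { 0x0951, 0x0957 },
  { 0x0962, 0x0963 }
};

/* Hypothetical helper: 0 cells for listed marks, 1 cell otherwise.
 * Control characters and wide (CJK) ranges are deliberately left out. */
static int cell_width(unsigned long ucs)
{
  int max = (int)(sizeof(devanagari_zero_width) / sizeof(struct interval)) - 1;
  return bisearch(ucs, devanagari_zero_width, max) ? 0 : 1;
}

/* Sum of per-code-point widths: the number of cells the grid reserves. */
static int total_cells(const unsigned long *cps, size_t n)
{
  int cells = 0;
  size_t i;

  assert(cps != NULL || n == 0);
  for (i = 0; i < n; i++)
    cells += cell_width(cps[i]);
  return cells;
}

/* The word from the report as code points:
 * A, RA, VIRAMA, DHA, TA, TA, VIRAMA, SA, MA. */
static const unsigned long ardhatatsama[] = {
  0x0905, 0x0930, 0x094D, 0x0927, 0x0924, 0x0924, 0x094D, 0x0938, 0x092E
};
```

With these definitions, `total_cells(ardhatatsama, 9)` counts 7 cells: the two viramas (U+094D) contribute zero and every other code point contributes one. A per-code-point sum like this knows nothing about how a font shapes conjuncts, so the number of cells reserved and the width actually drawn can disagree.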

@parisk
Contributor

parisk commented Jun 29, 2016

Thanks a lot for providing this information @jerch. I do not have the time to deal with this at the moment, but it will definitely be helpful when someone picks up this issue.

@jerch
Member

jerch commented Jul 8, 2016

Maybe real grapheme support will solve this http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
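
To make the grapheme idea concrete, here is a heavily simplified sketch of a single boundary rule from that report: do not break before an extending mark. The predicate `is_extend_approx` is a hypothetical stand-in that covers only U+0300..U+036F and the Devanagari non-spacing ranges from the wcwidth table; it is not the real Grapheme_Cluster_Break property, and the other rules (Hangul, spacing marks, ZWJ sequences and so on) are ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Rough approximation of "extending" code points (assumption):
 * generic combining diacritics plus Devanagari non-spacing marks. */
static int is_extend_approx(unsigned long cp)
{
  return (cp >= 0x0300 && cp <= 0x036F) ||
         (cp >= 0x0900 && cp <= 0x0902) ||
         cp == 0x093A || cp == 0x093C ||
         (cp >= 0x0941 && cp <= 0x0948) ||
         cp == 0x094D ||
         (cp >= 0x0951 && cp <= 0x0957) ||
         (cp >= 0x0962 && cp <= 0x0963);
}

/* Stores the start index of each cluster in starts[] (up to cap entries)
 * and returns the total number of clusters found. */
static size_t cluster_starts(const unsigned long *cps, size_t n,
                             size_t *starts, size_t cap)
{
  size_t count = 0;
  size_t i;

  assert(cps != NULL || n == 0);
  for (i = 0; i < n; i++) {
    /* A cluster starts at index 0 and before every non-extending
     * code point; extending marks stay attached to their base. */
    if (i == 0 || !is_extend_approx(cps[i])) {
      if (count < cap)
        starts[count] = i;
      count++;
    }
  }
  return count;
}

/* The word from the report as code points:
 * A, RA, VIRAMA, DHA, TA, TA, VIRAMA, SA, MA. */
static const unsigned long ardhatatsama[] = {
  0x0905, 0x0930, 0x094D, 0x0927, 0x0924, 0x0924, 0x094D, 0x0938, 0x092E
};
```

For the nine code points of the reported word this yields 7 clusters starting at indices 0, 1, 3, 4, 5, 7 and 8. Each virama is glued to the consonant before it, but the sketch still treats RA+VIRAMA and the following DHA as separate clusters, even though a font may draw them as a single conjunct.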

@Tyriar
Member Author

Tyriar commented Jul 8, 2016

This should probably be left for someone familiar with the IME; perhaps it's behaving as expected?

@parisk
Contributor

parisk commented Jul 8, 2016

@jerch should this be part of the renderer or the font? I don't have much experience with international characters, so my questions might be quite basic.

@jerch
Member

jerch commented Jul 8, 2016

@parisk Kind of both. A font will have some of those rules implemented to do the combined drawing and ligature stuff, which results in so-called user-perceived characters. This is the final composed "sign" a user would recognize as one character.
On the other hand, a renderer should handle this accordingly; for example, the cursor should "see" it as one character by default (no jumps between the code points). This raises the question of how to actually insert/edit/break those characters into pieces, hence the different IME approaches. This is a big problem for pretty much all editors/word processors and is always only partly solved; Unicode is just broken at this point. (We had a customer dealing with ancient Indo-European languages: the components for the weird chars are all in the Unicode spec, but there is practically no tool to input them correctly and only one font capable of drawing 70% of them.)
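
As an illustration of a cursor that does not stop between a base character and its marks, here is a small sketch. The functions `cursor_right` and `cursor_left` and the predicate `is_mark_approx` are hypothetical names, positions are indices into an array of code points, and the mark test covers only a few ranges; none of this reflects how xterm.js actually moves its cursor.

```c
#include <assert.h>
#include <stddef.h>

/* Rough approximation of zero-width marks (assumption): generic
 * combining diacritics plus Devanagari non-spacing marks. */
static int is_mark_approx(unsigned long cp)
{
  return (cp >= 0x0300 && cp <= 0x036F) ||
         (cp >= 0x0900 && cp <= 0x0902) ||
         cp == 0x093A || cp == 0x093C ||
         (cp >= 0x0941 && cp <= 0x0948) ||
         cp == 0x094D ||
         (cp >= 0x0951 && cp <= 0x0957) ||
         (cp >= 0x0962 && cp <= 0x0963);
}

/* Moves one user-visible step to the right: past the current base
 * and past every mark attached to it. Returns n at the end. */
static size_t cursor_right(const unsigned long *cps, size_t n, size_t pos)
{
  assert(cps != NULL || n == 0);
  if (pos >= n)
    return n;
  pos++;
  while (pos < n && is_mark_approx(cps[pos]))
    pos++;
  return pos;
}

/* Moves one user-visible step to the left: back over any marks until
 * the base they belong to is reached. Returns 0 at the start. */
static size_t cursor_left(const unsigned long *cps, size_t n, size_t pos)
{
  assert(cps != NULL || n == 0);
  if (pos > n)
    pos = n;
  if (pos == 0)
    return 0;
  pos--;
  while (pos > 0 && is_mark_approx(cps[pos]))
    pos--;
  return pos;
}

/* The word from the report as code points:
 * A, RA, VIRAMA, DHA, TA, TA, VIRAMA, SA, MA. */
static const unsigned long ardhatatsama[] = {
  0x0905, 0x0930, 0x094D, 0x0927, 0x0924, 0x0924, 0x094D, 0x0938, 0x092E
};
```

Starting on RA (index 1), `cursor_right` lands on DHA (index 3) rather than on the virama at index 2, and `cursor_left` from index 3 goes back to index 1. Whether RA+VIRAMA+DHA should instead be a single stop is exactly the kind of script-specific question raised in this thread.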

A terminal with its monospace environment makes this even more complicated: it all has to fit somehow into a cell grid. For example, this is perfectly legal in Unicode (example from here):

'Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞'

It will break all terminal emulators out there: cursor in the wrong position, overdrawn rubbish, wrong cell widths and heights. It is also a good example to play with different fonts in a word processor; the output will be quite different.
Here are some thoughts about JavaScript and Unicode. Somewhat disheartening.

@parisk
Contributor

parisk commented Jul 11, 2016

I am certain that there is no solution that fits them all. I believe that the most important part is to decide which Unicode features xterm.js supports and not put any effort into anything beyond these.

Not sure how big each segment is at the moment though (CJK, Hindi, ancient Indo-European etc.), so it is hard to decide which ones have the most impact.

@Tyriar
Member Author

Tyriar commented Sep 17, 2016

Closing since I did the IME work recently and none of us knows enough about Hindi to verify correct behavior 😄
