
surrogate pair handling is slightly off #1804

Closed
jerch opened this issue Nov 30, 2018 · 3 comments
Labels
type/bug Something is misbehaving

@jerch
Member

jerch commented Nov 30, 2018

The surrogate pair handling in InputHandler is slightly off. Currently we preserve any last char in InputHandler.print that counts as a first (high) surrogate and re-add it to the next data chunk. Since InputHandler.print gets called for every printable slice of chars within a single chunk of data, this can wrongly skip printable slices within the chunk and move the surrogate char to the next chunk even though its counterpart is in the same chunk.

Solution:
We should only check for surrogates at the real chunk end and re-add the char to the next chunk.
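A sketch of that boundary check (illustrative only, not the actual xterm.js code; the `ChunkDecoder` name is made up here): a lone trailing high surrogate is held back and prepended to the next chunk, so pair matching only ever happens at real chunk borders.

```javascript
// Hedged sketch: carry a trailing high surrogate from one data chunk
// over to the next, so surrogate matching happens at chunk borders only.
class ChunkDecoder {
  constructor() {
    this._pendingHighSurrogate = '';
  }
  // Returns the printable text for this chunk; a trailing lone high
  // surrogate is held back and prepended to the next chunk.
  decodeChunk(chunk) {
    let data = this._pendingHighSurrogate + chunk;
    this._pendingHighSurrogate = '';
    const last = data.charCodeAt(data.length - 1);
    // High surrogates occupy 0xD800-0xDBFF; if the chunk ends on one,
    // its low counterpart must be at the start of the next chunk.
    if (last >= 0xd800 && last <= 0xdbff) {
      this._pendingHighSurrogate = data[data.length - 1];
      data = data.slice(0, -1);
    }
    return data;
  }
}
```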

Why do we care at all? Isn't it guaranteed that we only get valid unicode codepoints? - No. A few notes on the background:
JS strings are represented as UCS2 with an optional UTF16 surrogate pair extension. This means the browser will correctly decode to a high-plane unicode codepoint if two 16bit charCodes form a valid surrogate pair (UTF16 handling), but it will also render something when a single char in the surrogate charCode range appears on its own (UCS2 handling).
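A quick illustration of that behavior in plain JS (just demonstrating standard string semantics, nothing xterm.js specific): a valid pair decodes to one high-plane codepoint, while a lone surrogate is still a "legal" one-char string.

```javascript
// U+1F600 (grinning face) stored as a UTF-16 surrogate pair.
const pair = '\ud83d\ude00';
console.log(pair.length);          // 2 -- length counts charCodes, not codepoints
console.log(pair.codePointAt(0));  // 128512 (0x1F600) -- the pair is combined

// Just the high half on its own -- JS happily keeps it (UCS2 handling).
const lone = '\ud83d';
console.log(lone.length);          // 1
console.log(lone.codePointAt(0));  // 55357 (0xD83D) -- no pairing is done
```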
A terminal emulator typically talks to a program running behind a pseudoterminal. Nowadays all pseudoterminals default to UTF8 as transport encoding, even Windows since one of the last rolling releases (well, that's only partially true - pseudoterminals do not care much about encodings at all and rather pump raw bytes that happen to be UTF8 encoded, which can lead to weird mixes of UTF8 and raw bytes). node-pty decodes the UTF8 byte sequences into a JS string that gets forwarded to the emulator. On the way to the emulator the data has to go through several buffers with several rounds of back and forth encoding. Most transport layers only care about byte sequences; since our emulator's data entry point is a JS string, Javascript guarantees that it forms valid UCS2 charCodes, but does not "wait" for valid UTF16 surrogate pairs. Thus a surrogate pair can span 2 chunks of data.
Because the source data is UTF8, we can safely assume UTF16 in the emulator, but we have to do the surrogate pair matching explicitly at the chunk borders since JS does not do it for us.

Btw, the same applies to UTF8 input - if we decide to go full UTF8, we have to handle partially transferred codepoints as well.
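The analogous UTF8 case could look roughly like this (a hedged sketch under the assumption of Node's `Buffer`; the `Utf8ChunkJoiner` name is invented for illustration): a multibyte sequence can be split across chunks, so incomplete trailing bytes get buffered until the rest arrives.

```javascript
// Hedged sketch: hold back an incomplete trailing UTF-8 sequence so the
// consumer only ever sees whole codepoints.
class Utf8ChunkJoiner {
  constructor() {
    this._pending = Buffer.alloc(0);
  }
  // Returns complete UTF-8 bytes; a trailing partial sequence is kept back.
  push(bytes) {
    const data = Buffer.concat([this._pending, bytes]);
    let cut = data.length;
    // Walk back over continuation bytes (10xxxxxx) to find the last lead byte.
    let i = data.length - 1;
    while (i >= 0 && (data[i] & 0xc0) === 0x80) i--;
    if (i >= 0) {
      const lead = data[i];
      // Expected sequence length derived from the lead byte.
      const need = lead >= 0xf0 ? 4 : lead >= 0xe0 ? 3 : lead >= 0xc0 ? 2 : 1;
      if (i + need > data.length) cut = i; // sequence still incomplete
    }
    this._pending = data.slice(cut);
    return data.slice(0, cut);
  }
}
```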

@jerch jerch added the type/bug Something is misbehaving label Nov 30, 2018
@jerch jerch self-assigned this Nov 30, 2018
@egmontkob

pseudoterminals do not care much about encodings at all and rather pump raw bytes

The kernel line discipline does a couple of things to the data, and expects it to be ASCII-compatible. UTF-16 cannot be used here. See here. (No clue about Windows.)

Btw same applies to UTF8 input, if we decide to go full UTF8 we have to handle partially transferred codepoints as well.

I guess you already do this when handling data arriving from the apps, don't you? See e.g. https://bugzilla.gnome.org/show_bug.cgi?id=154896 (one of my earliest contributions to vte) or https://phab.enlightenment.org/T655 (a still unfixed one) – quite a few terminal emulators go through this bug during their early stage of development :)

@jerch
Member Author

jerch commented Nov 30, 2018

The kernel line discipline does a couple of things to the data, and expects it to be ASCII-compatible. UTF-16 cannot be used here. See here. (No clue about Windows.)

Yes, I am aware of that - that's why we currently handle the pty data as UTF8 encoded (which gets penalized by the JS string conversion, as it converts a valid ASCII char into 2 bytes). Windows switched to UTF8 with their new conpty implementation, it seems; before that it was either some 8bit encoding (like in old unix systems) or UTF16 (what they call "ANSI encoding").

About UTF8 in the emulator - nope, we don't care for UTF8 at all atm; we currently leave the half-transmitted problem to nodejs (in node-pty) or the JS engine (when it arrives as a JS string in the browser engine). The emulator currently parses UTF16 (since that's the closest thing to JS strings) and transcodes the chars to UTF32 for the terminal buffer. With the new typed array buffer we can now decide which encoding to go with internally, and I think we can save some back and forth encoding by supporting UTF8 byte sequences as data entry. (Still undecided, see #791 (comment)).
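The UTF16 → UTF32 transcoding step mentioned above boils down to standard surrogate pair arithmetic; a minimal sketch (illustrative only, the function name is made up and this is not the xterm.js implementation):

```javascript
// Combine UTF-16 surrogate pairs into single 32-bit codepoints, as a
// terminal buffer storing UTF-32 would need.
function utf16ToCodepoints(str) {
  const out = [];
  for (let i = 0; i < str.length; i++) {
    const c = str.charCodeAt(i);
    if (c >= 0xd800 && c <= 0xdbff && i + 1 < str.length) {
      const d = str.charCodeAt(i + 1);
      if (d >= 0xdc00 && d <= 0xdfff) {
        // codepoint = 0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00)
        out.push(0x10000 + ((c - 0xd800) << 10) + (d - 0xdc00));
        i++;
        continue;
      }
    }
    out.push(c); // BMP char (or lone surrogate) passed through as-is
  }
  return out;
}
```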

Edit: I am pretty sure we don't want to go full UTF8 for the internal representation, as we would inherit the multibyte problem all over the place, which complicates string handling a lot (I've started a Rust clone of the parser + input handler; even there, with Rust's native UTF8 strings, it performs worse than UTF16). UTF16 might be worth a shot for the internal buffer, as surrogate pairs are quite rare. But this is kinda offtopic here...

@jerch
Member Author

jerch commented Jul 2, 2019

Closing the issue as it got resolved automatically by the UTF32 decoding.

@jerch jerch closed this as completed Jul 2, 2019