Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch to UTF-8 input on Windows #68

Closed
alexrp opened this issue Jan 2, 2022 · 4 comments
Closed

Switch to UTF-8 input on Windows #68

alexrp opened this issue Jan 2, 2022 · 4 comments
Labels
area: drivers Issues related to the terminal drivers. os: windows Issues that are specific to Windows (10, 11, etc). type: housekeeping Issues that are classified as cleanup or refactoring work.
Milestone

Comments

@alexrp
Copy link
Sponsor Member

alexrp commented Jan 2, 2022

Get rid of this whole workaround:

// The Windows console host is eventually going to support UTF-8 input via the ReadFile function. Sadly, this
// does not work today; non-ASCII characters just turn into NULs. This means that we have to use the
// ReadConsoleW function for interactive input and ReadFile for redirected input. This complicates the
// interactive case considerably since ReadConsoleW operates in terms of UTF-16 code units while the API we
// offer operates in terms of raw bytes.
//
// To solve this problem, we read one or two UTF-16 code units to form a complete code point. We then encode
// that into UTF-8 in a separate buffer. Finally, we copy as many bytes as possible/requested from the UTF-8
// buffer to the caller-provided buffer.
using (_semaphore.Enter(cancellationToken))
{
if (_buffered.IsEmpty)
{
var units = (stackalloc char[2]);
var chars = 0;
fixed (char* p = &MemoryMarshal.GetReference(units))
{
bool ret;
var read = 0u;
while ((ret = ReadConsoleW(Handle, p, 1, out read, null)) &&
Marshal.GetLastPInvokeError() == (int)WIN32_ERROR.ERROR_OPERATION_ABORTED &&
read == 0)
{
// Retry in case we get interrupted by a signal.
}
if (!ret)
WindowsTerminalUtility.ThrowIfUnexpected($"Could not read from {Name}");
if (read == 0)
return 0;
// There is a bug where ReadConsoleW will not process Ctrl-Z properly even though ReadFile will. The
// good news is that we can fairly easily emulate what the console host should be doing by just
// pretending that there is no more data to be read.
if (!Terminal.IsRawMode && *p == '\x1a')
return 0;
chars++;
// If we got a high surrogate, we expect to instantly see a low surrogate following it. In really
// bizarre situations (e.g. broken WriteConsoleInput calls), this might not be the case though; in
// such a case, we will just let UTF8Encoding encode the lone high surrogate into a replacement
// character (U+FFFD).
//
// It is not really clear whether this is the right thing to do. A case could easily be made for
// passing the lone surrogate through unmodified or simply discarding it...
if (char.IsHighSurrogate(*p))
{
while ((ret = ReadConsoleW(Handle, p + 1, 1, out read, null)) &&
Marshal.GetLastPInvokeError() == (int)WIN32_ERROR.ERROR_OPERATION_ABORTED &&
read == 0)
{
// Retry in case we get interrupted by a signal.
}
if (read != 0)
chars++;
else if (!ret)
WindowsTerminalUtility.ThrowIfUnexpected($"Could not read from {Name}");
}
// Encode the UTF-16 code unit(s) into UTF-8 and grab a slice of the buffer corresponding to just
// the portion used.
_buffered = _buffer.AsMemory(..Cathode.Terminal.Encoding.GetBytes(units[..chars], _buffer));
}
}
// Now that we have some UTF-8 text buffered up, we can copy it over to the buffer provided by the caller
// and adjust our UTF-8 buffer accordingly. Be careful not to overrun either buffer.
var copied = Math.Min(_buffered.Length, buffer.Length);
_buffered.Span[..copied].CopyTo(buffer[..copied]);
_buffered = _buffered[copied..];
return copied;
}

There should be no functional change; this is just cleanup. Maybe some slight performance gains at best.

@alexrp alexrp added os: windows Issues that are specific to Windows (10, 11, etc). type: housekeeping Issues that are classified as cleanup or refactoring work. state: blocked Issues that are blocked on some other issue/work. area: drivers Issues related to the terminal drivers. labels Jan 2, 2022
@alexrp alexrp added this to the v2.0 milestone Jan 2, 2022
@alexrp alexrp self-assigned this Jan 2, 2022
@alexrp
Copy link
Sponsor Member Author

alexrp commented Feb 15, 2023

Looks like there is interesting progress happening on this: microsoft/terminal#14745

@alexrp
Copy link
Sponsor Member Author

alexrp commented Dec 12, 2023

Note to self: Need to investigate how well UTF-8 input works on the latest Windows Terminal version. If it mostly Just Works, then we should switch to using it (and it'll be yet another strong argument for #102).

@alexrp
Copy link
Sponsor Member Author

alexrp commented Dec 14, 2023

Need to investigate how well UTF-8 input works on the latest Windows Terminal version.

Quite well, as it turns out! 🏴‍☠️ (4 code points) roundtrips as it should both with UTF-16 input (status quo) and UTF-8 input (with the ReadConsoleW workaround removed).

However, in cmd.exe, it only roundtrips with UTF-16 input. So, this really comes down to #102.

@alexrp alexrp modified the milestones: v3.0, v1.0 Dec 14, 2023
@alexrp alexrp removed the state: blocked Issues that are blocked on some other issue/work. label Dec 15, 2023
@alexrp alexrp closed this as completed in d2262a9 Dec 15, 2023
@alexrp alexrp removed their assignment Jan 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: drivers Issues related to the terminal drivers. os: windows Issues that are specific to Windows (10, 11, etc). type: housekeeping Issues that are classified as cleanup or refactoring work.
Development

No branches or pull requests

1 participant