COOKED_READ doesn't return UTF-8 on *A APIs in CP_UTF8 #4551

amyw-msft · 2020-02-12T18:57:15Z

Environment

Microsoft Windows [Version 10.0.18363.592]

Impact

This issue is affecting reading console input via the Universal C Runtime as well - _read, getchar, fread, scanf, etc. Using _cgets_s only works around this issue because it uses ReadConsoleW instead of ReadFile. This is also reported against the UCRT on Developer Community here: _read() cannot read UTF-8 but _cgets_s() can.

Steps to reproduce

When using ReadFile to read from a console handle, UTF-8 input is not correctly returned. Using ReadFile on other types of handles (files, pipes) can read UTF-8 without issue. SetConsoleCP and SetConsoleOutputCP do not appear to affect this behavior.

C:\Users\stwish\source\read_utf8>type win32_test.cpp
#include <Windows.h>
#include <stdio.h>

int main()
{
    SetConsoleCP(65001);
    SetConsoleOutputCP(65001);
    const HANDLE console_stdin = GetStdHandle(STD_INPUT_HANDLE);

    const size_t buf_count = 20;
    char buffer[buf_count]{};

    DWORD num_read;

    BOOL result = ReadFile(
        console_stdin,
        buffer,
        buf_count,
        &num_read,
        nullptr
        );

    printf("ReadFile returned '%d'\n", result);
    for (int i = 0; i < 20; i++)
    {
        printf("%02x ", (unsigned char)buffer[i]);
    }

    return 0;
}

C:\Users\stwish\source\read_utf8>cl /nologo /EHsc /MT win32_test.cpp /Zi
win32_test.cpp

C:\Users\stwish\source\read_utf8>win32_test.exe
我是中文字符
ReadFile returned '1'
00 00 00 00 00 00 0d 0a 00 00 00 00 00 00 00 00 00 00 00 00

C:\Users\stwish\source\read_utf8>echo 我是中文字符 | win32_test.exe
ReadFile returned '1'
e6 88 91 e6 98 af e4 b8 ad e6 96 87 e5 ad 97 e7 ac a6 20 0d

C:\Users\stwish\source\read_utf8>type input.txt
我是中文字符

C:\Users\stwish\source\read_utf8>type input.txt | win32_test.exe
ReadFile returned '1'
e6 88 91 e6 98 af e4 b8 ad e6 96 87 e5 ad 97 e7 ac a6 00 00

Expected behavior

Running win32_test.exe and entering '我是中文字符' input on the console should return e6 88 91 e6 98 af e4 b8 ad e6 96 87 e5 ad 97 e7 ac a6 0d 0a as this is the UTF-8 representation of that string, plus CR LF.

Actual behavior

Running win32_test.exe and entering '我是中文字符' input on the console will return 6 null characters and CR LF, but still returns that the read operation was successful.

The text was updated successfully, but these errors were encountered:

eryksun · 2020-02-13T00:30:22Z

ReadFile and ReadConsoleA are currently limited to 7-bit ASCII when the input codepage is UTF-8 (65001) due to an assumption of 1 CHAR per WCHAR when calling WideCharToMultiByte.

A ordinal in the range [0x000000, 0x00FFFF], i.e. the Basic Multilingual Plane (BMP), uses a single WCHAR value. UTF-8 uses 1 byte per ASCII ordinal in the range [0x000000, 0x00007F]. UTF-8 uses 2-3 bytes per non-ASCII ordinal in the range [0x000080, 0x00FFFF]. A non-BMP ordinal in the range [0x010000, 0x10FFFF] uses two WCHAR values to store a UTF-16 surrogate pair. UTF-8 uses 4 bytes per non-BMP ordinal. (The maximum Unicode ordinal is capped by design at 0x10FFFF for compatibility with UTF-16, which uses a reserved 10-bit range in a pair of 16-bit codes to support a 20-bit space with an additional 16 supplementary planes, e.g. [0x010000, 0x01FFFF], [0x020000, 0x02FFFF], and so on up to [0x100000, 0x10FFFF].)

For non-ASCII ordinals, the internal WideCharToMultiByte call fails, and the initial null byte value is used. In Windows 10, with the new console enabled, we get 1 null byte in the result per WCHAR because it encodes one code at a time. For non-BMP ordinals, given the assumption of one CHAR per WCHAR and encoding one code at a time (i.e. naive handling of surrogate pairs), we should get two null bytes per non-BMP ordinal. However, with the new console I can't even get wide-character ReadConsoleW to work with non-BMP ordinals. It translates a UTF-16 surrogate pair to the replacement character (0x00FFFD). ReadConsoleW works fine with non-BMP ordinals in the legacy console, so some change in the new console has broken UTF-16 support in cooked reads.

For ReadFile and ReadConsoleA in older versions of Windows, or if we enable the legacy console in Windows 10, using UTF-8 as the input codepage causes a 'successful' read of 0 bytes if the read contains even one non-ASCII ordinal. Many programs interpret a successful read of 0 bytes as EOF.

DHowett-MSFT · 2020-02-13T08:04:29Z

@eryksun thanks for the detailed write-up, and @stwish-msft thanks for the report. It looks like we don't actually have a bug tracking COOKED_READ not supporting UTF-8, but we do know about it. Perhaps it got lost in the transition our of Azure DevOps?

Regardless, this is now the one.

Eryk, would you mind filing a separate issue for the non-BMP ReadConsoleW bug? That one is likely more readily fixable than this one, and since it's a regression from legacy (and I've got a copy of every shipped version of conhost, so we can find out exactly when :P) it's pretty important.

DHowett-MSFT · 2020-02-13T08:06:18Z

I'm giving this one the unusual denomination of "bugtask". We have a couple of them -- it's a bug, yes, but it's a fairly big chunk of work and a new feature to boot. 😄

clinton-r · 2020-12-22T06:02:32Z

Hi, I'm assuming that by "cooked" you mean that the following are enabled in the console mode on stdin:
ENABLE_ECHO_INPUT
ENABLE_LINE_INPUT
ENABLE_PROCESSED_INPUT
Is that right?

For me, the problem occurs even with those disabled.

Here is my test code:

#include <stdio.h>
#include <windows.h>

int main(void)
{
    // Set UTF-8 code page for input and output
    if (!SetConsoleOutputCP(65001))
        return 1;

    if (!SetConsoleCP(65001))
        return 2;

    printf("Output code page is %d, input code page is %d\n",
        (int)GetConsoleOutputCP(), (int)GetConsoleCP());

    puts(u8"We can output utf-8: γατάκι");

    HANDLE hStdin = GetStdHandle(STD_INPUT_HANDLE);
    if (hStdin == INVALID_HANDLE_VALUE || hStdin == NULL)
        return 3;

    // Set input mode
    DWORD mode;
    if (!GetConsoleMode(hStdin, &mode))
        return 4;
    mode &= ~(ENABLE_ECHO_INPUT | ENABLE_LINE_INPUT | ENABLE_PROCESSED_INPUT);
    if (!SetConsoleMode(hStdin, mode))
        return 5;
    if (!GetConsoleMode(hStdin, &mode))
        return 6;
    printf("Console mode for stdin is 0x%08x\n", (int)mode);

    puts("Input is now in 'raw' mode.  Type something.");

    char b;
    do {
        DWORD numRead;
        if (!ReadFile(hStdin, &b, 1, &numRead, NULL) || numRead == 0)
            return 7;
        printf("%02x ", (int)b & 0x0ff);
    } while (b != 'q');

    return 0;
}

If I run this Windows Terminal running cmd.exe and paste the text "I8Σπ q" when prompted, the output looks like this:

C:\code\c\vt_experiments>test
Output code page is 65001, input code page is 65001
We can output utf-8: γατάκι
Console mode for stdin is 0x000001f0
Input is now in 'raw' mode.  Type something.
49 38 00 00 20 71
C:\code\c\vt_experiments>

You can see that Σ and π are read as zeros.

Conhost running cmd.exe is the same except that the reported console mode is 0x1b0.

Some of the set flags in the console modes 0x1f0 or 0x1b0 seem to be undocumented, or am I reading them wrong? Console mode flags reference: https://docs.microsoft.com/en-us/windows/console/high-level-console-modes

Instead of ReadFile() I've also tried getchar(), scanf_s(), fgetc() - none of them worked either.

I'm new to console programming, so sorry if I've overlooked something that should be obvious.

Thanks!

Microsoft Visual Studio Community 2019 Version 16.8.3
Microsoft Windows [Version 10.0.19041.685]
Windows Terminal Version: 1.4.3243.0

Due to a bug in Windows, ReadFile() and ReadConsoleA() (and thus _read()), return zeros instead of non-ASCII characters when the console codepage is set to 65001. See this ticket for more details: microsoft/terminal#4551 This commit works around that bug by using ReadConsoleW() inside win32_read() when the passed fd points to the console and the console codepage is set to 65001. Fixes Perl#18701

Due to a bug in Windows, ReadFile() and ReadConsoleA() (and thus _read()), return zeros instead of non-ASCII characters when the console codepage is set to 65001. See this ticket for more details: microsoft/terminal#4551 This commit works around that bug by using ReadConsoleW() inside win32_read() when the passed fd points to the console and the console codepage is set to 65001. Fixes #18701

Due to a bug in Windows, ReadFile() and ReadConsoleA() (and thus _read()), return zeros instead of non-ASCII characters when the console codepage is set to 65001. See this ticket for more details: microsoft/terminal#4551 This commit works around that bug by using ReadConsoleW() inside win32_read() when the passed fd points to the console and the console codepage is set to 65001. Fixes Perl#18701

…h UTF8 codepage Corresponding Windows bug microsoft/terminal#4551 Use ReadConsoleW instead and convert to console's input codepage, to workaround. Also, disable VT sequences in the console output, as we do not knows what type of data comes with SELECT, we do not want VT escapes there. Remove my_cgets()

… UTF8 codepage Corresponding Windows bug microsoft/terminal#4551 Use ReadConsoleW instead and convert to console's input codepage, to workaround. Also, disable VT sequences in the console output, as we do not knows what type of data comes with SELECT, we do not want VT escapes there. Remove my_cgets()

The overarching intention of this PR is to improve our Unicode support. Most of our APIs still don't support anything beyond UCS-2 and DBCS sequences. This commit doesn't fix the UTF-16 support (by supporting surrogate pairs), but it does improve support for UTF-8 by allowing longer `char` sequences. It does so by removing `TranslateUnicodeToOem` which seems to have had an almost viral effect on code quality wherever it was used. It made the assumption that _all_ narrow glyphs encode to 1 `char` and most wide glyphs to 2 `char`s. It also didn't bother to check whether `WideCharToMultiByte` failed or returned a different amount of `char`s. So up until now it was easily possible to read uninitialized stack memory from conhost. Any code that used this function was forced to do the same "measurement" of narrow/wide glyphs, because _of course_ it didn't had any way to indicate to the caller how much memory it needs to store the result. Instead all callers were forced to sorta replicate how it worked to calculate the required storage ahead of time. Unsurprisingly, none of the callers used the same algorithm... Without it the code is much leaner and easier to understand now. The best example is `COOKED_READ_DATA::_handlePostCharInputLoop` which used to contain 3 blocks of _almost_ identical code, but with ever so subtle differences. After reading the old code for hours I still don't know if they were relevant or not. It used to be 200 lines of code lacking any documentation and it's now 50 lines with descriptive function names. I hope this doesn't break anything, but to be honest I can't imagine anyone having relied on this mess in the first place. I needed some helpers to handle byte slices (`std::span<char>`), which is why a new `til/bytes.h` header was added. Initially I wrote a `buf_writer` class but felt like such a wrapper around a slice/span was annoying to use. As such I've opted for freestanding functions which take slices as mutable references and "advance" them (offset the start) whenever they're read from or written to. I'm not particularly happy with the design but they do the job. Related to #8000 Fixes #4551 Fixes #7589 Fixes #8663 ## Validation Steps Performed * Unit and feature tests ✅ * Far Manager ✅ * Fixes test cases in #4551, #7589 and #8663 ✅

ghost added Needs-Triage It's a new issue that the core contributor team needs to triage at the next triage meeting Needs-Tag-Fix Doesn't match tag requirements labels Feb 12, 2020

DHowett-MSFT changed the title ~~ReadFile on stdin fails to read UTF-8~~ COOKED_READ doesn't return UTF-8 on *A APIs in CP_UTF8 Feb 13, 2020

ghost removed the Needs-Tag-Fix Doesn't match tag requirements label Feb 13, 2020

DHowett-MSFT added this to the Console Backlog milestone Feb 13, 2020

jdebp mentioned this issue Apr 28, 2020

Emoji input doesn't input anything #5617

Closed

wez mentioned this issue May 22, 2020

Broken Umlauts and other non-ascii chars in output wez/EleDo#1

Closed

DHowett mentioned this issue Oct 6, 2020

2009 update MicrosoftDocs/Console-Docs#162

Merged

krwq mentioned this issue Oct 13, 2020

Console UTF-8 input is misbehaving on Windows dotnet/runtime#43295

Closed

miniksa mentioned this issue Oct 21, 2020

[Epic] COOKED READ and character conversion fun with all our different Convert* functions #7777

Closed

xenu mentioned this issue Apr 9, 2021

Reading non-ASCII characters from the console on Windows doesn't work when codepage is 65001 Perl/perl5#18701

Closed

xenu mentioned this issue Apr 9, 2021

win32.c: make reading UTF-8 characters from the console possible Perl/perl5#18702

Merged

imba-tjd mentioned this issue Jun 28, 2021

Non-ASCII char support for Windows terminals microsoft/vscode-java-debug#622

Closed

DHowett mentioned this issue Jul 2, 2021

C setlocale() not setting codepage #8630

Closed

ianjoneill mentioned this issue Sep 7, 2021

Java terminal app doesn't scan Cyrillic script. #11168

Closed

This comment has been minimized.

Sign in to view

alexrp mentioned this issue Jan 2, 2022

Switch to UTF-8 input on Windows vezel-dev/cathode#68

Closed

zadjii-msft modified the milestones: Console Backlog, 22H2 Jan 4, 2022

zadjii-msft added the Area-CookedRead The cmd.exe COOKED_READ handling label Feb 23, 2022

DHowett mentioned this issue Jan 27, 2023

Remove TranslateUnicodeToOem and all related code #14745

Merged

ghost added the In-PR This issue has a related PR label Jan 31, 2023

DHowett closed this as completed in #14745 Feb 28, 2023

microsoft-github-policy-service bot added the Needs-Tag-Fix Doesn't match tag requirements label Feb 28, 2023

dfyz mentioned this issue Apr 19, 2023

[User] Interactive mode immediately exits on Windows with Zig ggerganov/llama.cpp#1010

Closed

4 tasks

lhecker mentioned this issue Sep 25, 2023

ReadConsole does not work with utf-8 codepage #16020

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

COOKED_READ doesn't return UTF-8 on *A APIs in CP_UTF8 #4551

COOKED_READ doesn't return UTF-8 on *A APIs in CP_UTF8 #4551

amyw-msft commented Feb 12, 2020

eryksun commented Feb 13, 2020

DHowett-MSFT commented Feb 13, 2020

DHowett-MSFT commented Feb 13, 2020

clinton-r commented Dec 22, 2020

This comment has been minimized.

COOKED_READ doesn't return UTF-8 on *A APIs in CP_UTF8 #4551

COOKED_READ doesn't return UTF-8 on *A APIs in CP_UTF8 #4551

Comments

amyw-msft commented Feb 12, 2020

Environment

Impact

Steps to reproduce

Expected behavior

Actual behavior

eryksun commented Feb 13, 2020

DHowett-MSFT commented Feb 13, 2020

DHowett-MSFT commented Feb 13, 2020

clinton-r commented Dec 22, 2020

This comment has been minimized.