Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

COOKED_READ doesn't return UTF-8 on *A APIs in CP_UTF8 #4551

Closed
amyw-msft opened this issue Feb 12, 2020 · 5 comments · Fixed by #14745
Closed

COOKED_READ doesn't return UTF-8 on *A APIs in CP_UTF8 #4551

amyw-msft opened this issue Feb 12, 2020 · 5 comments · Fixed by #14745
Labels
Area-CookedRead The cmd.exe COOKED_READ handling Area-Input Related to input processing (key presses, mouse, etc.) Area-Server Down in the muck of API call servicing, interprocess communication, eventing, etc. In-PR This issue has a related PR Issue-Bug It either shouldn't be doing this or needs an investigation. Issue-Task It's a feature request, but it doesn't really need a major design. Needs-Tag-Fix Doesn't match tag requirements Priority-2 A description (P2) Product-Conhost For issues in the Console codebase
Milestone

Comments

@amyw-msft
Copy link
Member

Environment

Microsoft Windows [Version 10.0.18363.592]

Impact

This issue is affecting reading console input via the Universal C Runtime as well - _read, getchar, fread, scanf, etc. Using _cgets_s only works around this issue because it uses ReadConsoleW instead of ReadFile. This is also reported against the UCRT on Developer Community here: _read() cannot read UTF-8 but _cgets_s() can.

Steps to reproduce

When using ReadFile to read from a console handle, UTF-8 input is not correctly returned. Using ReadFile on other types of handles (files, pipes) can read UTF-8 without issue. SetConsoleCP and SetConsoleOutputCP do not appear to affect this behavior.

C:\Users\stwish\source\read_utf8>type win32_test.cpp
#include <Windows.h>
#include <stdio.h>

int main()
{
    SetConsoleCP(65001);
    SetConsoleOutputCP(65001);
    const HANDLE console_stdin = GetStdHandle(STD_INPUT_HANDLE);

    const size_t buf_count = 20;
    char buffer[buf_count]{};

    DWORD num_read;

    BOOL result = ReadFile(
        console_stdin,
        buffer,
        buf_count,
        &num_read,
        nullptr
        );

    printf("ReadFile returned '%d'\n", result);
    for (int i = 0; i < 20; i++)
    {
        printf("%02x ", (unsigned char)buffer[i]);
    }

    return 0;
}
C:\Users\stwish\source\read_utf8>cl /nologo /EHsc /MT win32_test.cpp /Zi
win32_test.cpp
C:\Users\stwish\source\read_utf8>win32_test.exe
我是中文字符
ReadFile returned '1'
00 00 00 00 00 00 0d 0a 00 00 00 00 00 00 00 00 00 00 00 00
C:\Users\stwish\source\read_utf8>echo 我是中文字符 | win32_test.exe
ReadFile returned '1'
e6 88 91 e6 98 af e4 b8 ad e6 96 87 e5 ad 97 e7 ac a6 20 0d
C:\Users\stwish\source\read_utf8>type input.txt
我是中文字符

C:\Users\stwish\source\read_utf8>type input.txt | win32_test.exe
ReadFile returned '1'
e6 88 91 e6 98 af e4 b8 ad e6 96 87 e5 ad 97 e7 ac a6 00 00

Expected behavior

Running win32_test.exe and entering '我是中文字符' input on the console should return e6 88 91 e6 98 af e4 b8 ad e6 96 87 e5 ad 97 e7 ac a6 0d 0a as this is the UTF-8 representation of that string, plus CR LF.

Actual behavior

Running win32_test.exe and entering '我是中文字符' input on the console will return 6 null characters and CR LF, but still returns that the read operation was successful.

@ghost ghost added Needs-Triage It's a new issue that the core contributor team needs to triage at the next triage meeting Needs-Tag-Fix Doesn't match tag requirements labels Feb 12, 2020
@eryksun
Copy link

eryksun commented Feb 13, 2020

ReadFile and ReadConsoleA are currently limited to 7-bit ASCII when the input codepage is UTF-8 (65001) due to an assumption of 1 CHAR per WCHAR when calling WideCharToMultiByte.

A ordinal in the range [0x000000, 0x00FFFF], i.e. the Basic Multilingual Plane (BMP), uses a single WCHAR value. UTF-8 uses 1 byte per ASCII ordinal in the range [0x000000, 0x00007F]. UTF-8 uses 2-3 bytes per non-ASCII ordinal in the range [0x000080, 0x00FFFF]. A non-BMP ordinal in the range [0x010000, 0x10FFFF] uses two WCHAR values to store a UTF-16 surrogate pair. UTF-8 uses 4 bytes per non-BMP ordinal. (The maximum Unicode ordinal is capped by design at 0x10FFFF for compatibility with UTF-16, which uses a reserved 10-bit range in a pair of 16-bit codes to support a 20-bit space with an additional 16 supplementary planes, e.g. [0x010000, 0x01FFFF], [0x020000, 0x02FFFF], and so on up to [0x100000, 0x10FFFF].)

For non-ASCII ordinals, the internal WideCharToMultiByte call fails, and the initial null byte value is used. In Windows 10, with the new console enabled, we get 1 null byte in the result per WCHAR because it encodes one code at a time. For non-BMP ordinals, given the assumption of one CHAR per WCHAR and encoding one code at a time (i.e. naive handling of surrogate pairs), we should get two null bytes per non-BMP ordinal. However, with the new console I can't even get wide-character ReadConsoleW to work with non-BMP ordinals. It translates a UTF-16 surrogate pair to the replacement character (0x00FFFD). ReadConsoleW works fine with non-BMP ordinals in the legacy console, so some change in the new console has broken UTF-16 support in cooked reads.

For ReadFile and ReadConsoleA in older versions of Windows, or if we enable the legacy console in Windows 10, using UTF-8 as the input codepage causes a 'successful' read of 0 bytes if the read contains even one non-ASCII ordinal. Many programs interpret a successful read of 0 bytes as EOF.

@DHowett-MSFT
Copy link
Contributor

@eryksun thanks for the detailed write-up, and @stwish-msft thanks for the report. It looks like we don't actually have a bug tracking COOKED_READ not supporting UTF-8, but we do know about it. Perhaps it got lost in the transition our of Azure DevOps?

Regardless, this is now the one.

Eryk, would you mind filing a separate issue for the non-BMP ReadConsoleW bug? That one is likely more readily fixable than this one, and since it's a regression from legacy (and I've got a copy of every shipped version of conhost, so we can find out exactly when :P) it's pretty important.

@DHowett-MSFT DHowett-MSFT changed the title ReadFile on stdin fails to read UTF-8 COOKED_READ doesn't return UTF-8 on *A APIs in CP_UTF8 Feb 13, 2020
@DHowett-MSFT DHowett-MSFT added Area-Input Related to input processing (key presses, mouse, etc.) Area-Server Down in the muck of API call servicing, interprocess communication, eventing, etc. Issue-Bug It either shouldn't be doing this or needs an investigation. Issue-Task It's a feature request, but it doesn't really need a major design. Priority-2 A description (P2) Product-Conhost For issues in the Console codebase and removed Needs-Triage It's a new issue that the core contributor team needs to triage at the next triage meeting labels Feb 13, 2020
@ghost ghost removed the Needs-Tag-Fix Doesn't match tag requirements label Feb 13, 2020
@DHowett-MSFT DHowett-MSFT added this to the Console Backlog milestone Feb 13, 2020
@DHowett-MSFT
Copy link
Contributor

I'm giving this one the unusual denomination of "bugtask". We have a couple of them -- it's a bug, yes, but it's a fairly big chunk of work and a new feature to boot. 😄

@clinton-r
Copy link

Hi, I'm assuming that by "cooked" you mean that the following are enabled in the console mode on stdin:
ENABLE_ECHO_INPUT
ENABLE_LINE_INPUT
ENABLE_PROCESSED_INPUT
Is that right?

For me, the problem occurs even with those disabled.

Here is my test code:

#include <stdio.h>
#include <windows.h>

int main(void)
{
    // Set UTF-8 code page for input and output
    if (!SetConsoleOutputCP(65001))
        return 1;

    if (!SetConsoleCP(65001))
        return 2;

    printf("Output code page is %d, input code page is %d\n",
        (int)GetConsoleOutputCP(), (int)GetConsoleCP());

    puts(u8"We can output utf-8: γατάκι");

    HANDLE hStdin = GetStdHandle(STD_INPUT_HANDLE);
    if (hStdin == INVALID_HANDLE_VALUE || hStdin == NULL)
        return 3;

    // Set input mode
    DWORD mode;
    if (!GetConsoleMode(hStdin, &mode))
        return 4;
    mode &= ~(ENABLE_ECHO_INPUT | ENABLE_LINE_INPUT | ENABLE_PROCESSED_INPUT);
    if (!SetConsoleMode(hStdin, mode))
        return 5;
    if (!GetConsoleMode(hStdin, &mode))
        return 6;
    printf("Console mode for stdin is 0x%08x\n", (int)mode);

    puts("Input is now in 'raw' mode.  Type something.");

    char b;
    do {
        DWORD numRead;
        if (!ReadFile(hStdin, &b, 1, &numRead, NULL) || numRead == 0)
            return 7;
        printf("%02x ", (int)b & 0x0ff);
    } while (b != 'q');

    return 0;
}

If I run this Windows Terminal running cmd.exe and paste the text "I8Σπ q" when prompted, the output looks like this:

C:\code\c\vt_experiments>test
Output code page is 65001, input code page is 65001
We can output utf-8: γατάκι
Console mode for stdin is 0x000001f0
Input is now in 'raw' mode.  Type something.
49 38 00 00 20 71
C:\code\c\vt_experiments>

You can see that Σ and π are read as zeros.

Conhost running cmd.exe is the same except that the reported console mode is 0x1b0.

Some of the set flags in the console modes 0x1f0 or 0x1b0 seem to be undocumented, or am I reading them wrong? Console mode flags reference: https://docs.microsoft.com/en-us/windows/console/high-level-console-modes

Instead of ReadFile() I've also tried getchar(), scanf_s(), fgetc() - none of them worked either.

I'm new to console programming, so sorry if I've overlooked something that should be obvious.

Thanks!

Microsoft Visual Studio Community 2019 Version 16.8.3
Microsoft Windows [Version 10.0.19041.685]
Windows Terminal Version: 1.4.3243.0

xenu added a commit to xenu/perl5 that referenced this issue Apr 9, 2021
Due to a bug in Windows, ReadFile() and ReadConsoleA() (and thus
_read()), return zeros instead of non-ASCII characters when the console
codepage is set to 65001. See this ticket for more details:
microsoft/terminal#4551

This commit works around that bug by using ReadConsoleW() inside
win32_read() when the passed fd points to the console and the console
codepage is set to 65001.

Fixes Perl#18701
xenu added a commit to xenu/perl5 that referenced this issue Apr 10, 2021
Due to a bug in Windows, ReadFile() and ReadConsoleA() (and thus
_read()), return zeros instead of non-ASCII characters when the console
codepage is set to 65001. See this ticket for more details:
microsoft/terminal#4551

This commit works around that bug by using ReadConsoleW() inside
win32_read() when the passed fd points to the console and the console
codepage is set to 65001.

Fixes Perl#18701
xenu added a commit to Perl/perl5 that referenced this issue Apr 13, 2021
Due to a bug in Windows, ReadFile() and ReadConsoleA() (and thus
_read()), return zeros instead of non-ASCII characters when the console
codepage is set to 65001. See this ticket for more details:
microsoft/terminal#4551

This commit works around that bug by using ReadConsoleW() inside
win32_read() when the passed fd points to the console and the console
codepage is set to 65001.

Fixes #18701
Corion pushed a commit to Corion/perl5 that referenced this issue Jun 20, 2021
Due to a bug in Windows, ReadFile() and ReadConsoleA() (and thus
_read()), return zeros instead of non-ASCII characters when the console
codepage is set to 65001. See this ticket for more details:
microsoft/terminal#4551

This commit works around that bug by using ReadConsoleW() inside
win32_read() when the passed fd points to the console and the console
codepage is set to 65001.

Fixes Perl#18701
vaintroub added a commit to MariaDB/server that referenced this issue Nov 23, 2021
…h UTF8 codepage

Corresponding Windows bug  microsoft/terminal#4551

Use ReadConsoleW instead and convert to console's input codepage, to
workaround.

Also, disable VT sequences in the console output, as we do not knows what
type of data comes with SELECT, we do not want VT escapes there.

Remove my_cgets()
vaintroub added a commit to MariaDB/server that referenced this issue Nov 27, 2021
…h UTF8 codepage

Corresponding Windows bug  microsoft/terminal#4551

Use ReadConsoleW instead and convert to console's input codepage, to
workaround.

Also, disable VT sequences in the console output, as we do not knows what
type of data comes with SELECT, we do not want VT escapes there.

Remove my_cgets()
vaintroub added a commit to MariaDB/server that referenced this issue Dec 2, 2021
…h UTF8 codepage

Corresponding Windows bug  microsoft/terminal#4551

Use ReadConsoleW instead and convert to console's input codepage, to
workaround.

Also, disable VT sequences in the console output, as we do not knows what
type of data comes with SELECT, we do not want VT escapes there.

Remove my_cgets()
vaintroub added a commit to MariaDB/server that referenced this issue Dec 12, 2021
… UTF8 codepage

Corresponding Windows bug  microsoft/terminal#4551

Use ReadConsoleW instead and convert to console's input codepage, to
workaround.

Also, disable VT sequences in the console output, as we do not knows what
type of data comes with SELECT, we do not want VT escapes there.

Remove my_cgets()
vaintroub added a commit to MariaDB/server that referenced this issue Dec 13, 2021
… UTF8 codepage

Corresponding Windows bug  microsoft/terminal#4551

Use ReadConsoleW instead and convert to console's input codepage, to
workaround.

Also, disable VT sequences in the console output, as we do not knows what
type of data comes with SELECT, we do not want VT escapes there.

Remove my_cgets()
@defrag257

This comment has been minimized.

vuvova pushed a commit to MariaDB/server that referenced this issue Dec 15, 2021
… UTF8 codepage

Corresponding Windows bug  microsoft/terminal#4551

Use ReadConsoleW instead and convert to console's input codepage, to
workaround.

Also, disable VT sequences in the console output, as we do not knows what
type of data comes with SELECT, we do not want VT escapes there.

Remove my_cgets()
@zadjii-msft zadjii-msft modified the milestones: Console Backlog, 22H2 Jan 4, 2022
vaintroub added a commit to MariaDB/server that referenced this issue Jan 18, 2022
… UTF8 codepage

Corresponding Windows bug  microsoft/terminal#4551

Use ReadConsoleW instead and convert to console's input codepage, to
workaround.

Also, disable VT sequences in the console output, as we do not knows what
type of data comes with SELECT, we do not want VT escapes there.

Remove my_cgets()
@zadjii-msft zadjii-msft added the Area-CookedRead The cmd.exe COOKED_READ handling label Feb 23, 2022
@ghost ghost added the In-PR This issue has a related PR label Jan 31, 2023
DHowett pushed a commit that referenced this issue Feb 28, 2023
The overarching intention of this PR is to improve our Unicode support.
Most
of our APIs still don't support anything beyond UCS-2 and DBCS
sequences.
This commit doesn't fix the UTF-16 support (by supporting surrogate
pairs),
but it does improve support for UTF-8 by allowing longer `char`
sequences.

It does so by removing `TranslateUnicodeToOem` which seems to have had
an almost
viral effect on code quality wherever it was used. It made the
assumption that
_all_ narrow glyphs encode to 1 `char` and most wide glyphs to 2
`char`s.
It also didn't bother to check whether `WideCharToMultiByte` failed or
returned
a different amount of `char`s. So up until now it was easily possible to
read
uninitialized stack memory from conhost. Any code that used this
function was
forced to do the same "measurement" of narrow/wide glyphs, because _of
course_
it didn't had any way to indicate to the caller how much memory it needs
to
store the result. Instead all callers were forced to sorta replicate how
it
worked to calculate the required storage ahead of time.
Unsurprisingly, none of the callers used the same algorithm...

Without it the code is much leaner and easier to understand now. The
best
example is `COOKED_READ_DATA::_handlePostCharInputLoop` which used to
contain 3
blocks of _almost_ identical code, but with ever so subtle differences.
After
reading the old code for hours I still don't know if they were relevant
or not.
It used to be 200 lines of code lacking any documentation and it's now
50 lines
with descriptive function names. I hope this doesn't break anything, but
to
be honest I can't imagine anyone having relied on this mess in the first
place.

I needed some helpers to handle byte slices (`std::span<char>`), which
is why
a new `til/bytes.h` header was added. Initially I wrote a `buf_writer`
class
but felt like such a wrapper around a slice/span was annoying to use.
As such I've opted for freestanding functions which take slices as
mutable
references and "advance" them (offset the start) whenever they're read
from
or written to. I'm not particularly happy with the design but they do
the job.

Related to #8000
Fixes #4551
Fixes #7589
Fixes #8663

## Validation Steps Performed
* Unit and feature tests ✅
* Far Manager ✅
* Fixes test cases in #4551, #7589 and #8663
@microsoft-github-policy-service microsoft-github-policy-service bot added the Needs-Tag-Fix Doesn't match tag requirements label Feb 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area-CookedRead The cmd.exe COOKED_READ handling Area-Input Related to input processing (key presses, mouse, etc.) Area-Server Down in the muck of API call servicing, interprocess communication, eventing, etc. In-PR This issue has a related PR Issue-Bug It either shouldn't be doing this or needs an investigation. Issue-Task It's a feature request, but it doesn't really need a major design. Needs-Tag-Fix Doesn't match tag requirements Priority-2 A description (P2) Product-Conhost For issues in the Console codebase
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants