How to write as UTF-8 to console? #396

Closed
davidanthoff opened this issue Mar 27, 2019 · 20 comments

@davidanthoff

Various blogs have mentioned that the internal buffer for the console now can represent full UTF-8. What is not clear to me is how, as a normal command line app, I can write/print/output UTF-8 to the console. Do I still use WriteConsole? But how do I signal that what I'm passing is UTF-8 and not UCS-2?

@miniksa
Member

miniksa commented Mar 27, 2019

Use SetConsoleOutputCP and/or SetConsoleCP to set CP_UTF8 which is 65001.

If you use the stream based APIs of the console like WriteConsole, WriteFile, ReadConsole, and ReadFile, it should work fine. If you start using more of the complicated APIs that use structured data like ReadConsoleOutput, it is probably going to have issues that you won't like for assorted reasons I don't have time to get into right now.

@miniksa miniksa added Issue-Question For questions or discussion Product-Conhost For issues in the Console codebase labels Mar 27, 2019
@miniksa miniksa self-assigned this Mar 27, 2019
@davidanthoff
Author

But the buffers I pass to WriteConsole still have to be UCS-2 encoded in that case? Or can I pass a buffer that is actually UTF-8 encoded? I just tried the latter, and that doesn't seem to work.

@miniksa
Member

miniksa commented Mar 27, 2019

WriteConsoleA with SetConsoleOutputCP set to CP_UTF8 should accept a UTF-8 encoded stream if your revision of Windows is high enough to contain the support. If you have 1809 or 1903, it should be fine. I don't know about 1803.
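
A minimal sketch of the approach described above, assuming a Windows release new enough to have the UTF-8 support (1809 or 1903 per this comment). The helper name `write_utf8` is hypothetical. The non-Windows branch exists only so the sketch compiles elsewhere. Note that `SetConsoleOutputCP` changes the code page for the whole console, not just the calling process.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper: writes `len` bytes of UTF-8 to standard output.
 * Returns the count the OS reported as written, or -1 on failure. */
long write_utf8(const char *utf8, size_t len)
{
    assert(utf8 != NULL);
#ifdef _WIN32
    HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD mode, written = 0;

    if (out == NULL || out == INVALID_HANDLE_VALUE)
        return -1;

    if (GetConsoleMode(out, &mode)) {
        /* A real console: select UTF-8, then use the A (byte) API. */
        if (!SetConsoleOutputCP(CP_UTF8))
            return -1;
        if (!WriteConsoleA(out, utf8, (DWORD)len, &written, NULL))
            return -1;
    } else {
        /* Redirected to a file or pipe: WriteConsoleA would fail here. */
        if (!WriteFile(out, utf8, (DWORD)len, &written, NULL))
            return -1;
    }
    return (long)written;
#else
    /* Portable fallback so the sketch builds outside Windows. */
    size_t n = fwrite(utf8, 1, len, stdout);
    fflush(stdout);
    return n == len ? (long)n : -1;
#endif
}
```

Calling `write_utf8("h\xC3\xA9\n", 4)` passes the four UTF-8 bytes of "hé" plus a newline straight through, with no conversion to UTF-16 in the application.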

@davidanthoff
Author

Ah, that works, thanks! I had been trying WriteConsoleW!

@davidanthoff
Author

It would be nice if this was documented somewhere :)

@miniksa
Member

miniksa commented Mar 27, 2019

You're welcome to drop an issue or send a PR to the docs site: https://github.com/MicrosoftDocs/Console-Docs

@davidanthoff
Author

I won't. I'm happy to contribute to open source projects, but last time I checked I paid for my Windows license :)

@miniksa
Member

miniksa commented Mar 28, 2019

That's fine. You do you.

@davidanthoff
Author

Ok, one more question :) Are there any features/things that only work if one uses the WriteConsoleA with UTF-8 strings, or does it essentially not matter which API one uses?

One thing is that one probably can represent some extra unicode chars that can't be represented as UCS-2, right? Should one expect a performance difference? Any other difference worth keeping in mind?

@miniksa
Member

miniksa commented Apr 1, 2019

Sorry, I'm not willing to keep answering your questions because you were flippant about your Windows licensing costs.

@davidanthoff
Author

I’m sorry, I did not mean to be flippant at all! But I’m not willing to provide free labor for a product that Microsoft then charges a license fee for. Heck, I don’t even mind that arrangement at all, I have no beef whatsoever with you guys charging for Windows, I’m a happy, willingly paying customer. But don’t ask me to work for free for your commercial product.

@davidanthoff
Author

And just to be even more clear: I of course also don't mind if you ask me to do something, I just won't do it. And sorry again for the less than ideal wording in my response above.

@miniksa
Member

miniksa commented Apr 2, 2019

It's fine if you feel that way. I just want you to understand that statements like "I've already paid for it" rub us severely the wrong way: we are the dev team who works on this thing, and we have minimal control over product decisions and business process, including licensing.

We're trying to build a community here of developers helping developers directly because we believe it's the most expedient way to help each other out and that helping someone out should be done whether or not it is in someone's specific job description and regardless of how money is being exchanged. To us, a community where everyone can help out a little bit is a community where everyone's lives get a bit better.

We ask for your help, like we ask for anyone in the community's help, not because we're asking you to work for free and you've already paid for it. No one likes working for free. It's because we don't see the world the way an external person sees the world. We believe that your description of how this works for you, the problems you encountered while trying to use our software, or the bug ticket in your words on the appropriate tracker conveys a more accurate picture of the world than we are capable of as folks on the inside.

Given our limited time as folks on the inside with a ton of folks shouting at us from every angle, we prefer to work with people who appear at least mildly sympathetic to where our small dev team falls as cogs within a giant machine and are willing to help us help everyone in the mildest way. Your comments don't strike me that way, and as such...

"I of course also don't mind if you ask me to do something, I just won't do it."

@davidanthoff
Author

I'm highly sympathetic to your situation, and I think what you and your team are doing is awesome. I did not mean my comments to be confrontational at all; clearly they misfired, and I apologize for them. I do have a very limited time budget for this kind of stuff, and I generally devote that to open source projects. Please don't interpret that in any form as criticism of 1) what you do, 2) what your team does, or 3) even the general arrangement of how Microsoft sells Windows. As I wrote above, I'm perfectly happy to buy Windows and get an awesome product in return, no qualms with that in any form. I am a happy customer, and I think the kind of outreach and user engagement you are doing here at the console team is fantastic.

I'm interested in figuring out the answer to my question for the libuv project. That powers the console experience for nodejs and julia, and probably many others. Currently it takes UTF-8 encoded strings from "clients" (like julia and nodejs), converts them to UCS-2 and then calls the WriteConsoleW API. I suggested over there that one could just pass the UTF-8 strings directly (like you suggested above), and one question that came up was what one would gain from that, given that the existing implementation works, and that they want to continue to support older Windows versions and are worried about code complexity. So I'm simply trying to understand whether there are things that only work when one uses the UTF-8 API, or whether some things work better that way.
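
For reference, the conversion step described above (UTF-8 in, UTF-16 out, before a W API call) can be sketched as follows. This is a hand-written portable illustration, not libuv's actual code. On Windows the same step is commonly done with `MultiByteToWideChar(CP_UTF8, ...)`. The function name `utf8_to_utf16` is hypothetical, and the sketch does not reject overlong encodings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Converts UTF-8 to UTF-16 code units.
 * Returns the number of units written, or (size_t)-1 on malformed input
 * or when `out` is too small. Overlong forms are NOT rejected here. */
size_t utf8_to_utf16(const unsigned char *in, size_t in_len,
                     uint16_t *out, size_t out_cap)
{
    size_t i = 0, n = 0;
    assert(in != NULL && out != NULL);

    while (i < in_len) {
        uint32_t cp;
        size_t extra;
        unsigned char b = in[i++];

        /* Decode the lead byte: payload bits and continuation count. */
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else return (size_t)-1;

        if (in_len - i < extra)
            return (size_t)-1;              /* truncated sequence */
        while (extra--) {
            if ((in[i] & 0xC0) != 0x80)
                return (size_t)-1;          /* not a continuation byte */
            cp = (cp << 6) | (in[i++] & 0x3F);
        }
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return (size_t)-1;              /* outside Unicode scalar range */

        if (cp < 0x10000) {
            if (n + 1 > out_cap) return (size_t)-1;
            out[n++] = (uint16_t)cp;        /* one unit: the UCS-2 range */
        } else {
            if (n + 2 > out_cap) return (size_t)-1;
            cp -= 0x10000;                  /* surrogate pair above U+FFFF */
            out[n++] = (uint16_t)(0xD800 | (cp >> 10));
            out[n++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        }
    }
    return n;
}
```

Passing UTF-8 directly to the A API would let an application drop this step, which is the code-complexity trade-off being asked about.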

@miniksa
Member

miniksa commented Apr 2, 2019

There is no additional functionality you gain by passing UTF-8 over the A API versus passing UTF-16 over the W API.

We will convert one or the other into whatever format is required for us to maintain the internal cellular storage. If the cellular storage improves release over release, then it will improve for transmissions on both API surfaces.

@vtjnash

vtjnash commented Apr 2, 2019

Hi miniksa, I'd just like to confirm something in your comment just now, since I've had to field this question on several occasions: In the past, the Console (via W) only supported the UCS-2 subset of UTF-16, presumably for backwards compatibility. Are you saying there's a plan to change W to support full UTF-16 (or that this has already happened), or just that UTF-8 (codepage 65001) is intended to give the same behavior? If this is changing, I'd just note that Wikipedia currently says this explicitly, so it would be great if the appropriate team at Microsoft could document it officially, and then fix the citation link at https://en.wikipedia.org/wiki/Win32_console#Windows_NT_and_Windows_CE

@miniksa
Member

miniksa commented Apr 2, 2019

When it is officially supported and completed, we will document it. Until then, it's been a multi-release journey that is still incomplete and buggy. It is in progress. You should still stay below U+FFFF inside the UCS-2 boundary on both the A and W APIs until officially announced and documented.
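
A small sketch of how an application might check that advice before writing, assuming its output is well-formed UTF-8. The function name `utf8_within_bmp` is hypothetical. It relies on the fact that UTF-8 uses 4-byte sequences exactly for code points at U+10000 and above, and that those sequences start with a byte of 0xF0 or higher.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if no code point in the (assumed well-formed) UTF-8 buffer
 * lies above U+FFFF, and 0 otherwise. */
int utf8_within_bmp(const unsigned char *s, size_t len)
{
    size_t i;
    assert(s != NULL);
    for (i = 0; i < len; i++) {
        if (s[i] >= 0xF0)   /* lead byte of a 4-byte sequence */
            return 0;
    }
    return 1;
}
```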

We will be unable to fix the Wikipedia article, as I believe their code of conduct prohibits the people who work on the thing, or who are the thing, from writing the article about themselves. We would update docs.microsoft.com and likely our blog when the console is capable of crossing the UCS-2 boundary on our APIs, A or W.

@vtjnash

vtjnash commented Apr 2, 2019

OK, I understand, that's great to hear.

@eryksun

eryksun commented Apr 8, 2019

"If you use the stream based APIs of the console like WriteConsole, WriteFile, ReadConsole, and ReadFile, it should work fine"
[...]
"WriteConsoleA with SetConsoleOutputCP set to CP_UTF8 should accept a UTF-8 encoded stream if your revision of Windows is high enough to contain the support. If you have 1809 or 1903, it should be fine. I don't know about 1803."

The console in 1803 does not support reading the input buffer as UTF-8. Specifically, non-ASCII characters are converted to null characters because it can't handle the multibyte encoding (e.g. "abcĀdef" is read as "abc\x00def"). Prior to Windows 10 it didn't even keep ASCII characters. The call would succeed with 0 bytes read, which looks like EOF.

Writing to the screen buffer works well in Windows 8+. However, with older versions, WriteConsoleA and WriteFile (to the console) mistakenly return the number of UCS-2 codes written instead of the number of bytes, which confuses buffered writers, including C FILE streams. This results in a sequence of writes that appear as random characters after a write that contains non-ASCII characters.
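
To illustrate the mismatch, here is a sketch that computes how many UTF-16 code units a well-formed UTF-8 buffer corresponds to. The function name `utf16_units_for_utf8` is hypothetical. For "abcĀdef" the buffer is 8 bytes but only 7 code units, so a caller that was told "7 written" after passing 8 bytes would plausibly conclude that one byte is still pending.

```c
#include <assert.h>
#include <stddef.h>

/* Number of UTF-16 code units that a well-formed UTF-8 buffer maps to. */
size_t utf16_units_for_utf8(const unsigned char *s, size_t len)
{
    size_t i, units = 0;
    assert(s != NULL);
    for (i = 0; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)  /* not a continuation: new code point */
            units++;
        if (s[i] >= 0xF0)           /* 4-byte lead: needs a surrogate pair */
            units++;
    }
    return units;
}
```

A writer that must run on those older versions cannot treat the reported count as a byte count whenever the two numbers differ.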

All in all, if you're still supporting Windows 7, then using UTF-8 in the standard Windows console is probably not an option. It will be good enough only if you're writing to the console directly via WriteConsoleA or WriteFile instead of using a buffered stream, and you don't need to read non-ASCII characters.

@ebickle

ebickle commented Dec 13, 2020

Apologies about posting on an old closed issue, but I had a question or two regarding the advice to use SetConsoleOutputCP with CP_UTF8 with WriteConsoleA (as well as cout and related single-byte streams, I assume).

I'm assisting with a large, existing codebase that uses UTF-8 for console output and is cross-platform - everything using cout/printf, as well as through integrated third-party libraries that can't be switched to w versions.

SetConsoleOutputCP seems to work, but it has the effect of changing the codepage for the entire console and keeping that setting in effect even after the calling process exits. This can lead to some potentially unexpected behavior:

  • User opens a terminal window and executes a command line application that has UTF-8 output; the application calls SetConsoleOutputCP and then exits. The user then runs a second command line application that is not UTF-8 aware, and the second application crashes or behaves unexpectedly.
  • User opens a terminal window, executes a command line application that has UTF-8 output, and pipes the output to a second command line application that is not UTF-8 aware. In this scenario only ASCII7 is output despite the console mode change - but the second application does not expect it.

What's the best practice to resolve these cases? I've seen quite a bit of advice (not just here) to call SetConsoleOutputCP with CP_UTF8 but nothing on how to resolve the latent side-effects - or whether it's worth even considering them :)
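
The thread does not answer this question. One mitigation that is sometimes used, sketched here as an assumption rather than an endorsed best practice, is to remember the console's output code page before switching to UTF-8 and to restore it before the process exits. The type and function names (`utf8_cp_guard`, `utf8_cp_guard_enter`, `utf8_cp_guard_leave`) are hypothetical. The sketch does not cover abnormal termination, and it does not address the piping scenario.

```c
#include <assert.h>
#include <stddef.h>

#ifdef _WIN32
#include <windows.h>
#endif

typedef struct {
    unsigned int saved_cp;  /* code page to restore on leave */
    int active;             /* 1 if this guard changed the code page */
} utf8_cp_guard;

/* Switches the console output code page to UTF-8, remembering the old one.
 * Returns 1 on success (including "nothing to do"), 0 on failure. */
int utf8_cp_guard_enter(utf8_cp_guard *g)
{
    assert(g != NULL);
    g->saved_cp = 0;
    g->active = 0;
#ifdef _WIN32
    {
        UINT current = GetConsoleOutputCP();
        if (current == 0 || current == CP_UTF8)
            return 1;   /* no console attached, or already UTF-8 */
        if (!SetConsoleOutputCP(CP_UTF8))
            return 0;
        g->saved_cp = current;
        g->active = 1;
    }
#endif
    return 1;
}

/* Restores the code page saved by utf8_cp_guard_enter, if it changed one. */
void utf8_cp_guard_leave(utf8_cp_guard *g)
{
    assert(g != NULL);
#ifdef _WIN32
    if (g->active)
        SetConsoleOutputCP(g->saved_cp);
#endif
    g->active = 0;
}
```

An application would call `utf8_cp_guard_enter` once at startup and `utf8_cp_guard_leave` on every normal exit path.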
