UTF-8 decoding problem when a codepoint straddles an i/o boundary #17862
Comments
To my horror I found that cmd.exe uses the current codepage to translate the output of To fix this issue we need to do two things:
It may be possible to just do the latter only, given that we don't support joining DBCS in any API except for stdin.
This is #386 by the way. @lhecker you eventually confirmed that this is a CMD bug. 👍 @aidtopia if you need to work around this in a CMD shell then use another command line tool to write the file content. Windows ships with findstr.exe which is suitable.
I'm glad to hear the UTF-8 problem is understood. The problem with the combining characters is more subtle than I realized. It's probably a separate issue. If I find a more illustrative repro, I'll file another bug report for just that. @german-one: That's a clever use of
Just to be sure, since it wasn't yet mentioned here: "Windows Terminal Preview" 1.22 is the first version that supports combining characters. You can find it in the Microsoft Store app and in our releases page.
Ah, that explains why I haven't been able to reproduce exactly what I saw before. I must've first spotted the combining bug while using a (probably quite old) version of Preview, but just very recently switched to the mainstream release because ... reasons. When I get a chance, I'll try the current Preview release.
Windows Terminal version
1.20.11781.0
Windows build number
10.0.19045.4780
Other Software
No response
Steps to reproduce
chcp 65001
type foo.txt
Note that near the end of the output there are a couple of Unicode replacement characters.
What's happening is that type sends the text to the terminal in 512-byte blocks. The UTF-8 encoding of U+20B0 takes 3 bytes. Since 512 isn't a multiple of 3, the 171st German Penny Sign is split across the boundary of the first and second write operations issued by type. The UTF-8 decoder is resetting its state with each write.
But it's not just UTF-8 decoding. If one write ends with a complete character, and the next write begins with a combining character, they either (1) won't be composed or (2) will be composed but there will be an empty cell immediately after the result.
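To make the 512-byte arithmetic concrete, here is a minimal Python sketch of the failure mode. It is not the terminal's or cmd.exe's actual code: the 200-character buffer is a hypothetical stand-in for foo.txt, and decoding each block independently is an assumption that mimics a decoder which resets its state on every write.

```python
PENNY = "\u20b0"   # GERMAN PENNY SIGN, 3 bytes in UTF-8 (E2 82 B0)
BLOCK = 512        # write size reported in this issue

# Hypothetical stand-in for foo.txt: 200 penny signs, 600 bytes.
data = (PENNY * 200).encode("utf-8")

# 512 // 3 == 170 whole characters fit in the first block,
# and 512 % 3 == 2 bytes of the 171st character spill into it.
whole_chars, leftover_bytes = divmod(BLOCK, len(PENNY.encode("utf-8")))

# Split the byte stream the way a 512-byte writer would.
chunks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]

# Decode each block on its own, i.e. with no state carried across writes.
stateless = "".join(c.decode("utf-8", errors="replace") for c in chunks)

print(whole_chars, leftover_bytes)   # 170 2
print("\ufffd" in stateless)         # True: the split character is lost
```

The first block ends with the bytes E2 82 and the second begins with B0, so neither block on its own contains a decodable 171st character.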
These problems occur less frequently with applications that issue larger writes, but they do still happen. They can even happen with applications that normally flush the output on line boundaries if a single line grows so long that an intermediate flush occurs.
foo.txt
Expected Behavior
I expected UTF-8 decoding and composition of combining characters to resync if a sequence of bytes that represents a single codepoint or grapheme cluster happens to fall on the boundary between two consecutive writes.
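As an illustration of the expected behavior for code points, the sketch below carries decoder state across writes using Python's standard incremental decoder. The function name decode_writes is hypothetical, and this only shows the general technique; it says nothing about how the fix would be implemented in the console host, and it does not address grapheme clusters or combining characters.

```python
import codecs

def decode_writes(chunks):
    """Decode a sequence of byte writes as one continuous UTF-8 stream."""
    # The incremental decoder buffers an incomplete trailing sequence
    # and completes it when the next write arrives.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    parts = [decoder.decode(chunk) for chunk in chunks]
    # final=True flushes the buffer; a sequence still incomplete at the
    # very end of the stream becomes a replacement character.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)

text = "\u20b0" * 200                      # hypothetical stand-in for foo.txt
data = text.encode("utf-8")
chunks = [data[i:i + 512] for i in range(0, len(data), 512)]

print(decode_writes(chunks) == text)       # True
```

Because the two leftover bytes of the 171st character are held back until the next write supplies the third, the output matches the original text regardless of where the write boundaries fall.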
Actual Behavior
Note the replacement characters in the output.