Win95 println!("Hello, 世界!") panics when using Chinese locale #13

Closed
2moe opened this issue Dec 31, 2023 · 14 comments
Labels
bug Something isn't working enhancement New feature or request

Comments

@2moe

2moe commented Dec 31, 2023

Screenshot_2023-12-31__11-54-48

fn main() {
    println!("Hello, 世界!");
}

On win95, as long as unicows.dll is included, it will output unicode characters, but the program will end up in panic.

On WinXP, the same program will not panic.
But XP has other unicode problems.

If the "non-unicode profile" is English, then the unicode characters become "??" and do not automatically fall back to a font that contains them. And I think this may be a problem with the WinXP cmd itself.

bad:
Screenshot_2023-12-31__12-46-39

good:
Screenshot_2023-12-31__12-57-11

good:
Screenshot_2023-12-31__12-41-35

@2moe
Author

2moe commented Dec 31, 2023

BTW, SDK:

  • Microsoft Platform SDK February 2003
  • Microsoft Visual C++ Toolkit 2003

@seritools
Member

seritools commented Jan 1, 2024

Thanks for testing these!

And I think this may be a problem with the WinXP cmd itself.

Yeah, on WinXP there are no unicows fallbacks being loaded and the calls go straight through to WriteConsoleW just like on modern Windows, so it seems more like an issue of the old cmd.exe not fully running in Unicode mode or something similar.

EDIT: Ah, might actually be fixable!

This function uses either Unicode characters or 8-bit characters from the console's current code page. The console's code page defaults initially to the system's OEM code page. To change the console's code page, use the SetConsoleCP or SetConsoleOutputCP functions.

I'll try to reproduce it and add a workaround/fix! From this Stack Overflow post it seems like you have to switch to a TrueType font like Lucida Console and maybe even need to set up font fallback for Asian characters (SimHei, SimSun, MS PGothic, etc.). Anyway, it doesn't seem to be a rust9x-specific problem.

In fact, if you enable the (recently fully deprecated) legacy console in modern Windows versions:
image

You'll see the same behavior
image

On win95, as long as unicows.dll is included, it will output unicode characters, but the program will end up in panic.

I'm more surprised that it actually manages to output those characters, at least if your locale/windows language doesn't include them! Interesting, I'll try it on my Win98 system and see if I can figure out where the panic comes from.

@seritools
Member

seritools commented Jan 1, 2024

Just tested

fn main() {
    println!("Hello, 世界!");
}

on my Win98 machine, it doesn't crash, but also just writes "Hello, ??!" to the console as expected (since the codepage doesn't have those characters).

What version and language of Win95 did you use?

Regarding the panic: it seems like WriteConsoleW in unicows just passes through the lpNumberOfCharsWritten, and the Asian characters use surrogate pairs in UTF-16 (= 2 chars), but map to a single char in the used code page, causing a difference in length. I'll see what I can do with that check.

EDIT: checking in rust playground, '界'.len_utf16() is 1, so nothing weird here... no idea yet what causes it
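
The check mentioned in the EDIT can be reproduced with a few lines of standard-library Rust. This is only a sketch; `utf16_units` is a helper name made up for this example, and the emoji is added purely for contrast.

```rust
/// Number of UTF-16 code units needed to encode `s`.
fn utf16_units(s: &str) -> usize {
    s.encode_utf16().count()
}

fn main() {
    // Both characters are in the Basic Multilingual Plane: one unit each.
    assert_eq!('世'.len_utf16(), 1);
    assert_eq!('界'.len_utf16(), 1);
    // Only characters above U+FFFF need a surrogate pair (two units).
    assert_eq!('😊'.len_utf16(), 2);

    let s = "Hello, 世界!";
    println!("{} UTF-16 units, {} UTF-8 bytes", utf16_units(s), s.len());
}
```

So the UTF-16 side of the conversion is not where the extra length comes from for this string.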

@seritools seritools self-assigned this Jan 1, 2024
@2moe
Author

2moe commented Jan 1, 2024

What version and language of Win95 did you use?

That's Win95 Chinese Edition, and system language is Chinese.
The exact version may be OSR2.5.
I am running it in a virtual machine.


I have two ideas:

  1. convert the character encoding to UCS-2 or UTF-16 LE, and then use the "W" api to output.
  2. Instead of using unicode, convert the character encoding to the corresponding language encoding, and then use the "A" api.

Considering that 9x and NT have different levels of unicode support, it may be necessary to treat them separately.

Without knowing the underlying details, I've done tests before that show that:
The behavior of SetConsoleOutputCP(CP_UTF8) and chcp 65001 is "almost" the same.
When I use CP_UTF8, I'm actually using the "A" api.

@2moe
Author

2moe commented Jan 1, 2024

Off topic:
I don't know the history of the win95/98 era.
I'm curious if back then, if a software had to support languages from multiple countries around the world (including East Asia), it would need to be distributed separately.

@seritools
Member

seritools commented Jan 1, 2024

convert the character encoding to UTF-16 LE, and then use the "W" api to output.

that's exactly what the rust stdlib does :) it converts from utf8 to utf16 and calls WriteConsoleW, and then, on 9x/ME ...

Instead of using unicode, convert the character encoding to the corresponding language encoding, and then use the "A" api.

... unicows checks the system (ACP) and console (OEM) codepage, and converts the utf16 to the console codepage, and then calls WriteConsoleA

Considering that 9x and NT have different levels of unicode support, it may be necessary to treat them separately.

that's exactly what unicows is supposed to do ^^

I'm curious if back then, if a software had to support languages from multiple countries around the world (including East Asia), it would need to be distributed separately.

yes, definitely. lots of programs and games were specifically made for a region. there are lots of games that only work correctly with a Japanese locale, for example.

Codepages are byte-based, so they had to hack in support for multibyte characters (since obviously there are more than 256 Chinese characters):
https://learn.microsoft.com/en-us/cpp/c-runtime-library/single-byte-and-multibyte-character-sets?view=msvc-170
https://learn.microsoft.com/en-us/cpp/text/support-for-multibyte-character-sets-mbcss?view=msvc-170
https://learn.microsoft.com/en-us/windows/win32/intl/double-byte-character-sets
(All the pages are begging the reader to just use Unicode :^))

MBCS seems to work like a primitive, language/region-specific version of UTF-8. The lower half of the byte range (0x00-0x7F) stays ASCII, and a byte in the upper half can be an "MBCS lead byte", meaning that the next byte is part of the same character.
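
To make the lead-byte idea concrete, here is a minimal sketch that counts characters in a double-byte string. It assumes code page 936 (GBK) with a lead-byte range of 0x81 to 0xFE; `count_dbcs_chars` and `is_cp936_lead` are names invented for this example. On Windows, the real check would be the Win32 `IsDBCSLeadByteEx` function rather than a hard-coded range.

```rust
/// Counts characters in a double-byte (DBCS) encoded byte string.
/// `is_lead` decides whether a byte starts a two-byte character.
fn count_dbcs_chars(bytes: &[u8], is_lead: impl Fn(u8) -> bool) -> usize {
    let mut count = 0;
    let mut i = 0;
    while i < bytes.len() {
        // A lead byte also consumes the trail byte that follows it (if any).
        i += if is_lead(bytes[i]) && i + 1 < bytes.len() { 2 } else { 1 };
        count += 1;
    }
    count
}

/// Assumed lead-byte range for code page 936 (GBK): 0x81..=0xFE.
fn is_cp936_lead(b: u8) -> bool {
    (0x81..=0xFE).contains(&b)
}

fn main() {
    // "Hello, 世界!" in GBK: 世 = CA C0, 界 = BD E7.
    let gbk: &[u8] = b"Hello, \xCA\xC0\xBD\xE7!";
    println!(
        "{} bytes, {} characters",
        gbk.len(),
        count_dbcs_chars(gbk, is_cp936_lead)
    );
}
```

The byte count and the character count differ by exactly the number of double-byte characters, which is the gap that matters for the rest of this thread.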

The problem with unicows is that it just doesn't account for the MBCS multibyte characters (I don't think even Windows itself does; 'A' APIs always just work with byte strings, but still call them characters) when returning the "number of characters written". In other words, the string "Hello, 世界!" (plus NUL byte) is 11 characters, but is 13 bytes: ['H', 'e', 'l', 'l', 'o', ',', ' ', '世' (first half), '世' (second half), '界' (first half), '界' (second half), '!', '\0'].

Rust checks the number of chars written to know how much was actually written, but since the buffer only consisted of 11 UTF-16 wchars, the reported 13 (MBCS bytes) will be out of bounds when indexing.
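
The out-of-bounds condition can be sketched without any Windows API. This is an illustration of the arithmetic only, not the actual stdlib code; `written_slice` is a hypothetical helper, and the counts 11 and 13 are the ones from the comment above.

```rust
/// Returns the part of the UTF-16 buffer that a console write reported as
/// written, or `None` if the reported count exceeds the buffer length.
fn written_slice(utf16: &[u16], reported: usize) -> Option<&[u16]> {
    utf16.get(..reported)
}

fn main() {
    // The 11 UTF-16 units described above (10 characters plus NUL).
    let utf16: Vec<u16> = "Hello, 世界!\0".encode_utf16().collect();
    assert_eq!(utf16.len(), 11);

    // A count in UTF-16 units stays in bounds.
    assert!(written_slice(&utf16, 11).is_some());

    // A count in code-page bytes does not; `&utf16[..13]` would panic here.
    assert!(written_slice(&utf16, 13).is_none());
}
```

Using `get` instead of direct indexing turns the panic into a value the caller can handle, but it does not tell you how much was really written.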

So yeah, in the end,

Instead of using unicode, convert the character encoding to the corresponding language encoding, and then use the "A" api.

this is needed. I think doing the conversion on the stdlib side makes sense, so we know how many bytes we expect to write out. Thankfully console I/O is probably the only area where this is needed.

@seritools seritools added bug Something isn't working enhancement New feature or request good first issue Good for newcomers and removed good first issue Good for newcomers labels Jan 1, 2024
@seritools seritools changed the title win95 "Hello world" panic Win95 "Hello world" panic when using multi-byte encoding locale Jan 1, 2024
@seritools seritools changed the title Win95 "Hello world" panic when using multi-byte encoding locale Win95 println!("Hello, 世界!") panics when using chinese locale Jan 1, 2024
@seritools
Member

Without knowing the underlying details, I've done tests before that show that:
The behavior of SetConsoleOutputCP(CP_UTF8) and chcp 65001 is "almost" the same.
When I use CP_UTF8, I'm actually using the "A" api.

The UTF-8 console implementation was broken and not recommended until very recently (some Windows 10 release I think?). Either way, it won't help with the font rendering issue on Windows XP's cmd.exe, so there is no reason to change it from Rust mainline.

@seritools
Member

For 9x/ME:

So there's a right way, and a hacky way:

The right way (roughly what unicows does):

  1. On program init (unicows does it on DLL load)
    a. Get the OEM codepage with GetOEMCP(), check the maximum byte count per character via GetCPInfo()
  2. When writing and on an MBCS codepage:
    a. convert to utf-16 (MultiByteToWideChar), then to the OEM codepage (WideCharToMultiByte)
    b. call WriteConsoleA
    c. check that all bytes are written.

However, if the number of bytes written doesn't match, you'd have to scan through the string to figure out how many MBCS characters (not bytes) have been written, in order to report the correct usize for the length of the written UTF-8 text.
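
The scan described above could look roughly like the following sketch. It is not what unicows or rust9x does: `toy_encoded_len` is a made-up stand-in for a real code-page conversion (on Windows that would go through WideCharToMultiByte), treating ASCII as one byte and everything else as two.

```rust
/// Toy stand-in for a DBCS code page: ASCII is one byte, everything else two.
/// A real implementation would ask the OS how each character encodes.
fn toy_encoded_len(c: char) -> usize {
    if c.is_ascii() { 1 } else { 2 }
}

/// Given how many code-page bytes the console accepted, returns how many
/// UTF-8 bytes of `text` were fully written.
fn utf8_len_written(text: &str, bytes_written: usize) -> usize {
    let mut remaining = bytes_written;
    let mut utf8_len = 0;
    for c in text.chars() {
        let n = toy_encoded_len(c);
        if n > remaining {
            break; // this character was not (fully) written
        }
        remaining -= n;
        utf8_len += c.len_utf8();
    }
    utf8_len
}

fn main() {
    let text = "Hello, 世界!";
    // All 12 code-page bytes written: the full 14 UTF-8 bytes are done.
    assert_eq!(utf8_len_written(text, 12), 14);
    // Only 9 bytes written: "Hello, " plus 世 (7 + 3 UTF-8 bytes).
    assert_eq!(utf8_len_written(text, 9), 10);
    // A write that stops mid-character does not count the split character.
    assert_eq!(utf8_len_written(text, 8), 7);
}
```

The result is a UTF-8 byte count, which is the unit a Rust `write` call has to return.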

The hacky way:

  1. just ignore the number of chars written completely and assume that all writes of <=8KB (console buffer size in Rust) will succeed.

This will actually likely work, as the buffer hopefully isn't smaller than 8K on any Windows version, and thus should always be able to write the entire buffer. I think I'll go with this one and create an improvement issue if someone wants to implement the proper way.
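
In code, the hacky approach boils down to something like the sketch below. This is not the actual rust9x patch; `units_written` and the `unicode_console` flag are hypothetical names used only to show the decision.

```rust
/// Decides how many UTF-16 units to treat as written.
/// `unicode_console` is a hypothetical flag: true on NT-based Windows,
/// false when unicows is translating the call to an "A" API.
fn units_written(requested: usize, reported: usize, unicode_console: bool) -> usize {
    if unicode_console {
        // Trust the OS, but never index past what was requested.
        reported.min(requested)
    } else {
        // The reported value counts code-page bytes, so ignore it and
        // assume the whole (at most 8 KB) chunk was written.
        requested
    }
}

fn main() {
    // 9x/ME with a DBCS code page: 13 reported for an 11-unit buffer.
    assert_eq!(units_written(11, 13, false), 11);
    // NT: a partial write is still honored.
    assert_eq!(units_written(11, 5, true), 5);
}
```

The trade-off is that a genuinely partial write on 9x/ME would go unnoticed, which is why the proper way is left as a possible improvement.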

@seritools
Member

Oh, it always happens when the number of characters in utf16 doesn't match the number of characters in the output. This can easily happen with emojis as well.

@2moe 2moe changed the title Win95 println!("Hello, 世界!") panics when using chinese locale Win95 println!("Hello, 世界!") panics when using Chinese locale Jan 1, 2024
@seritools
Member

@2moe I've added the hacky fix to rust9x for now, and updated the description in #14.

I'll upload a rust9x v2 dist in a bit if you'd like to test :)

@2moe
Author

2moe commented Jan 1, 2024

On my pc, it takes about an hour to compile rust9x(stage2) manually.
Maybe letting github actions compile it automatically is a better option.

Right now I'm not home to do the test.
I can then send you a PR to have "github actions" automatically compile and publish to "github releases".

@seritools
Member

@2moe
Author

2moe commented Jan 1, 2024

@seritools Thank you. You are very warm and friendly. 😊

@2moe
Author

2moe commented Jan 7, 2024

It works.

Screenshot_2024-01-06__19-16-36.jpg

seritools added a commit that referenced this issue Dec 1, 2024
- Allow dropping unknown characters. unicows just doesn't understand emojis :(
- Ignore mismatched lengths when writing to console on non-Unicode Windows. (workaround for #13)
seritools added a commit that referenced this issue Dec 3, 2024
- Allow dropping unknown characters. unicows just doesn't understand emojis :(
- Ignore mismatched lengths when writing to console on non-Unicode Windows. (workaround for #13)