Requiring wchar_t to represent all members of the execution wide character set does not match existing practice #9

Closed
tahonermann opened this issue May 2, 2018 · 19 comments

@tahonermann
Member

5.13.3 [lex.ccon] p6 states: (http://eel.is/c++draft/lex.ccon#6)

... [ Note: The type wchar_t is able to represent all members of the execution wide-character set
(see [basic.fundamental]). — end note ]

6.7.1 [basic.fundamental] p5 states: (http://eel.is/c++draft/basic.fundamental#5)

Type wchar_t is a distinct type whose values can represent distinct codes for all members of the
largest extended character set specified among the supported locales. ...

However, on Windows, wchar_t is 16-bit and unable to represent all members of the execution wide character set. The standard should be updated to reflect existing practice.
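
A minimal C++17 sketch of why a 16-bit wchar_t falls short: a single object of that width cannot hold code points above U+FFFF. The helper name `can_hold_code_point` is made up for illustration and is not a standard facility.

```cpp
#include <cassert>
#include <limits>

// Hypothetical helper (not from the standard): true if one object of type
// CharT can hold the code point value `cp` without narrowing.
template <typename CharT>
constexpr bool can_hold_code_point(char32_t cp) {
    return static_cast<unsigned long long>(cp) <=
           static_cast<unsigned long long>(std::numeric_limits<CharT>::max());
}

// Every Unicode code point fits in a char32_t, but not in a char16_t.
static_assert(can_hold_code_point<char32_t>(0x10FFFF));
static_assert(can_hold_code_point<char16_t>(0xFFFF));
static_assert(!can_hold_code_point<char16_t>(0x10000));

// Platform-dependent: false where wchar_t is 16 bits wide,
// true where it is 32 bits wide.
constexpr bool wchar_holds_all_unicode = can_hold_code_point<wchar_t>(0x10FFFF);
```

On a platform with a 16-bit wchar_t, `wchar_holds_all_unicode` is false; with a 32-bit wchar_t it is true.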

@tahonermann added the "help wanted" and "clarification" labels on May 2, 2018
@cubbimew

cubbimew commented May 2, 2018

Pretty sure the execution wide character set on Windows is the set formerly known as "UCS-2".

@steve-downey
Collaborator

In reality, not in ages. As I understand it, UCS-2 went away in Windows at least 10 years ago.
And if it does still exist, it should not be standardized where people will expect working Unicode.

@tahonermann
Member Author

> Pretty sure the execution wide character set on Windows is the set formerly known as "UCS-2".

Microsoft switched from UCS-2 to UTF-16 with the Windows 2000 release. Of course, by then, it was too late to change the size of wchar_t.

MSVC encodes characters outside the BMP using surrogate code points as would be expected. For example, the following code is accepted with Visual Studio 2017 (with the /std:c++latest option):

static_assert(L"\U00010000"[0] == 0xD800);
static_assert(L"\U00010000"[1] == 0xDC00);

I guess an argument could be made that U+10000 is not technically a member of the execution wide character set because it can't be represented in a single wchar_t value. But I think most people would find that argument inconsistent with what is generally considered a character.
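
The two code units in those assertions follow from the UTF-16 encoding rule for supplementary code points. A minimal sketch, assuming C++17; `encode_utf16` and `Utf16Units` are hypothetical names used for illustration, not anything MSVC exposes:

```cpp
#include <cassert>
#include <cstddef>

// Result of encoding one code point: one or two UTF-16 code units.
struct Utf16Units {
    char16_t units[2];
    std::size_t count;
};

// Hypothetical helper: encodes a Unicode scalar value as UTF-16.
// Assumes cp <= 0x10FFFF and that cp is not itself a surrogate.
constexpr Utf16Units encode_utf16(char32_t cp) {
    if (cp < 0x10000) {
        return {{static_cast<char16_t>(cp), 0}, 1};  // BMP: one code unit
    }
    const char32_t v = cp - 0x10000;  // 20-bit offset past the BMP
    return {{static_cast<char16_t>(0xD800 + (v >> 10)),     // high surrogate
             static_cast<char16_t>(0xDC00 + (v & 0x3FF))},  // low surrogate
            2};
}

// U+10000 is the first code point that needs a surrogate pair.
static_assert(encode_utf16(0x10000).count == 2);
static_assert(encode_utf16(0x10000).units[0] == 0xD800);
static_assert(encode_utf16(0x10000).units[1] == 0xDC00);
```

The same arithmetic gives 0xD83C as the first code unit of U+1F34C.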

@cubbimew

cubbimew commented May 4, 2018

Yes, the encoding of wide string literals and the recently added filesystem paths is UTF-16, but wide char literals and the codecvts aren't. I like the idea of standardizing existing practice (even if I don't like the practice), but how different would [lex.ccon]/6 have to become to make it okay for L'\U0001F34C' to have the value 0xd83c?

@cubbimew
Copy link

cubbimew commented May 4, 2018

...perhaps
"The value of a wide-character literal containing a single c-char has value equal to the numerical value of the encoding of the c-char in the execution wide-character encoding, provided it is representable with a single code unit in that encoding"

@dimztimz

dimztimz commented May 5, 2018

Windows does tricks to conform to the standard. Whenever you use C or C++ standard library stuff (wprintf, wscanf, setlocale, std::locale, etc.), wchar_t strings are UCS-2 strings, and when you use WinAPI, wchar_t strings are UTF-16 strings.

All supported multibyte locales on Windows with setlocale/std::locale can be mapped to UCS-2. Because wchar_t is not UTF-32, Windows does not support setting UTF-8 via the standard library locales.

@tahonermann
Member Author

> Yes, encoding of wide string literals and the recently-added filesystem paths is UTF-16, but wide char literals and the codecvts aren't

I'm not sure I'm following here. I don't think it makes much sense to think of wide character literals as having any particular encoding since they can only produce a single code unit. MSVC behaves a bit oddly (as you noticed) by mapping code points outside the BMP to the first code unit of their encoded representation. For example, MSVC accepts the following:

static_assert(L'\U00010000' == 0xD800);

I wouldn't be surprised if this is just an artifact of the implementation and not intentional behavior.

With regard to codecvt, I presume you are under the impression that codecvt<wchar_t, char, mbstate_t> converts between the execution character set and UCS-2? I haven't tested, but I would be quite surprised if that were the case.

> I like the idea of standardizing existing practice (even if I don't like the practice), but how different would [lex.ccon]/6 have to become to make it okay for L'\U0001F34C' to have the value 0xd83c?

I don't think it needs to be much different. I think we can borrow and slightly modify [lex.ccon]/3 for this purpose (see below).

> ...perhaps
> "The value of a wide-character literal containing a single c-char has value equal to the numerical value of the encoding of the c-char in the execution wide-character encoding, provided it is representable with a single code unit in that encoding"

I suggest replacing [lex.ccon]/6 with:

"A character literal that begins with the letter L, such as L'z', is a wide-character literal. A wide-character literal has type wchar_t. The value of a wide-character literal containing a single c-char has value equal to the numerical value of the encoding of the c-char in the execution wide-character set provided that the c-char value is representable in the execution wide-character set and is representable with a single code unit. If the c-char is not representable in the execution wide-character set or would require multiple code units, then the value is implementation-defined. The value of a wide-character literal containing multiple c-chars is implementation-defined."

Note that updates will still be needed for [basic.fundamental]/5, and possibly other places.
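
One way to read the suggested wording, sketched in C++17 under the assumption that the execution wide-character encoding is UTF-16 with 16-bit code units. `specified_wide_literal_value` is a hypothetical model of the rule, not compiler behavior: it yields a value where the wording specifies one and `std::nullopt` where the value would be implementation-defined.

```cpp
#include <cassert>
#include <optional>

// Hypothetical model of the suggested rule. Assumes UTF-16 as the execution
// wide-character encoding and that `cp` is a valid Unicode scalar value.
constexpr std::optional<char16_t> specified_wide_literal_value(char32_t cp) {
    if (cp > 0xFFFF) {
        // Would need two code units: the wording leaves the value
        // implementation-defined, so this model reports "no specified value".
        return std::nullopt;
    }
    return static_cast<char16_t>(cp);  // exactly one code unit
}

// L'z' has a specified value under the rule.
static_assert(*specified_wide_literal_value(U'z') == u'z');
// L'\U0001F34C' does not; an implementation could still choose 0xD83C.
static_assert(!specified_wide_literal_value(0x1F34C).has_value());
```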

@tahonermann
Member Author

> Windows does tricks to conform to the standard. Whenever you use C or C++ standard library stuff, wprintf, wscanf, setlocale, std::locale, etc. wchar_t strings are UCS2 strings ...

Can you provide some evidence for that claim? It doesn't match my understanding.

@cubbimew

cubbimew commented May 7, 2018

MS CRT's state notwithstanding, this raises a point about the specification of the C library's APIs. In order to support wchar_t representing a code unit rather than a code point, mbrtowc and wcrtomb would have to be modified to match mbrtoc16 and (after a post-C11 defect report) c16rtomb.

MSDN agrees: mbrtoc16 lists the -3 return code (and actually works as expected in my tests); mbrtowc doesn't.
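
The difference can be observed by driving `std::mbrtoc16` in a loop. The sketch below is an illustration only: the wrapper `convert_with_mbrtoc16` and the struct `Utf16Conversion` are invented here, and the result depends on the C library and on the multibyte encoding of the active locale.

```cpp
#include <cassert>
#include <clocale>
#include <cstddef>
#include <cstring>
#include <cuchar>
#include <cwchar>
#include <vector>

struct Utf16Conversion {
    std::vector<char16_t> units;  // code units produced, without the terminator
    bool saw_minus_3 = false;     // true if any call returned (size_t)-3
    bool ok = true;               // false on an invalid or incomplete sequence
};

// Converts a NUL-terminated multibyte string, interpreted in the encoding of
// the current C locale, one std::mbrtoc16 call at a time.
inline Utf16Conversion convert_with_mbrtoc16(const char* s) {
    Utf16Conversion result;
    std::mbstate_t state{};
    const char* p = s;
    std::size_t remaining = std::strlen(s) + 1;  // include the terminator
    for (;;) {
        char16_t unit = 0;
        const std::size_t rc = std::mbrtoc16(&unit, p, remaining, &state);
        if (rc == static_cast<std::size_t>(-1) ||
            rc == static_cast<std::size_t>(-2)) {
            result.ok = false;  // encoding error or truncated sequence
            return result;
        }
        if (rc == 0) {
            return result;  // reached the terminating NUL
        }
        result.units.push_back(unit);
        if (rc == static_cast<std::size_t>(-3)) {
            result.saw_minus_3 = true;  // second unit of a pair, no input consumed
        } else {
            p += rc;
            remaining -= rc;
        }
    }
}
```

With a UTF-8 locale active, on a C library that implements the C11 behavior, the four-byte sequence "\xF0\x9F\x8D\x8C" (U+1F34C) should yield two code units, the second delivered by a call returning `(size_t)-3`. `mbrtowc` has no corresponding return value for a pending second code unit.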

@dimztimz

It's all messed up on MSDN; this is as close as I can get:
https://msdn.microsoft.com/en-us/library/x99tb11d.aspx

> The set of available locale names, languages, country/region codes, and code pages includes all those supported by the Windows NLS API except code pages that require more than two bytes per character, such as UTF-7 and UTF-8.

This means the mb*to*w functions won't convert UTF-8.

https://msdn.microsoft.com/en-us/library/windows/desktop/dd319072(v=vs.85).aspx

This one does. They both take char* and output into wchar_t*.

@tahonermann
Member Author

> MS CRT's state notwithstanding, this raises a point about the specification of the C library's APIs. In order to support wchar_t representing a code unit rather than a code point, mbrtowc and wcrtomb would have to be modified to match mbrtoc16 and (after a post-C11 defect report) c16rtomb.

I agree that the specification of mbrtowc and wcrtomb (and other wide-character-related functions) will require updates. However, I think the goal would be to update them to specify implementation-defined behavior if a code point would require multiple code units, rather than updating them to match mbrtoc16 and c16rtomb. I don't think we should try to impose behavioral changes on existing implementations (even if the behavior is broken).

Note that Microsoft currently does not support using setlocale to specify UTF-7 or UTF-8 [1]. Provoking one of these functions into producing a surrogate pair would therefore require a locale with a non-Unicode encoding that has characters mapped outside the BMP. It would be interesting to test what actually happens in this case.

[1]: https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/setlocale-wsetlocale

@tahonermann
Member Author

And I now see that I basically restated some things that @dimztimz already stated. Sorry for the redundancy!

@dimztimz

Trying to redefine wide strings as variable-length encoded strings will most likely fail; it breaks too much. I'd vote against such a feature.

Instead, let's focus on defining char16_t and char32_t right: first the language features, i.e. impose UTF-16 and UTF-32, and then define library features.

@tahonermann
Member Author

> Trying to redefine wide strings as variable-length encoded strings will most likely fail; it breaks too much. I'd vote against such a feature.

The goal is to update the standard to reflect actual existing practice, not to change actual behavior. The updates should not require any implementors to change behavior, nor break any existing code.

> Instead, let's focus on defining char16_t and char32_t right: first the language features, i.e. impose UTF-16 and UTF-32, and then define library features.

I don't see a need for a dependency relationship here. Martinho submitted P1041R0 for the pre-Rapperswil mailing to mandate use of UTF-16/UTF-32 for char16_t/char32_t. See issue #6. I agree we'll need to expand library support for char16_t/char32_t.
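
For comparison, a sketch of what UTF-16 and UTF-32 encodings for char16_t and char32_t look like at the literal level. It assumes a C++17 compiler that already encodes `u""` and `U""` literals this way, and `code_unit_count` is a hypothetical helper introduced for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Number of code units in a string literal, excluding the terminator.
template <typename CharT, std::size_t N>
constexpr std::size_t code_unit_count(const CharT (&)[N]) {
    return N - 1;
}

// UTF-16: U+10000 becomes a surrogate pair.
static_assert(code_unit_count(u"\U00010000") == 2);
static_assert(u"\U00010000"[0] == 0xD800);
static_assert(u"\U00010000"[1] == 0xDC00);

// UTF-32: U+10000 is a single code unit.
static_assert(code_unit_count(U"\U00010000") == 1);
static_assert(U"\U00010000"[0] == 0x10000);
```

Unlike the wchar_t assertions earlier in the thread, these do not depend on the width of wchar_t.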

@cubbimew

cubbimew commented May 10, 2018

> I agree we'll need to expand library support for char16_t/char32_t.

This was attempted by n2035 in 2006 and only a tiny part of it passed through WG21 in Portland:

char_traits: 12 0
iostream: 1 9
fstream: 3 4
sstream: 2 1
facets (excluding codecvt): 3 4
codecvt: 11 0
regex: 2 7

(votes also listed in the next revision, n2207 )

@tahonermann
Member Author

> This was attempted by n2035 in 2006 and only a tiny part of it passed through WG21 in Portland:

Thanks for that reference. What was proposed may not be what we would want to propose now. I meant that, generally, we need to expand support for char16_t/char32_t.

@tahonermann added the "paper needed" label on Aug 6, 2018
@tahonermann
Member Author

I've started on a paper to address this, so assigning myself to it.

@tahonermann self-assigned this on Mar 1, 2020
@tahonermann removed the "help wanted" label on Mar 1, 2020
@tahonermann removed their assignment on Sep 29, 2021
@tahonermann
Member Author

Removed myself as an assignee since Corentin now has a draft paper (D2460R0) to address this issue.

@tahonermann
Member Author

I am closing this issue as resolved by the adoption of P2460R2 for C++23 despite remaining issues. The paper as adopted relaxes the restriction that wchar_t be able to hold all members of the character set associated with the wide literal encoding, but does not relax that restriction on the (run-time locale sensitive) execution wide-character set used by the standard library. The adopted change matches existing practice as demonstrated by Microsoft's implementation. Further work to relax restrictions for the standard library awaits a proposal with an acceptable migration plan.
