Requiring wchar_t to represent all members of the execution wide character set does not match existing practice #9

tahonermann · 2018-05-02T21:48:44Z

5.13.3 [lex.ccon] p6 states: (http://eel.is/c++draft/lex.ccon#6)

... [ Note: The type wchar_t is able to represent all members of the execution wide-character set
(see [basic.fundamental]). — end note ]

6.7.1 [basic.fundamental] p5 states: (http://eel.is/c++draft/basic.fundamental#5)

Type wchar_t is a distinct type whose values can represent distinct codes for all members of the
largest extended character set specified among the supported locales. ...

However, on Windows, wchar_t is 16-bit and unable to represent all members of the execution wide character set. The standard should be updated to reflect existing practice.

cubbimew · 2018-05-02T21:58:18Z

Pretty sure the execution wide character set on Windows is the set formerly known as "UCS-2".

steve-downey · 2018-05-03T00:24:58Z

In reality, not in ages. As I understand it, UCS-2 went away in Windows at least 10 years ago.
And if it does still exist, it should not be standardized where people will expect working Unicode.

tahonermann · 2018-05-03T04:14:52Z

> Pretty sure the execution wide character set on Windows is the set formerly known as "UCS-2".

Microsoft switched from UCS-2 to UTF-16 with the Windows 2000 release. Of course, by then, it was too late to change the size of wchar_t.

MSVC encodes characters outside the BMP using surrogate code points as would be expected. For example, the following code is accepted with Visual Studio 2017 (with the /std:c++latest option):

static_assert(L"\U00010000"[0] == 0xD800);
static_assert(L"\U00010000"[1] == 0xDC00);

I guess an argument could be made that U+10000 is not technically a member of the execution wide character set because it can't be represented in a single wchar_t value. But I think most people would find that argument inconsistent with what is generally considered a character.
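The two values checked above follow from the UTF-16 encoding arithmetic. Here is a minimal sketch of that arithmetic, assuming C++17; the names Utf16Units and utf16_encode are hypothetical and not part of any library:

```cpp
#include <cstdint>

// Hypothetical helper type: the UTF-16 code units for one code point.
struct Utf16Units {
    std::uint16_t units[2];
    int count;  // 1 for BMP code points, 2 for supplementary ones
};

// Encodes a Unicode scalar value as UTF-16. Assumes cp <= 0x10FFFF and
// that cp is not itself a surrogate (0xD800..0xDFFF).
constexpr Utf16Units utf16_encode(char32_t cp) {
    if (cp < 0x10000) {
        return {{static_cast<std::uint16_t>(cp), 0}, 1};
    }
    // Subtracting 0x10000 leaves a 20-bit value split into two 10-bit halves.
    const std::uint32_t v = static_cast<std::uint32_t>(cp) - 0x10000;
    return {{static_cast<std::uint16_t>(0xD800 + (v >> 10)),     // high surrogate
             static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF))},  // low surrogate
            2};
}

static_assert(utf16_encode(U'\U00010000').units[0] == 0xD800);
static_assert(utf16_encode(U'\U00010000').units[1] == 0xDC00);
static_assert(utf16_encode(U'z').count == 1);
```

Any code point above U+FFFF yields count == 2, which is the case that cannot fit in a single 16-bit wchar_t.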

cubbimew · 2018-05-04T17:22:23Z

Yes, encoding of wide string literals and the recently-added filesystem paths is UTF-16, but wide char literals and the codecvts aren't. I like the idea of standardizing existing practice (even if I don't like the practice), but how different would [lex.ccon]/6 have to become to make it okay for L'\U0001F34C' to have the value 0xd83c?

cubbimew · 2018-05-04T17:30:15Z

...perhaps
"The value of a wide-character literal containing a single c-char has value equal to the numerical value of the encoding of the c-char in the execution wide-character encoding, provided it is representable with a single code unit in that encoding"

dimztimz · 2018-05-05T21:03:05Z

Windows does tricks to conform to the standard. Whenever you use C or C++ standard library stuff (wprintf, wscanf, setlocale, std::locale, etc.), wchar_t strings are UCS-2 strings, and when you use WinAPI, wchar_t strings are UTF-16 strings.

All supported multibyte locales on Windows with setlocale/std::locale can be mapped to UCS-2. Because wchar_t is not UTF-32, Windows does not support setting UTF-8 via the standard library locales.

tahonermann · 2018-05-06T16:26:12Z

> Yes, encoding of wide string literals and the recently-added filesystem paths is UTF-16, but wide char literals and the codecvts aren't

I'm not sure I'm following here. I don't think it makes much sense to think of wide character literals as having any particular encoding since they can only produce a single code unit. MSVC behaves a bit oddly (as you noticed) by mapping code points outside the BMP to the first code unit of their encoded representation. For example, MSVC accepts the following:

static_assert(L'\U00010000' == 0xD800);

I wouldn't be surprised if this is just an artifact of the implementation and not intentional behavior.

With regard to codecvt, I presume you are under the impression that codecvt<wchar_t, char, mbstate_t> converts between the execution character set and UCS-2? I haven't tested, but I would be quite surprised if that were the case.

> I like the idea of standardizing existing practice (even if I don't like the practice), but how different would [lex.ccon]/6 have to become to make it okay for L'\U0001F34C' to have the value 0xd83c?

I don't think it needs to be much different. I think we can borrow and slightly modify [lex.ccon]/3 for this purpose (see below).

> ...perhaps
> "The value of a wide-character literal containing a single c-char has value equal to the numerical value of the encoding of the c-char in the execution wide-character encoding, provided it is representable with a single code unit in that encoding"

I suggest replacing [lex.ccon]/6 with:

"A character literal that begins with the letter L, such as L'z', is a wide-character literal. A wide-character literal has type wchar_t. The value of a wide-character literal containing a single c-char has value equal to the numerical value of the encoding of the c-char in the execution wide-character set provided that the c-char value is representable in the execution wide-character set and is representable with a single code unit. If the c-char is not representable in the execution wide-character set or would require multiple code units, then the value is implementation-defined. The value of a wide-character literal containing multiple c-chars is implementation-defined."
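To make the suggested rule concrete, here is a minimal sketch that models it for a Unicode wide-character encoding, assuming C++17. The function name and the code_unit_bits parameter are illustrative assumptions (16 stands for a UTF-16 wchar_t, 32 for a UTF-32 one), and an empty result stands for "implementation-defined" in the wording above:

```cpp
#include <cstdint>
#include <optional>

// Hypothetical model of the suggested wording: returns the specified value of
// a single-c-char wide-character literal, or nullopt where the wording leaves
// the value implementation-defined.
constexpr std::optional<std::uint32_t>
specified_wide_literal_value(char32_t c, unsigned code_unit_bits) {
    // Not representable at all: beyond U+10FFFF or a lone surrogate.
    const bool is_scalar = c <= 0x10FFFF && !(c >= 0xD800 && c <= 0xDFFF);
    if (!is_scalar) return std::nullopt;
    // Representable with a single code unit: any scalar value for 32-bit
    // code units, only the BMP for 16-bit code units.
    if (code_unit_bits >= 32 || c <= 0xFFFF) return static_cast<std::uint32_t>(c);
    // Would require multiple code units (a surrogate pair).
    return std::nullopt;
}

static_assert(specified_wide_literal_value(U'z', 16) == 0x7Au);
static_assert(!specified_wide_literal_value(U'\U0001F34C', 16));
static_assert(specified_wide_literal_value(U'\U0001F34C', 32) == 0x1F34Cu);
```

Under this model, MSVC's choice of 0xD83C for L'\U0001F34C' falls in the implementation-defined case rather than contradicting the wording.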

Note that updates will still be needed for [basic.fundamental]/5, and possibly other places.

tahonermann · 2018-05-07T03:50:47Z

> Windows does tricks to conform to the standard. Whenever you use C or C++ standard library stuff, wprintf, wscanf, setlocale, std::locale, etc. wchar_t strings are UCS2 strings ...

Can you provide some evidence for that claim? It doesn't match my understanding.

cubbimew · 2018-05-07T21:52:33Z

MS CRT's state notwithstanding, this raises a point about the specification of the C library's APIs. In order to support wchar_t representing a code unit rather than a code point, mbrtowc and wcrtomb would have to be modified to match mbrtoc16 and (after a post-C11 defect report) c16rtomb.

MSDN agrees: mbrtoc16 lists the -3 return code (and actually works as expected in my tests), mbrtowc doesn't.
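For readers unfamiliar with the -3 return code, here is a minimal sketch of a conversion loop built on std::mbrtoc16. The function name to_utf16 is hypothetical, and the sketch assumes a platform that provides <cuchar> and uses UTF-16 for char16_t. When a character needs a surrogate pair, the first call consumes the input bytes and yields the high surrogate, and the following call returns (size_t)-3 with the low surrogate without consuming any input:

```cpp
#include <cassert>
#include <cstddef>
#include <cuchar>
#include <cwchar>
#include <string>

// Converts a multibyte string in the current C locale's encoding to UTF-16.
// Stops at the first invalid or incomplete sequence.
inline std::u16string to_utf16(const std::string& in) {
    std::u16string out;
    std::mbstate_t state{};
    assert(std::mbsinit(&state) != 0);  // zero-initialized means initial state
    const char* p = in.data();
    const char* const end = p + in.size();
    bool pending = false;  // true right after a high surrogate was produced
    while (p < end || pending) {
        char16_t c16 = 0;
        const std::size_t rc =
            std::mbrtoc16(&c16, p, static_cast<std::size_t>(end - p), &state);
        if (rc == static_cast<std::size_t>(-3)) {
            out.push_back(c16);  // low surrogate; no input was consumed
            pending = false;
            continue;
        }
        if (rc == static_cast<std::size_t>(-1) ||
            rc == static_cast<std::size_t>(-2) || pending) {
            break;  // encoding error, truncated input, or missing low surrogate
        }
        out.push_back(c16);
        p += (rc == 0) ? 1 : rc;  // rc == 0 means a null character was converted
        pending = (c16 >= 0xD800 && c16 <= 0xDBFF);
    }
    return out;
}
```

mbrtowc has no equivalent of the -3 result, which is why it cannot deliver a surrogate pair through a single wchar_t output as currently specified.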

dimztimz · 2018-05-10T12:26:54Z

It's all messed up on MSDN; this is as close as I can get:
https://msdn.microsoft.com/en-us/library/x99tb11d.aspx

> The set of available locale names, languages, country/region codes, and code pages includes all those supported by the Windows NLS API except code pages that require more than two bytes per character, such as UTF-7 and UTF-8.

This means the mb*to*w functions won't convert UTF-8.

https://msdn.microsoft.com/en-us/library/windows/desktop/dd319072(v=vs.85).aspx

This one does. They both take char* and output into wchar_t*.

tahonermann · 2018-05-10T17:45:44Z

> MS CRT's state notwithstanding, this raises a point about the specification of the C library's APIs. In order to support wchar_t representing a code unit rather than a code point, mbrtowc and wcrtomb would have to be modified to match mbrtoc16 and (after a post-C11 defect report) c16rtomb.

I agree that the specification of mbrtowc and wcrtomb (and other wide character related functions) will require updates. However, I think the goal would be to update them to specify implementation-defined behavior if a code point would require multiple code units, rather than updating them to match mbrtoc16 and c16rtomb. I don't think we should try to impose behavioral changes on existing implementations (even if the behavior is broken).

Note that Microsoft currently does not support using setlocale to specify UTF-7 or UTF-8 [1], so provoking one of these functions such that a surrogate pair would be produced would require using a locale with a non-Unicode encoding that has characters that are mapped outside the BMP. It would be interesting to test what actually happens in this case.

[1]: https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/setlocale-wsetlocale

tahonermann · 2018-05-10T17:47:43Z

And I now see that I basically restated some things that @dimztimz already stated. Sorry for the redundancy!

dimztimz · 2018-05-10T18:07:03Z

Trying to redefine wide strings as variable-length encoded strings will most likely fail; it breaks too much. I'd vote against such a feature.

Instead, let's focus on defining char16_t and char32_t right: first the language features (i.e., impose UTF-16 and UTF-32), and then the library features.

tahonermann · 2018-05-10T18:44:36Z

> Trying to redefine wide strings as variable-length encoded strings will most likely fail; it breaks too much. I'd vote against such a feature.

The goal is to update the standard to reflect actual existing practice, not to change actual behavior. The updates should not require any implementors to change behavior, nor break any existing code.

> Instead, let's focus on defining char16_t and char32_t right: first the language features (i.e., impose UTF-16 and UTF-32), and then the library features.

I don't see a need for a dependency relationship here. Martinho submitted P1041R0 for the pre-Rapperswil mailing to mandate use of UTF-16/UTF-32 for char16_t/char32_t. See issue #6. I agree we'll need to expand library support for char16_t/char32_t.

cubbimew · 2018-05-10T18:51:52Z

> I agree we'll need to expand library support for char16_t/char32_t.

This was attempted by n2035 in 2006 and only a tiny part of it passed through WG21 in Portland:

char_traits: 12 0
iostream: 1 9
fstream: 3 4
sstream: 2 1
facets (excluding codecvt): 3 4
codecvt: 11 0
regex: 2 7

(votes also listed in the next revision, n2207)

tahonermann · 2018-05-10T19:07:37Z

> This was attempted by n2035 in 2006 and only a tiny part of it passed through WG21 in Portland:

Thanks for that reference. What was proposed may not be what we would want to propose now. I meant that, generally, we need to expand support for char16_t/char32_t.

tahonermann · 2020-03-01T01:50:34Z

I've started on a paper to address this, so assigning myself to it.

tahonermann · 2021-09-29T17:24:02Z

Removed myself as an assignee since Corentin now has a draft paper (D2460R0) to address this issue.

tahonermann · 2022-08-11T05:08:55Z

I am closing this issue as resolved by the adoption of P2460R2 for C++23 despite remaining issues. The paper as adopted relaxes the restriction that wchar_t be able to hold all members of the character set associated with the wide literal encoding, but does not relax that restriction on the (run-time locale sensitive) execution wide-character set used by the standard library. The adopted change matches existing practice as demonstrated by Microsoft's implementation. Further work to relax restrictions for the standard library awaits a proposal with an acceptable migration plan.

tahonermann added the help wanted and clarification labels May 2, 2018

tahonermann added the paper needed label Aug 6, 2018

tahonermann self-assigned this Mar 1, 2020

tahonermann removed the help wanted label Mar 1, 2020

tahonermann assigned cor3ntin Mar 28, 2021

tahonermann removed their assignment Sep 29, 2021

tahonermann removed the paper needed label Sep 29, 2021

Flamefire mentioned this issue Jun 20, 2022

<locale>: Converting UTF-8 sequences to wchar_t/UTF-16 surrogates impossible microsoft/STL#2788

Closed

tahonermann mentioned this issue Aug 11, 2022

WG14: Relax requirements on wchar_t to match existing practices #78

Open

tahonermann closed this as completed Aug 11, 2022
