Requiring wchar_t to represent all members of the execution wide character set does not match existing practice #9
Pretty sure the execution wide character set on Windows is the set formerly known as "UCS-2".
In reality, not in ages. As I understand it, UCS-2 went away in Windows at least 10 years ago.
Microsoft switched from UCS-2 to UTF-16 with the Windows 2000 release. Of course, by then, it was too late to change the size of wchar_t. MSVC encodes characters outside the BMP using surrogate code points, as would be expected. For example, the following code is accepted with Visual Studio 2017 (with the /std:c++latest option):
I guess an argument could be made that U+10000 is not technically a member of the execution wide character set because it can't be represented in a single wchar_t.
Yes, the encoding of wide string literals and the recently added filesystem paths is UTF-16, but wide character literals and the codecvts aren't. I like the idea of standardizing existing practice (even if I don't like the practice), but how different would [lex.ccon]/6 have to become to make it okay for …
…perhaps …
Windows does tricks to conform to the standard. Whenever you use C or C++ standard library facilities, … All supported multibyte locales on Windows with …
I'm not sure I'm following here. I don't think it makes much sense to think of wide character literals as having any particular encoding since they can only produce a single code unit. MSVC behaves a bit oddly (as you noticed) by mapping code points outside the BMP to the first code unit of their encoded representation. For example, MSVC accepts the following:
I wouldn't be surprised if this is just an artifact of the implementation and not intentional behavior. With regard to …
I don't think it needs to be much different. I think we can borrow and slightly modify [lex.ccon]/3 for this purpose (see below).
I suggest replacing [lex.ccon]/6 with: "A character literal that begins with the letter L, such as L'z', is a wide-character literal. A wide-character literal has type wchar_t. A wide-character literal containing a single c-char has a value equal to the numerical value of the encoding of the c-char in the execution wide-character set, provided that the c-char is representable in the execution wide-character set and is representable with a single code unit. If the c-char is not representable in the execution wide-character set or would require multiple code units, then the value is implementation-defined. The value of a wide-character literal containing multiple c-chars is implementation-defined." Note that updates will still be needed for [basic.fundamental]/5, and possibly other places.
Can you provide some evidence for that claim? It doesn't match my understanding.
The MS CRT's state notwithstanding, this raises a point about the specification of the C library's APIs. In order to support wchar_t representing a code unit rather than a code point, mbrtowc and wcrtomb would have to be modified to match mbrtoc16 and (after a post-C11 defect report) c16rtomb. MSDN agrees: mbrtoc16 lists the (size_t)-3 return code (and actually works as expected in my tests), while mbrtowc doesn't.
It's all messed up on MSDN; this is as close as I can get: … This means … https://msdn.microsoft.com/en-us/library/windows/desktop/dd319072(v=vs.85).aspx This one does. They both take …
I agree that the specification of … Note that Microsoft currently does not support using … [1]: https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/setlocale-wsetlocale
And I now see that I basically restated some things that @dimztimz already stated. Sorry for the redundancy! |
Trying to redefine wide strings as variable-length encoded strings will most likely fail; it breaks too much. I'd vote against such a feature. Instead, let's focus on defining char16_t and char32_t right: first the language features, i.e. impose UTF-16 and UTF-32, and then define library features.
The goal is to update the standard to reflect actual existing practice, not to change actual behavior. The updates should not require any implementors to change behavior, nor break any existing code.
I don't see a need for a dependency relationship here. Martinho submitted P1041R0 for the pre-Rapperswil mailing to mandate use of UTF-16/UTF-32 for char16_t/char32_t. See issue #6. I agree we'll need to expand library support for char16_t/char32_t. |
This was attempted by N2035 in 2006, and only a tiny part of it passed through WG21 in Portland: char_traits, 12–0 (votes are also listed in the next revision, N2207).
Thanks for that reference. What was proposed may not be what we would want to propose now. I meant that, generally, we need to expand support for char16_t/char32_t. |
I've started on a paper to address this, so assigning myself to it. |
Removed myself as an assignee since Corentin now has a draft paper (D2460R0) to address this issue. |
I am closing this issue as resolved by the adoption of P2460R2 for C++23, despite remaining issues. The paper as adopted relaxes the restriction that wchar_t be able to represent all members of the execution wide character set.
5.13.3 [lex.ccon] p6 (http://eel.is/c++draft/lex.ccon#6) specifies that the value of a wide-character literal containing a single c-char is the numerical value of the encoding of that character in the execution wide-character set.
6.7.1 [basic.fundamental] p5 (http://eel.is/c++draft/basic.fundamental#5) requires wchar_t to be a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales.
However, on Windows, wchar_t is 16-bit and unable to represent all members of the execution wide character set. The standard should be updated to reflect existing practice.