WG21 P1949: Improve support for Unicode characters in identifiers #48

Closed
tahonermann opened this issue May 15, 2019 · 24 comments
Labels: enhancement, WG21-tracked

Comments

@tahonermann
Member

JF raised this issue on the SG16 mailing list.

Briefly, the standard allows Unicode characters outside the basic source character set to be used in identifiers, as specified by [lex.name]p1. The standard does not provide a rationale for the ranges of allowed characters that it specifies. It is likely that the specified ranges are not being maintained as new characters are added in new Unicode releases.

The Unicode Consortium has published UAX#31, a Unicode Standard Annex covering identifier syntax. This document may provide a better basis for the C++ standard's allowances for Unicode characters outside the basic source character set in identifier names.

@tahonermann added the enhancement, help wanted, and paper needed labels on May 15, 2019
@tahonermann
Member Author

cppreference.com has a more informative list of the ranges of allowed characters in identifiers.
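
For illustration, here is a minimal sketch of the kind of range-table check that [lex.name] implied at the time. The ranges are only a small excerpt, reproduced from memory of the C++11-era list, so treat them as an assumption and check them against cppreference or the standard. `Range`, `kAllowedExcerpt`, and `in_excerpt` are hypothetical names used only here.

```cpp
#include <cassert>

// Inclusive range of code points.
struct Range {
    char32_t first;
    char32_t last;
};

// EXCERPT ONLY (assumption, quoted from memory): a few of the ranges that
// the pre-P1949 wording allowed in identifiers. The real list is longer.
constexpr Range kAllowedExcerpt[] = {
    {0x00C0, 0x00D6}, {0x00D8, 0x00F6}, {0x00F8, 0x00FF},
    {0x0100, 0x167F}, {0x2070, 0x218F}, {0x10000, 0x1FFFD},
};

// True if cp falls inside one of the excerpted ranges.
constexpr bool in_excerpt(char32_t cp) {
    for (const Range& r : kAllowedExcerpt) {
        if (cp >= r.first && cp <= r.last) {
            return true;
        }
    }
    return false;
}
```

With these ranges, U+00E9 (é) is accepted, U+00D7 (×) is rejected, and U+1F699 falls into the broad 10000-1FFFD range. Fixed ranges like these are not tied to Unicode character properties, which is the maintenance concern raised above.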

@ThePhD
Collaborator

ThePhD commented May 15, 2019

@strega-nil

It's unlikely to happen now, but if at all possible it'd be really good to normalize identifiers to NFC.

@cor3ntin
Collaborator

cor3ntin commented Aug 2, 2019

@ubsan agreed, it is strongly encouraged by UAX#31

However, it is putting the cart before the horse.
We only have Unicode identifiers portably if the physical character set is able to represent all Unicode code points.
So before any improvement can be made in this area, we need a way to ensure the compiler will treat the file in which such an identifier is used as utf-8 (or some other sensible Unicode encoding, such as utf8).

@tahonermann
Member Author

> We only have Unicode identifiers portably if the physical character set is able to represent all Unicode code points.

That is not strictly correct as identifiers can contain \u1234 escape sequences. I will not comment on the utility of such escape sequences in identifiers other than to say I've never used one outside of a test :)
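
A minimal sketch of what that looks like: the identifier below is spelled with a universal-character-name, so the source file itself stays pure ASCII. The names `caf\u00E9` and `twice_caf\u00E9` are made up for illustration, and the sketch assumes a compiler that accepts universal-character-names in identifiers.

```cpp
#include <cassert>

// The source bytes are plain ASCII; the escape names U+00E9 (e with acute).
int caf\u00E9 = 42;

// Every later use has to spell the escape out again, which is why this
// form rarely shows up outside of tests.
int twice_caf\u00E9() {
    return 2 * caf\u00E9;
}
```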

> So before any improvement can be made in this area, we need a way to ensure the compiler will treat the file in which such an identifier is used as utf-8 (or some other sensible Unicode encoding, such as utf8)

I don't agree with this conclusion. The standard is clear regarding how physical source file characters are mapped to the compiler's internal encoding. Source files are portable so long as the compilers used with them 1) support the actual source file encoding, and 2) are correctly informed about the source file encoding. In my opinion, it is that latter case that we need to improve.

@cor3ntin
Collaborator

cor3ntin commented Aug 2, 2019

> Source files are portable so long as the compilers used with them 1) support the actual source file encoding

That's the definition of not portable.

Interestingly, Microsoft solves that particular problem by always parsing identifiers as utf8 regardless of the actual encoding of the file.
That falls apart if you add reflection to the mix.
At that point identifiers are text, and the conversion needs to be deterministic and lossless.

The standard is not clear. It is completely implementation-defined, a.k.a. not portable.
Agreed about 2), but having to specify utf8 is a terrible default.

As a data point, I learned today that vcpkg builds every package on Windows with /utf-8.

@tahonermann
Member Author

> That's the definition of not portable

You’ll have to walk me through to that conclusion.

> Interestingly, Microsoft solves that particular problem by always parsing identifiers as utf8 regardless of the actual encoding of the file.

I’m not sure what you mean by that. Perhaps you mean that identifiers are transcoded from the source file encoding to UTF-8 and then used in that form? Microsoft uses UTF-8 as the internal encoding, so that doesn’t seem surprising.

> The standard is not clear.

What isn’t clear?

> but having to specify utf8 is a terrible default.

I don’t disagree, but that doesn’t make it the wrong choice from a backward compatibility and migration perspective.

> As a data point, I learned today that vcpkg builds every package on Windows with /utf-8

Yes, I’ve discussed this with Robert previously. If I recall, he had done some scans and found little use of non-ASCII characters. I don’t find that at all surprising within the Windows ecosystem though since the default source file encoding for the Microsoft compiler is locale sensitive. Programmers on Windows that distribute source files have never been able to assume an encoding other than ASCII (and even that breaks with Shift-JIS). I don’t think the vcpkg experience generalizes particularly well.

@cor3ntin
Collaborator

cor3ntin commented Aug 3, 2019

> I’m not sure what you mean by that. Perhaps you mean that identifiers are transcoded from the source file encoding to UTF-8 and then used in that form? Microsoft uses UTF-8 as the internal encoding, so that doesn’t seem surprising.

No, they are NOT transcoded; the sequence of bytes making up the identifier does not seem to be parsed using the same encoding as the rest of the file.

Example provided by @ubsan: https://gcc.godbolt.org/z/O0309o

@cor3ntin
Collaborator

cor3ntin commented Aug 3, 2019

> That's the definition of not portable
> What isn’t clear?
> Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set

That works in a pre-internet, mono-platform environment. I cannot trust that. Compilers do not interpret source in a consistent fashion.

[And as you mentioned, that forces people to live in an ASCII-only world; the current solution is to compile everything with /utf-8]

@cor3ntin
Collaborator

cor3ntin commented Aug 3, 2019

If we want Unicode identifiers, escape sequences notwithstanding, we need to ensure that:

  • void é() has a consistent mangling across compilers that are otherwise ABI-compatible today (on a given platform)

  • int é is consistently normalized (which excludes interpreting things as bytes) regardless of platform and compiler

  • compilers have a common understanding of the set of valid identifiers

  • ranges::equal(meta::name_of(reflexpr(é)), nfc_view(u8"é")); is consistently true across platforms and compilers

struct é;
static_assert(is_same_v<unqualid(u8"é"), é>);

should be a valid program (unqualid is a utility that transforms a string into an identifier, part of the ongoing metaclasses work)

[é is an example; I'm not suggesting that it should be a valid variable name, I haven't studied UAX#31 enough yet]

Note that this presents an interesting issue: in the TS, name_of is specified to return an NTBS in the execution encoding

@cor3ntin
Collaborator

cor3ntin commented Aug 3, 2019

WG14 paper: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1518.htm

Interesting paper, but this bit

> I think the C and C++ standards should be silent on this whole topic. An implementer should be able to decide whether his implementation should normalize or not, and if so which normalization form should be used, based on his understanding of the needs of his customers. The implication of that would be that users should never name different things using identifiers that would normalize to the same string, nor attempt to reference something using anything but its exact name (for example, by using a name that would normalize to the same string as the original name).

is a deal breaker for me: this makes Unicode identifiers unusable with reflection, ABI, etc.
https://unicode.org/reports/tr31/#normalization_and_case

I'm not saying people should start putting non-ASCII identifiers in their interfaces, but if we want to give that ability, it needs to be reliable.
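
To make the concern concrete, here is a toy sketch of why unnormalized identifiers are fragile: é can be spelled as the single code point U+00E9 or as e followed by U+0301, and the two sequences render identically but compare unequal. `toy_nfc` is a hypothetical helper that composes only that one pair. It is not a real NFC implementation, which would need the full Unicode normalization data (for example from a library such as ICU).

```cpp
#include <cassert>
#include <string>

// TOY composition step (assumption: only the pair 'e' + U+0301 is handled).
// It stands in for real NFC purely to show the comparison problem.
std::u32string toy_nfc(const std::u32string& in) {
    std::u32string out;
    for (char32_t c : in) {
        if (c == 0x0301 && !out.empty() && out.back() == U'e') {
            out.back() = 0x00E9;  // compose into the precomposed e-acute
        } else {
            out.push_back(c);
        }
    }
    return out;
}

// Compare two identifier spellings code point by code point.
bool same_spelling(const std::u32string& a, const std::u32string& b) {
    return a == b;
}

// Compare two identifier spellings after (toy) normalization.
bool same_after_toy_nfc(const std::u32string& a, const std::u32string& b) {
    return toy_nfc(a) == toy_nfc(b);
}
```

If one compiler compares raw spellings and another normalizes first, the same pair of names is two entities on one and a single entity on the other.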

@tahonermann
Member Author

> No, they are NOT transcoded; the sequence of bytes making up the identifier does not seem to be parsed using the same encoding as the rest of the file

I think this conclusion is incorrect. I think what you are seeing is typical encoding confusion. In the example you provided, UTF-8 source code is being provided to the compiler, but the compiler is being told to interpret it as Windows 1252. The character in question, 🚙 (U+1F699 RECREATIONAL VEHICLE) has a UTF-8 representation of F0 9F 9A 99. In Windows 1252, this corresponds to "ðŸš™" (U+00F0, U+0178, U+0161, U+2122). Microsoft's documentation for allowed identifiers (https://docs.microsoft.com/en-us/cpp/cpp/identifiers-cpp?view=vs-2019) lists which Unicode code points are allowed. If you cross check that list with the Unicode code points for those characters, you'll see that each one is allowed in identifiers. As for Godbolt then displaying the original Unicode character in the disassembly window, I believe that is technically a bug in Godbolt. The disassembly output is very likely Windows 1252, but is being interpreted as UTF-8.
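
The byte-level arithmetic can be checked with a short sketch. `to_utf8`, `cp1252_to_unicode`, and `misread_as_cp1252` are hypothetical helpers written for this illustration. They are not how any compiler implements decoding, and the table covers only the Windows 1252 bytes 0x80 to 0x9F, with 0 marking unassigned bytes.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Encode one Unicode scalar value as UTF-8 (no validation; sketch only).
std::string to_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Decode a single byte as Windows 1252. Bytes 0x80..0x9F differ from
// Latin-1 and go through a table; 0 means "unassigned" in this sketch.
char32_t cp1252_to_unicode(unsigned char b) {
    static const char32_t high[32] = {
        0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
        0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178,
    };
    if (b >= 0x80 && b <= 0x9F) {
        return high[b - 0x80];
    }
    return b;  // 0x00..0x7F and 0xA0..0xFF map to the same code point
}

// Interpret UTF-8 bytes as if they were Windows 1252 text, one byte at a time.
std::vector<char32_t> misread_as_cp1252(const std::string& utf8_bytes) {
    std::vector<char32_t> out;
    for (unsigned char b : utf8_bytes) {
        out.push_back(cp1252_to_unicode(b));
    }
    return out;
}
```

Calling `misread_as_cp1252(to_utf8(0x1F699))` yields four separate code points rather than one, matching the sequence listed above.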

@tahonermann
Member Author

> That works in a pre-internet, mono-platform environment. I cannot trust that. Compilers do not interpret source in a consistent fashion.

I don't see how that is relevant. The claim I made is that "Source files are portable so long as the compilers used with them 1) support the actual source file encoding, and 2) are correctly informed about the source file encoding". Compilers don't all have to have the same default behavior for source files to be portable.

> [And as you mentioned, that forces people to live in an ASCII-only world; the current solution is to compile everything with /utf-8]

Please don't tell people to use /utf-8. Tell them to use /source-charset:utf-8. Otherwise, their literals will be incorrectly encoded for the run-time execution encoding. I do think programmers should be using /source-charset:utf-8 if they are using the Microsoft compiler and don't have explicit reasons not to use it, but they should not be using /utf-8!

@tahonermann
Member Author

> If we want Unicode identifiers, escape sequences notwithstanding, we need to ensure that:

These are ABI issues and outside our purview.

> struct é;
> static_assert(is_same_v<unqualid(u8"é"), é>);
>
> should be a valid program (unqualid is a utility that transforms a string into an identifier, part of the ongoing metaclasses work)

It isn't at all clear to me that unqualid should accept a u8 string.

> Note that this presents an interesting issue: in the TS, name_of is specified to return an NTBS in the execution encoding

I think that is probably what is desired almost all of the time.

@tahonermann
Member Author

tahonermann commented Aug 3, 2019

> WG14 paper: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1518.htm

That is the paper that brought UAX#31 into the standard wording. See [lex.name]p1. Note that the paper is simultaneously WG21 N3146.

> Interesting paper, but this bit

The cited text pretty much matches what we just decided for file names for P1689.

> this makes Unicode identifiers unusable with reflection, ABI, etc.

I don't agree with that conclusion.

> I'm not saying people should start putting non-ASCII identifiers in their interfaces, but if we want to give that ability, it needs to be reliable

I do want to give programmers that ability and I agree it needs to be reliable. But I think there are multiple approaches to the problem with various pros and cons and it isn't evident to me that all implementors need to solve problems the same way.

@cor3ntin
Collaborator

cor3ntin commented Aug 3, 2019 via email

@cor3ntin
Collaborator

cor3ntin commented Aug 3, 2019 via email

@cor3ntin
Collaborator

cor3ntin commented Aug 3, 2019 via email

@tahonermann
Member Author

> Implementation-specific behaviors should be a last resort. I routinely work on 3 compilers on many platforms and I need to trust my tools. Failing to provide portable solutions leads to people restricting themselves to the portable subset, which is one of the reasons why nobody currently uses Unicode identifiers. A lot of people support a lot more platforms than I do.

I think we're on the same page here. Implementation defined behavior doesn't preclude portability; sometimes it just affects the level of abstraction required.

Thanks for doing that research; that is good information. There appears to be a clear trend towards normalization, particularly in languages that didn't start off with a normalizing implementation.

@tahonermann
Member Author

P1949 now tracks a solution for this issue.

@tahonermann added the paper revision needed label and removed the help wanted and paper needed labels on Nov 17, 2019
@tahonermann
Member Author

This issue is now tracked by cplusplus/papers#688.

@tahonermann changed the title from "Improve support for Unicode characters in identifiers" to "WG21 P1949: Improve support for Unicode characters in identifiers" on Mar 1, 2020
@tahonermann added the WG21-tracked label and removed the paper revision needed label on Mar 1, 2020
@peter-b
Collaborator

peter-b commented Sep 16, 2021

This is done!

@peter-b peter-b closed this as completed Sep 16, 2021
@rurban

rurban commented Jan 17, 2022

I've prepared a report and library for "C/C++ Identifier Security using Unicode Standard Annex 39", a massive improvement over TR31 alone.
See https://github.com/rurban/libu8ident/blob/master/c23%2B%2Bproposal.pdf

How do I file this officially for WG21/WG14? How do I get a P number?

@tahonermann
Member Author

Hi @rurban. Please send a link to your proposal to the SG16 mailing list. I or someone else will reply with instructions for how to request a P-number and submit your proposal.
