[lex.charset] Change "short name" to "short identifier" to match ISO 10646 #2201

Closed
8 changes: 4 additions & 4 deletions source/lex.tex
@@ -208,10 +208,10 @@
\end{bnf}

The character designated by the \grammarterm{universal-character-name} \tcode{\textbackslash
UNNNNNNNN} is that character whose character short name in ISO/IEC 10646 is
\tcode{NNNNNNNN}; the character designated by the \grammarterm{universal-character-name}
\tcode{\textbackslash uNNNN} is that character whose character short name in
ISO/IEC 10646 is \tcode{0000NNNN}. If the hexadecimal value for a
U00NNNNNN} is that character whose character short identifier in ISO/IEC 10646 is
Member

This rewording has lost the specification for a universal-character-name beginning with \U01 (etc). I think we need a normative change to properly address this -- it doesn't seem right to just remove the specification for these cases, but the old specification is clearly wrong, as there is no character with the specified short identifier.

\tcode{NNNNNN}; the character designated by the \grammarterm{universal-character-name}
Member

What if I write \U00000041 in my source code? Does the character short identifier "000041" exist in Unicode? If it does exist, what about \u0041; is the character short identifier here "0041"? Why are there two identifiers naming the same thing? What about lowercase vs. uppercase hex digits? Should we refer to the hexadecimal value somehow?

Author

@rmartinho rmartinho Jun 27, 2018

Yes, the short identifier concept gives more than one identifier for each character. For the character with scalar value 0x4A, all of the following are valid short identifiers: 00004A, 004A, +00004A, +004A, U00004A, U004A, U+00004A, U+004A, 00004a, 004a, +00004a, +004a, U00004a, U004a, U+00004a, U+004a, u00004A, u004A, u+00004A, u+004A, u00004a, u004a, u+00004a, u+004a. Any of those unambiguously identifies the same character. If there were more A-F digits, there would be even more possible identifiers (there are 384 possible short identifiers for 0xAAAAA). The syntax is given with the description I quoted above, and also with the following BNF:

{ U | u } {+}(xxxx | xxxxx | xxxxxx)

where “x” represents one hexadecimal digit (0 to 9, A to F, or a to f), and with the additional requirement that the 5-digit form is not allowed to have leading zeros (so 0041 and 000041 are both valid, but 00041 isn't). I don't know why the choice was made to have this much flexibility.
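As a way to check the counts above, here is a minimal Python sketch that enumerates every string matching the quoted syntax for a given code point. It is not taken from ISO/IEC 10646 or from this PR: the function name `short_identifiers` is hypothetical, and the rules it encodes (optional `U`/`u`, optional `+`, 4 to 6 hexadecimal digits in either case, no leading zeros in the 5-digit form) are only those described in this comment.

```python
from itertools import product


def short_identifiers(code_point):
    """Enumerate spellings matching { U | u } {+}(xxxx | xxxxx | xxxxxx).

    Hypothetical helper: it follows the rules as described in this thread,
    not the normative text of ISO/IEC 10646.
    """
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    digits = format(code_point, "X")  # uppercase hex, no padding
    prefixes = ["", "+", "U", "u", "U+", "u+"]
    result = set()
    for width in (4, 5, 6):
        if len(digits) > width:
            continue  # value does not fit in this many digits
        padded = digits.zfill(width)
        if width == 5 and padded.startswith("0"):
            continue  # the 5-digit form may not have leading zeros
        # Each A-F digit may be written in upper or lower case.
        choices = [(c, c.lower()) if c in "ABCDEF" else (c,) for c in padded]
        for combo in product(*choices):
            body = "".join(combo)
            for prefix in prefixes:
                result.add(prefix + body)
    return result


print(len(short_identifiers(0x4A)))
print(len(short_identifiers(0xAAAAA)))
```

Under those assumptions the sketch yields 24 spellings for 0x4A and 384 for 0xAAAAA, matching the figures given above, and it rejects `00041` while accepting `0041` and `000041`.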

Referring to the hexadecimal value may be a better choice; that wording is already used in the very next sentence to forbid surrogates. If we want to do it this way, I would rewrite the paragraph as follows.

The character designated by the universal-character-name \UNNNNNNNN is that character whose code point in ISO/IEC 10646 has the hexadecimal value NNNNNNNN; the character designated by the universal-character-name \uNNNN is that character whose code point in ISO/IEC 10646 has the hexadecimal value 0000NNNN.

(That last bit can also be "has the hexadecimal value NNNN", without the leading zeros.)

Also for clarity, ISO 10646 defines "code point" as "value in the UCS codespace" (UCS being short for the character set specified by ISO 10646).
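To illustrate why phrasing the rule in terms of the hexadecimal value sidesteps the multiple-identifier question, here is a rough Python sketch of the proposed mapping. The function name `ucn_code_point` is hypothetical, and the sketch only models the two forms `\uNNNN` and `\UNNNNNNNN` plus the existing surrogate restriction (0xD800 to 0xDFFF). It does not address values that have no corresponding code point, such as the `\U01...` case raised earlier in this review.

```python
import re

# \uNNNN (4 hex digits) or \UNNNNNNNN (8 hex digits), either digit case.
_UCN = re.compile(r"\\(?:u([0-9A-Fa-f]{4})|U([0-9A-Fa-f]{8}))")


def ucn_code_point(spelling):
    """Return the code point value designated by a universal-character-name.

    Hypothetical helper modelling the proposed wording: the designated
    character is the one whose code point has the given hexadecimal value.
    """
    match = _UCN.fullmatch(spelling)
    if match is None:
        raise ValueError("not a universal-character-name: %r" % spelling)
    value = int(match.group(1) or match.group(2), 16)
    if 0xD800 <= value <= 0xDFFF:
        # Mirrors the sentence that makes surrogate code points ill-formed.
        raise ValueError("surrogate code point: %#06x" % value)
    return value


# Both spellings designate the same code point, regardless of digit case.
print(ucn_code_point(r"\u0041") == ucn_code_point(r"\U00000041"))
print(ucn_code_point(r"\u004a") == ucn_code_point(r"\u004A"))
```

Because the comparison is on the numeric value, leading zeros and the case of the hexadecimal digits stop mattering, which is the point of the proposed rewording.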

I originally just did s/short name/short identifier/ because that produced minimal changes, but I can rephrase it in terms of hexadecimal value as above if that's preferred.

\tcode{\textbackslash uNNNN} is that character whose character short identifier in
ISO/IEC 10646 is \tcode{NNNN}. If the hexadecimal value for a
Contributor

Sorry for the driveby, but why do we say "hexadecimal value"? Why not just "value"? In which way does the value depend on a particular serialization format?

Member

Can we keep that separate, please? This is enough of a tar pit already, and might benefit from a more wholesale rework.

\grammarterm{universal-character-name} corresponds to a surrogate code point (in the
range 0xD800--0xDFFF, inclusive), the program is ill-formed. Additionally, if
the hexadecimal value for a \grammarterm{universal-character-name} outside