[lex.charset] Change "short name" to "short identifier" to match ISO 10646 #2201
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -208,10 +208,10 @@ | |
\end{bnf} | ||
|
||
The character designated by the \grammarterm{universal-character-name} \tcode{\textbackslash | ||
- UNNNNNNNN} is that character whose character short name in ISO/IEC 10646 is | ||
- \tcode{NNNNNNNN}; the character designated by the \grammarterm{universal-character-name} | ||
- \tcode{\textbackslash uNNNN} is that character whose character short name in | ||
- ISO/IEC 10646 is \tcode{0000NNNN}. If the hexadecimal value for a | ||
+ U00NNNNNN} is that character whose character short identifier in ISO/IEC 10646 is | ||
+ \tcode{NNNNNN}; the character designated by the \grammarterm{universal-character-name} | ||
What if I write

Yes, the short identifier concept gives more than one identifier for each character. For the character with scalar value 0x4A, all of the following are valid short identifiers: 00004A, 004A, +00004A, +004A, U00004A, U004A, U+00004A, U+004A, 00004a, 004a, +00004a, +004a, U00004a, U004a, U+00004a, U+004a, u00004A, u004A, u+00004A, u+004A, u00004a, u004a, u+00004a, u+004a. Any of those unambiguously identifies the same character. If there were more A-F digits, there would be even more possible identifiers (there are 384 possible short identifiers for 0xAAAAA). The syntax is given with the description I quoted above, and also with the following BNF:
where “x” represents one hexadecimal digit (0 to 9, A to F, or a to f), and with the additional requirement that the 5-digit form is not allowed to have leading zeros (so 0041 and 000041 are both valid, but 00041 isn't). I don't know why the choice was made to have this much flexibility. Referring to the hexadecimal value may actually be a better choice; that wording is actually used in the very next sentence to forbid surrogates. If we want to do it this way, I would rewrite in the following manner.
(That last bit can also be "has the hexadecimal value NNNN", without the leading zeros.) Also, for clarity, ISO 10646 defines "code point" as "value in the UCS codespace" (UCS being short for the character set specified by ISO 10646). I originally just did s/short name/short identifier/ because that produced minimal changes, but I can rephrase it in terms of hexadecimal value as above if that's preferred. |
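The counts quoted in this comment can be checked with a small sketch. It assumes the reading of the short identifier syntax given above (an optional prefix `+`, `U`, `U+`, `u`, or `u+`, followed by four to six hexadecimal digits in either case, with no leading zero allowed in the five-digit form); it has not been checked against the text of ISO/IEC 10646 itself. The names `hex_digits`, `case_variants`, and `short_identifiers` are hypothetical helpers introduced only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Render `value` as exactly `width` uppercase hexadecimal digits, zero-padded.
std::string hex_digits(std::uint32_t value, int width) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out(static_cast<std::size_t>(width), '0');
    for (int i = width - 1; i >= 0; --i) {
        out[static_cast<std::size_t>(i)] = digits[value & 0xFu];
        value >>= 4;
    }
    return out;
}

// Collect every spelling of `s` in which each letter A-F is upper or lower case.
void case_variants(const std::string& s, std::size_t pos, std::string current,
                   std::vector<std::string>& out) {
    if (pos == s.size()) {
        out.push_back(current);
        return;
    }
    const char c = s[pos];
    case_variants(s, pos + 1, current + c, out);
    if (c >= 'A' && c <= 'F') {
        case_variants(s, pos + 1, current + static_cast<char>(c - 'A' + 'a'), out);
    }
}

// Enumerate the short identifiers of `value` under the assumed syntax.
std::vector<std::string> short_identifiers(std::uint32_t value) {
    assert(value <= 0x10FFFF);  // assumed upper bound of the UCS codespace

    std::vector<std::string> digit_forms;
    if (value <= 0xFFFF) {
        digit_forms.push_back(hex_digits(value, 4));  // leading zeros allowed
    }
    if (value >= 0x10000 && value <= 0xFFFFF) {
        digit_forms.push_back(hex_digits(value, 5));  // no leading zero allowed
    }
    digit_forms.push_back(hex_digits(value, 6));      // leading zeros allowed

    const char* prefixes[] = {"", "+", "U", "U+", "u", "u+"};
    std::vector<std::string> result;
    for (const char* prefix : prefixes) {
        for (const std::string& form : digit_forms) {
            std::vector<std::string> spellings;
            case_variants(form, 0, "", spellings);
            for (const std::string& s : spellings) {
                result.push_back(prefix + s);
            }
        }
    }
    return result;
}
```

Under these assumptions, `short_identifiers(0x4A)` yields the 24 spellings listed above and `short_identifiers(0xAAAAA)` yields 384, which is why a single "the short identifier is NNNN" phrasing cannot pick out exactly one spelling.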
||
+ \tcode{\textbackslash uNNNN} is that character whose character short identifier in | ||
+ ISO/IEC 10646 is \tcode{NNNN}. If the hexadecimal value for a | ||
Sorry for the drive-by, but why do we say "hexadecimal value"? Why not just "value"? In which way does the value depend on a particular serialization format?

Can we keep that separate, please? This is enough of a tar pit already, and might benefit from a more wholesale rework. |
||
\grammarterm{universal-character-name} corresponds to a surrogate code point (in the | ||
range 0xD800--0xDFFF, inclusive), the program is ill-formed. Additionally, if | ||
the hexadecimal value for a \grammarterm{universal-character-name} outside | ||
|
This rewording has lost the specification for a universal-character-name beginning with \U01 (etc.). I think we need a normative change to properly address this: it doesn't seem right to just remove the specification for these cases, but the old specification is clearly wrong, as there is no character with the specified short identifier.
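To make the gap concrete, here is a rough sketch of how the eight-digit value of a `\UNNNNNNNN` could be sorted into the cases under discussion. The enum `UcnClass` and the function `classify_ucn` are hypothetical names, and the sketch assumes the UCS codespace ends at 0x10FFFF. It does not settle what an implementation should do with the last two cases; that is exactly the open normative question.

```cpp
#include <cstdint>

// Hypothetical classification of the numeric value spelled by a
// universal-character-name (NNNNNNNN for \UNNNNNNNN, 0000NNNN for \uNNNN).
enum class UcnClass {
    Designates,            // names a character under the reworded text
    Surrogate,             // 0xD800-0xDFFF inclusive: program is ill-formed
    OutsideCodespace,      // \U00NNNNNN form, but above the assumed 0x10FFFF limit
    NotCoveredByRewording  // leading two digits nonzero, e.g. \U01000000
};

constexpr UcnClass classify_ucn(std::uint32_t value) {
    return value > 0xFFFFFF ? UcnClass::NotCoveredByRewording
         : value > 0x10FFFF ? UcnClass::OutsideCodespace
         : (value >= 0xD800 && value <= 0xDFFF) ? UcnClass::Surrogate
         : UcnClass::Designates;
}

// Usage: the cases raised in the review thread.
static_assert(classify_ucn(0x0000004A) == UcnClass::Designates, "\\U0000004A");
static_assert(classify_ucn(0x0000D800) == UcnClass::Surrogate, "\\U0000D800");
static_assert(classify_ucn(0x01000000) == UcnClass::NotCoveredByRewording, "\\U01000000");
```

The old wording covered the last case syntactically but pointed at a short identifier that no character has; the new wording says nothing about it at all.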