Add paper to reword stuff related to ISO 10646

Addresses sg16-unicode#8
rmartinho · Jun 28, 2018 · c25df77 · c25df77
1 parent a322cc2
commit c25df77
Showing 1 changed file with 124 additions and 0 deletions.
diff --git a/papers/d0000_reword_for_10646.md b/papers/d0000_reword_for_10646.md
@@ -0,0 +1,124 @@
+# Address wording issues related to ISO 10646
+
+Document Number: DnnnnR0  
+Date: 2018-06-28  
+Audience: SG16, CWG
+Author: R. Martinho Fernandes
+Reply-to: cpp@rmf.io
+
+## Motivation
+
+Review of some editorial fixes following the recent update of the normative
+reference to ISO 10646 has unearthed a series of wording issues around the
+subject. This paper intends to fix those issues by rewording relevant
+paragraphs.
+
+## Proposal
+
+This paper addresses all of the following issues:
+
+1. The current wording in [lex.charset] does not specify what the behaviour is
+   for a universal-character-name without a corresponding short identifier in
+   ISO 10646.
+
+       For example, `\U99004141` and `\U00110000`. Neither of these designates
+       a code point in ISO 10646, but the standard is silent about this, which
+       makes the behaviour undefined by omission.
+
+       This paper addresses this by making such uses ill-formed, maintaining
+       consistency with the current treatment of surrogate values (`\U0000D800`
+       is already ill-formed).
+
+2. The current wording in [lex.charset] uses "hexadecimal value", which is
+   confusing because a value is just a number, and hexadecimal is just way to
+   represent numbers; "value" alone should suffice.
+
+       This paper addresses this by removing the need for this term.
+
+3. There is some interest in using the U+ notation (as in U+0041 or U+1F34A) to
+   refer to Unicode code points across the entire standard.
+
+       This paper changes all the relevant wording to use U+ notation.
+
+4. The current text includes explanations of terms from ISO 10646 (like
+   "surrogate code point" or "control character") in normative text, which is
+   undesirable.
+
+       This paper moves such explanations to non-normative text.
+
+## Technical Specifications
+
+In this description, text that should be deleted is marked red and striked out;
+text that should be added is marked green and underlined. Apply these changes
+on top of the editorial fix provided in [PR #2201].
+
+Edit 5.3 [lex.charset], paragraph 2 as follows.
+
+> <sup>2</sup> The *universal-character-name* construct provides a way to name
+> other characters.
+>
+> &nbsp;&nbsp;&nbsp;&nbsp;*hex-quad:*
+> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*hexadecimal-digit hexadecimal-digit hexadecimal-digit hexadecimal-digit*
+>
+> &nbsp;&nbsp;&nbsp;&nbsp;*universal-character-name:*
+> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*\u hex-quad*
+> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*\U hex-quad hex-quad*
+> 
+> The character designated by the *universal-character-name* `\U00NNNNNN` is
+> that character whose <del>character</del><ins>code point</ins> short
+> identifier in ISO/IEC 10646 is <del>`NNNNNN`</del><ins>U+NNNNNN</ins>; the
+> character designated by the *universal-character-name* `\uNNNN` is that
+> character whose <del>character</del><ins>code point</ins> short identifier in
+> ISO/IEC 10646 is <del>`NNNN`</del><ins>U+NNNN</ins>.  If <del>the hexadecimal
+> value for a *universal-character-name* corresponds to a surrogate code point
+> (in the range 0xD800–0xDFFF, inclusive)</del><ins>If a
+> *universal-character-name* does not correspond to any character in ISO/IEC
+> 10646 [*Note*&mdash;ISO/IEC 10646 code points are within the range
+> 0x0-0x10FFFF, inclusive.&mdash;*end note*] or if a
+> *universal-character-name* corresponds to a surrogate code point
+> [*Note*&mdash;A surrogate code point is a value in the range 0xD800-0xDFFF,
+> inclusive.&mdash;*end note*]</ins>, the program is ill-formed.
+> Additionally, if <del>the hexadecimal value for</del> a
+> *universal-character-name* outside the *c-char-sequence*, *s-char-sequence*,
+> or *r-char-sequence* of a character or string literal corresponds to a
+> control character <del>(</del><ins>[*Note*&mdash;A control character is a
+> character </ins>in either of the ranges 0x00–0x1F or 0x7F–0x9F, both
+> inclusive<del>)</del><ins>&mdash;*end note*]</ins> or to a character in the
+> basic source character set, the program is ill-formed.
+
+Edit 5.13.3 [lex.ccon], paragraph 3 as follows.
+
+> <sup>3</sup> A character literal that begins with `u8`, such as `u8'w'`, is a
+> character literal of type `char`, known as a *UTF-8 character literal*.  The
+> value of a UTF-8 character literal is equal to its ISO 10646 code point
+> value, provided that the code point value is representable with a single
+> UTF-8 code unit <del>(that is, provided it is in the C0 Controls and Basic
+> Latin Unicode block)</del><ins>[*Note*&mdash;this is true if and only if a
+> code point is in the range 0x0-0x7F, inclusive&mdash;*end note*]</ins>. If
+> the value is not representable with a single UTF-8 code unit, the program is
+> ill-formed. A UTF-8 character literal containing multiple *c-chars* is
+> ill-formed.
+
+Edit 5.13.3 [lex.string], paragraph 10 as follows.
+
+> <sup>10</sup> A *string-literal* that begins with `u`, such as `u"asdf"`, is
+> a `char16_t` string literal.  A `char16_t` string literal has type “array of
+> *n* `const char16_t`”, where *n* is the size of the string as defined below;
+> it is initialized with the given characters.  A single *c-char* may produce
+> more than one `char16_t` character in the form of surrogate pairs
+> <ins>[*Note*&mdash; a surrogate pair is a representation representation for a
+> single character as a sequence of two 16-bit code units&mdash;*end
+> note*]</ins>.
+
+Edit 19.8 [cpp.predefined], item (2.4) as follows.
+
+> <sup>(2.4)</sup> &mdash;`__STDC_ISO_10646__`  
+> An integer literal of the form `yyyymmL` (for example,
+> `199712L`).  If this symbol is defined, then every character in the Unicode
+> required set, when stored in an object of type `wchar_t`, has the same value
+> as the <del>short identifier</del><ins>code point</ins> of that character.
+> The Unicode required set consists of all the characters that are defined by
+> ISO/IEC 10646, along with all amendments and technical corrigenda as of the
+> specified year and month.
+
+ [PR #2201]: https://github.com/cplusplus/draft/pull/2201