Add draft for UTF-16/UTF-32 literals paper (D1041R0)

Addresses sg16-unicode#6
rmartinho · May 7, 2018 · dde9936
1 parent 19eb956
Showing 1 changed file with 142 additions and 0 deletions.
diff --git a/papers/d1041r0.md b/papers/d1041r0.md
@@ -0,0 +1,142 @@
+# Make char16_t/char32_t string literals be UTF-16/32
+
+Document Number: P1041R0  
+Date: 2018-04-24  
+Audience: Evolution Working Group  
+Reply-to: cpp@rmf.io
+
+## Introduction
+
+C++11 introduced character types suitable for code units of the UTF-16 and
+UTF-32 encoding forms, namely `char16_t` and `char32_t`. Along with this, it
+also introduced new string literals whose types are arrays of those two
+character types, prefixed with `u` and `U`, respectively. Finally, it
+introduced *UTF-8 string literals*, prefixed with `u8`, whose type is array of
+`const char`. Of these three new kinds of string literal, only one has a
+guarantee about the values that the elements of the array have; in other
+words, only one has a guaranteed encoding form: the *UTF-8 string literals*.
+
+The standard text hints that the `char16_t` and `char32_t` string literals are
+intended to be encoded as UTF-16 and UTF-32, respectively, but, unlike for
+*UTF-8 string literals*, it never explicitly states such a requirement.
+
+## Motivation
+
+In defining `char16_t` string literals ([lex.string]/10), the standard makes a
+mention of "surrogate pairs":
+
+> A string-literal that begins with `u`, such as `u"asdf"`, is a `char16_t`
+> string literal.  A `char16_t` string literal has type “array of *n* `const
+> char16_t`”, where *n* is the size of the string as defined below; it is
+> initialized with the given characters. A single *c-char* may produce more
+> than one `char16_t` character in the form of surrogate pairs.
+
+Further down, when defining the size of `char16_t` string literals
+([lex.string]/15), there is another mention of "surrogate pairs":
+
+> The size of a `char16_t` string literal is the total number of escape
+> sequences, *universal-character-names*, and other characters, plus one for
+> each character requiring a surrogate pair, plus one for the terminating
+> `u'\0'`.  [*Note:* The size of a char16_t string literal is the number of
+> code units, not the number of characters. — *end note*]
+
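A minimal sketch of the size rule quoted above, offered as a reviewer's aside rather than as part of the paper. It assumes a C++11 compiler; `code_units` and `check_char16_sizes` are hypothetical helper names introduced only for this illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of array elements (code units, including the
// terminator) in a string literal.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) { return N; }

// U+0041 needs one 16-bit code unit: 1 + terminator.
static_assert(code_units(u"A") == 2, "one code unit plus the terminator");

// U+1F600 lies outside the basic multilingual plane, so the quoted wording
// counts one extra element for the surrogate pair: 2 + terminator.
static_assert(code_units(u"\U0001F600") == 3,
              "surrogate pair plus the terminator");

// The same observation at run time, using sizeof on a named array.
inline void check_char16_sizes() {
    const char16_t face[] = u"\U0001F600";
    assert(sizeof(face) / sizeof(face[0]) == 3);
    assert(face[2] == u'\0');
}
```

Note that only the sizes are checked here; the quoted wording fixes the number of elements but says nothing about which values they hold.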
+For `char32_t` string literals, the definition of their size ([lex.string]/15)
+essentially limits the encoding form used to one that doesn't have more than
+one code unit per character:
+
+> The size of a `char32_t` or wide string literal is the total number of escape
+> sequences, *universal-character-names*, and other characters, plus one for
+> the terminating `U'\0'` or `L'\0'`.
+
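As a rough illustration of the `char32_t` size rule quoted above, the following sketch checks that a character outside the basic multilingual plane still occupies exactly one element. It assumes a C++11 compiler; `length_before_terminator` and `check_char32_sizes` are hypothetical names used only here.

```cpp
#include <cassert>
#include <cstddef>

// One element for U+1F600 plus the terminating U'\0'.
static_assert(sizeof(U"\U0001F600") / sizeof(char32_t) == 2,
              "one element per character plus the terminator");

// Hypothetical helper: number of elements before the terminator.
inline std::size_t length_before_terminator(const char32_t* s) {
    std::size_t n = 0;
    while (s[n] != U'\0') ++n;
    return n;
}

inline void check_char32_sizes() {
    // 'a' and U+1F600 are two characters, hence two elements.
    assert(length_before_terminator(U"a\U0001F600") == 2);
}
```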
+Additionally, the standard constrains the range of *universal-character-names*
+to the range that is supported by all of the UTF encoding forms discussed here:
+
+> Within `char32_t` and `char16_t` string literals, any
+> *universal-character-names* shall be within the range `0x0` to `0x10FFFF`.
+
+All of these requirements, while never explicitly naming the UTF-16 or UTF-32
+encoding forms, strongly imply that these are the encoding forms intended.
+Furthermore, it would be questionable for an implementation to pick any other
+encoding forms for these string literals: there is no well-known encoding form
+that uses a concept named "surrogate pair" other than UTF-16, and there is no
+well-known encoding form that encodes each character as a single 32-bit code
+unit other than UTF-32.
+
+In practice, all implementations use UTF-16 and UTF-32 for these string
+literals. C++ should standardize this practice and make these requirements
+explicit instead of just hinting at them.
+
+## Proposal
+
+This proposal renames "`char16_t` string literals" and "`char32_t` string
+literals" to "UTF-16 string literals" and "UTF-32 string literals", to match
+the existing "UTF-8 string literals", and explicitly requires each element of
+the object representation of those literals to have the value of the
+corresponding code unit of the UTF-16 or UTF-32 encoding of the given characters.
+
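To make the proposed guarantee concrete, here is a hedged sketch that encodes a code point as UTF-16 by hand and compares the result with the elements of a `u""` literal. The type `utf16_units` and the functions `encode_utf16` and `check_emoji_literal` are hypothetical and exist only for this illustration. The helper does no validation (it assumes its argument is a valid code point that is not itself a surrogate), and the comparison relies on the implementation already using UTF-16 for these literals, which is the existing practice the paper describes.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical result type: the UTF-16 code units for one code point.
struct utf16_units {
    char16_t units[2];
    std::size_t count;
};

// Hypothetical helper: UTF-16 encoding of a single code point.
// Assumes cp <= 0x10FFFF and that cp is not in the surrogate range.
constexpr utf16_units encode_utf16(char32_t cp) {
    return cp < 0x10000
        ? utf16_units{{static_cast<char16_t>(cp), 0}, 1}
        : utf16_units{{static_cast<char16_t>(0xD800 + ((cp - 0x10000) >> 10)),
                       static_cast<char16_t>(0xDC00 + ((cp - 0x10000) & 0x3FF))},
                      2};
}

// Compares the literal's elements with the hand-computed code units.
// This holds on implementations that encode u"" literals as UTF-16.
inline void check_emoji_literal() {
    const char16_t literal[] = u"\U0001F600";
    const utf16_units expected = encode_utf16(0x1F600);
    assert(expected.count == 2);
    assert(literal[0] == expected.units[0]);
    assert(literal[1] == expected.units[1]);
}
```

Under the proposed wording, the comparison in `check_emoji_literal` would be required to hold rather than merely being common behavior.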
+## Technical Specifications
+
+ - Change [lex.string]/10:
+
+    > A *string-literal* that begins with `u`, such as `u"asdf"`, is a
+    > <del>`char16_t` string literal</del><ins>*UTF-16 string literal*</ins>. A
+    > <del>`char16_t` string literal</del><ins>UTF-16 string literal</ins> has
+    > type “array of *n* `const char16_t`”, where *n* is the size of the string
+    > as defined below; it is initialized with the given characters. A single
+    > *c-char* may produce more than one `char16_t` character in the form of
+    > surrogate pairs.
+
+ - Change [lex.string]/11:
+
+    > A *string-literal* that begins with `U`, such as `U"asdf"`, is a
+    > <del>`char32_t` string literal</del><ins>*UTF-32 string literal*</ins>.
+    > A <del>`char32_t` string literal</del><ins>UTF-32 string literal</ins>
+    > has type “array of *n* `const char32_t`”, where *n* is the size of the
+    > string as defined below; it is initialized with the given characters.
+
+ - Insert a paragraph between [lex.string]/10 and /11:
+
+    > <ins>For a UTF-16 string literal, each successive element of the object
+    > representation has the value of the corresponding code unit of the UTF-16
+    > encoding of the string.</ins>
+
+ - Insert a paragraph between [lex.string]/11 and /12:
+
+    > <ins>For a UTF-32 string literal, each successive element of the object
+    > representation has the value of the corresponding code unit of the UTF-32
+    > encoding of the string.</ins>
+
+ - Change [lex.ccon]/4:
+
+    > A character literal that begins with the letter `u`, such as `u'x'`, is a
+    > character literal of type `char16_t`<ins>, known as a *UTF-16 character
+    > literal*</ins>. The value of a <del>`char16_t`</del><ins>UTF-16</ins>
+    > character literal containing a single *c-char* is equal to its ISO 10646
+    > code point value, provided that the code point value is representable
+    > with a single 16-bit code unit (that is, provided it is in the basic
+    > multi-lingual plane). If the value is not representable with a single
+    > 16-bit code unit, the program is ill-formed. A
+    > <del>`char16_t`</del><ins>UTF-16</ins> character literal containing
+    > multiple *c-char*s is ill-formed.
+
+ - Change [lex.ccon]/5:
+
+    > A character literal that begins with the letter `U`, such as `U'y'`, is a
+    > character literal of type `char32_t`<ins>, known as a *UTF-32 character
+    > literal*</ins>. The value of a
+    > <del>`char32_t`</del><ins>UTF-32</ins> character literal containing a
+    > single *c-char* is equal to its ISO 10646 code point value. A
+    > <del>`char32_t`</del><ins>UTF-32</ins> character literal containing
+    > multiple *c-char*s is ill-formed.
+
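The character-literal rules in [lex.ccon]/4 and /5 can be sketched as follows. This is an illustration under the assumption of a C++11 compiler; `fits_in_char16_literal` and `check_character_literals` are hypothetical names, not anything defined by the standard or by this paper.

```cpp
#include <cassert>

// A u'' literal holds the code point value when one 16-bit unit suffices.
static_assert(u'\u00E9' == 0x00E9, "U+00E9 is in the basic multilingual plane");

// A U'' literal holds the code point value directly.
static_assert(U'\U0001F600' == 0x1F600, "code point value stored as-is");

// Ill-formed under the quoted wording, because U+1F600 would need a
// surrogate pair:
// constexpr char16_t too_wide = u'\U0001F600';

// Hypothetical helper mirroring the "representable with a single 16-bit
// code unit" condition; surrogate code points are excluded as well.
constexpr bool fits_in_char16_literal(char32_t cp) {
    return cp <= 0xFFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

inline void check_character_literals() {
    assert(fits_in_char16_literal(U'\u00E9'));
    assert(!fits_in_char16_literal(U'\U0001F600'));
}
```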
+## Interaction with other papers
+
+Currently, the standard lacks a normative reference to UTF-16 and UTF-32;
+however, it also lacks such a reference for UTF-8. This paper assumes that
+this problem will be fixed for all three encodings in another paper,
+potentially
+[D1025R0](https://github.com/sg16-unicode/sg16/blob/master/papers/D1025R0.md)
+(*Update The Reference To The Unicode Standard*).
+
+This paper was also written so as to not conflict with
+[P0482R2](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0482r2.html)
+(*char8_t: A type for UTF-8 characters and strings (Revision 2)*).