diff --git a/papers/d1041r0.md b/papers/d1041r0.md
new file mode 100644
index 0000000..f872630
--- /dev/null
+++ b/papers/d1041r0.md
@@ -0,0 +1,142 @@
+# Make char16_t/char32_t string literals be UTF-16/32
+
+Document Number: P1041R0
+Date: 2018-04-24
+Audience: Evolution Working Group
+Reply-to: cpp@rmf.io
+
+## Introduction
+
+C++11 introduced character types suitable for code units of the UTF-16 and
+UTF-32 encoding forms, namely `char16_t` and `char32_t`. Along with this, it
+also introduced new string literals whose types are arrays of those two
+character types, prefixed with `u` and `U`, respectively. And last but not
+least, it also introduced *UTF-8 string literals*, prefixed with `u8`, whose
+types are arrays of `const char`. Of these three new string literal types, only
+one has a guarantee about the values that the elements of the array have; in
+other words, only the *UTF-8 string literals* have a guaranteed encoding form.
+
+The standard text hints that the `char16_t` and `char32_t` string literals are
+intended to be encoded as, respectively, UTF-16 and UTF-32, but, unlike for
+*UTF-8 string literals*, it never explicitly states such a requirement.
+
+## Motivation
+
+In defining `char16_t` string literals ([lex.string]/10), the standard
+mentions "surrogate pairs":
+
+> A string-literal that begins with `u`, such as `u"asdf"`, is a `char16_t`
+> string literal. A `char16_t` string literal has type “array of *n*
+> `const char16_t`”, where *n* is the size of the string as defined below; it
+> is initialized with the given characters. A single *c-char* may produce more
+> than one `char16_t` character in the form of surrogate pairs.
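As an illustration of the surrogate-pair wording quoted above, the following sketch counts the elements of two `char16_t` string literals at compile time. `code_unit_count` is a hypothetical helper introduced here for illustration only; it is not part of the proposal or of the standard library.

```cpp
#include <cstddef>

// Hypothetical helper (an assumption of this sketch, not from the paper):
// yields the number of array elements of a char16_t string literal,
// including the terminating u'\0'.
template <std::size_t N>
constexpr std::size_t code_unit_count(const char16_t (&)[N]) noexcept {
    return N;
}

// U+0041 is in the basic multilingual plane: one code unit plus terminator.
static_assert(code_unit_count(u"A") == 2,
              "one code unit plus the terminator");

// U+1F600 is outside the basic multilingual plane, so this single c-char
// requires a surrogate pair: two code units plus terminator.
static_assert(code_unit_count(u"\U0001F600") == 3,
              "a surrogate pair plus the terminator");
```

Both assertions follow from the size rule in [lex.string]; what the standard leaves unstated is the value of each counted element.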
+
+Further down, when defining the size of `char16_t` string literals
+([lex.string]/15), there is another mention of "surrogate pairs":
+
+> The size of a `char16_t` string literal is the total number of escape
+> sequences, *universal-character-names*, and other characters, plus one for
+> each character requiring a surrogate pair, plus one for the terminating
+> `u'\0'`. [*Note:* The size of a `char16_t` string literal is the number of
+> code units, not the number of characters. — *end note*]
+
+For `char32_t` string literals, the definition of their size ([lex.string]/15)
+essentially limits the encoding form used to one that doesn't have more than
+one code unit per character:
+
+> The size of a `char32_t` or wide string literal is the total number of escape
+> sequences, *universal-character-names*, and other characters, plus one for
+> the terminating `U'\0'` or `L'\0'`.
+
+Additionally, the standard constrains the range of *universal-character-names*
+to the range that is supported by all of the UTF encoding forms discussed here:
+
+> Within `char32_t` and `char16_t` string literals, any
+> *universal-character-names* shall be within the range `0x0` to `0x10FFFF`.
+
+All of these requirements, while never explicitly naming the UTF-16 or UTF-32
+encoding forms, strongly imply that these are the encoding forms intended.
+Furthermore, it would be questionable for an implementation to pick any other
+encoding forms for these string literals: there is no well-known encoding form
+that uses a concept named "surrogate pair" other than UTF-16, and there is no
+well-known encoding form that encodes each character as a single 32-bit code
+unit other than UTF-32.
+
+In practice, all implementations use UTF-16 and UTF-32 for these string
+literals. C++ should standardize this practice and make these requirements
+explicit instead of just hinting at them.
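The existing practice described above can be checked at compile time. The sketch below assumes an implementation that already encodes `u` literals as UTF-16 and `U` literals as UTF-32; under the wording that predates this proposal, those element values are implied rather than guaranteed. `lead_surrogate` and `trail_surrogate` are hypothetical helpers written for this illustration, following the UTF-16 surrogate arithmetic defined by the Unicode Standard.

```cpp
// Hypothetical helpers (assumptions of this sketch, not from the paper):
// compute the UTF-16 surrogate pair for a code point above U+FFFF.
constexpr char16_t lead_surrogate(char32_t cp) {
    return static_cast<char16_t>(0xD800 + ((cp - 0x10000) >> 10));
}
constexpr char16_t trail_surrogate(char32_t cp) {
    return static_cast<char16_t>(0xDC00 + ((cp - 0x10000) & 0x3FF));
}

// The same character, U+1F600, written as both kinds of literal.
constexpr char16_t utf16[] = u"\U0001F600";
constexpr char32_t utf32[] = U"\U0001F600";

// Assumed UTF-16: the elements are the surrogate pair, then the terminator.
static_assert(utf16[0] == lead_surrogate(0x1F600), "lead surrogate");
static_assert(utf16[1] == trail_surrogate(0x1F600), "trail surrogate");
static_assert(utf16[2] == u'\0', "terminator");

// Assumed UTF-32: a single code unit equal to the code point.
static_assert(utf32[0] == 0x1F600, "code unit equals the code point");
static_assert(utf32[1] == U'\0', "terminator");
```

If an implementation chose some other encoding form for these literals, the assertions on element values would be the ones to fail, which is the gap the proposal aims to close.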
+
+## Proposal
+
+This proposal renames "`char16_t` string literals" and "`char32_t` string
+literals" to "UTF-16 string literals" and "UTF-32 string literals", to match
+the existing "UTF-8 string literals", and explicitly requires the object
+representations of those literals to be the values that correspond to the
+UTF-16 and UTF-32 (respectively) encodings of the given characters.
+
+## Technical Specifications
+
+- Change [lex.string]/10:
+
+  > A *string-literal* that begins with `u`, such as `u"asdf"`, is a
+  > <del>`char16_t` string literal</del><ins>*UTF-16 string literal*</ins>. A
+  > <del>`char16_t` string literal</del><ins>UTF-16 string literal</ins> has
+  > type “array of *n* `const char16_t`”, where *n* is the size of the string
+  > as defined below; it is initialized with the given characters. A single
+  > *c-char* may produce more than one `char16_t` character in the form of
+  > surrogate pairs.
+
+- Change [lex.string]/11:
+
+  > A *string-literal* that begins with `U`, such as `U"asdf"`, is a
+  > <del>`char32_t` string literal</del><ins>*UTF-32 string literal*</ins>.
+  > A <del>`char32_t` string literal</del><ins>UTF-32 string literal</ins>
+  > has type “array of *n* `const char32_t`”, where *n* is the size of the
+  > string as defined below; it is initialized with the given characters.
+
+- Insert a paragraph between [lex.string]/10 and /11:
+
+  > For a UTF-16 string literal, each successive element of the object
+  > representation has the value of the corresponding code unit of the UTF-16
+  > encoding of the string.
+
+- Insert a paragraph between [lex.string]/11 and /12:
+
+  > For a UTF-32 string literal, each successive element of the object
+  > representation has the value of the corresponding code unit of the UTF-32
+  > encoding of the string.
+
+- Change [lex.ccon]/4:
+
+  > A character literal that begins with the letter `u`, such as `u'x'`, is a
+  > character literal of type `char16_t`<ins>, known as a *UTF-16 character
+  > literal*</ins>.
+  > The value of a <del>`char16_t`</del><ins>UTF-16</ins>
+  > character literal containing a single *c-char* is equal to its ISO 10646
+  > code point value, provided that the code point value is representable
+  > with a single 16-bit code unit (that is, provided it is in the basic
+  > multi-lingual plane). If the value is not representable with a single
+  > 16-bit code unit, the program is ill-formed. A
+  > <del>`char16_t`</del><ins>UTF-16</ins> character literal containing
+  > multiple *c-char*s is ill-formed.
+
+- Change [lex.ccon]/5:
+
+  > A character literal that begins with the letter `U`, such as `U'y'`, is a
+  > character literal of type `char32_t`<ins>, known as a *UTF-32 character
+  > literal*</ins>. The value of a
+  > <del>`char32_t`</del><ins>UTF-32</ins> character literal containing a
+  > single *c-char* is equal to its ISO 10646 code point value. A
+  > <del>`char32_t`</del><ins>UTF-32</ins> character literal containing
+  > multiple *c-char*s is ill-formed.
+
+## Interaction with other papers
+
+Currently, the standard lacks normative references to UTF-16 and UTF-32;
+however, it also lacks such a reference for UTF-8. This paper assumes that
+this problem will be fixed for all three encodings in another paper,
+potentially
+[D1025R0](https://github.com/sg16-unicode/sg16/blob/master/papers/D1025R0.md)
+(*Update The Reference To The Unicode Standard*).
+
+This paper was also written so as to not conflict with
+[P0482R2](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0482r2.html)
+(*char8_t: A type for UTF-8 characters and strings (Revision 2)*).