Add draft for UTF-16/UTF-32 literals paper (D1041R0)
Addresses sg16-unicode#6
R. Martinho Fernandes committed May 7, 2018 (commit dde9936, 1 parent: 19eb956)
# Make char16_t/char32_t string literals be UTF-16/32

Document Number: P1041R0
Date: 2018-04-24
Audience: Evolution Working Group
Reply-to: cpp@rmf.io

## Introduction

C++11 introduced character types suitable for code units of the UTF-16 and
UTF-32 encoding forms, namely `char16_t` and `char32_t`. Along with these, it
introduced new string literals whose types are arrays of those two
character types, prefixed with `u` and `U`, respectively. Finally, it
introduced *UTF-8 string literals*, prefixed with `u8`, whose types are
arrays of `const char`. Of these three new kinds of string literal, only one
has a guarantee about the values that the elements of the array have; in other
words, only one has a guaranteed encoding form: the *UTF-8 string literal*.

The standard text hints that the `char16_t` and `char32_t` string literals are
intended to be encoded as UTF-16 and UTF-32, respectively, but, unlike for
*UTF-8 string literals*, it never explicitly states such a requirement.

## Motivation

In defining `char16_t` string literals ([lex.string]/10), the standard
mentions "surrogate pairs":

> A string-literal that begins with `u`, such as `u"asdf"`, is a `char16_t`
> string literal. A `char16_t` string literal has type “array of *n*
> `const char16_t`”, where *n* is the size of the string as defined below; it
> is initialized with the given characters. A single *c-char* may produce more
> than one `char16_t` character in the form of surrogate pairs.

Further down, when defining the size of `char16_t` string literals
([lex.string]/15), there is another mention of "surrogate pairs":

> The size of a `char16_t` string literal is the total number of escape
> sequences, *universal-character-names*, and other characters, plus one for
> each character requiring a surrogate pair, plus one for the terminating
> `u'\0'`. [*Note:* The size of a `char16_t` string literal is the number of
> code units, not the number of characters. — *end note*]

For `char32_t` string literals, the definition of their size ([lex.string]/15)
essentially limits the encoding form used to one that doesn't have more than
one code unit per character:

> The size of a `char32_t` or wide string literal is the total number of escape
> sequences, *universal-character-names*, and other characters, plus one for
> the terminating `U'\0'` or `L'\0'`.

Additionally, the standard constrains the range of *universal-character-names*
to the range that is supported by all of the UTF encoding forms discussed here:

> Within `char32_t` and `char16_t` string literals, any
> *universal-character-names* shall be within the range `0x0` to `0x10FFFF`.

All of these requirements, while never explicitly naming the UTF-16 or UTF-32
encoding forms, strongly imply that these are the encoding forms intended.
Furthermore, it would be questionable for an implementation to pick any other
encoding forms for these string literals: there is no well-known encoding form
that uses a concept named "surrogate pair" other than UTF-16, and there is no
well-known encoding form that encodes each character as a single 32-bit code
unit other than UTF-32.

In practice, all implementations use UTF-16 and UTF-32 for these string
literals. C++ should standardize this practice and make these requirements
explicit instead of just hinting at them.

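The size rules quoted in the Motivation section can be checked at compile time. The following is a minimal sketch and not part of the paper; `code_units` is a hypothetical helper introduced here for illustration. The assertions rely only on the size definitions in [lex.string]/15, not on the values stored in the array elements.

```cpp
#include <cstddef>

// Hypothetical helper: number of array elements (code units, including the
// terminating null) of a string literal.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) { return N; }

// U+00E9 is in the basic multilingual plane: one code unit plus terminator.
static_assert(code_units(u"\u00E9") == 2, "BMP character: one code unit");

// U+1F600 is outside the BMP: "plus one for each character requiring a
// surrogate pair", so two code units plus terminator.
static_assert(code_units(u"\U0001F600") == 3, "surrogate pair: two code units");

// In a char32_t literal every character contributes exactly one element.
static_assert(code_units(U"\U0001F600") == 2, "one code unit per character");
static_assert(code_units(U"a\U0001F600") == 3, "one code unit per character");
```

The same character therefore yields a different array size depending on the literal prefix, which is what the note about "code units, not the number of characters" is pointing at.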
## Proposal

This proposal renames "`char16_t` string literals" and "`char32_t` string
literals" to "UTF-16 string literals" and "UTF-32 string literals", to match
the existing "UTF-8 string literals", and explicitly requires the object
representations of those literals to be the values that correspond to the
UTF-16 and UTF-32 (respectively) encodings of the given characters.

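What the proposed guarantee would mean for element values can be sketched as follows. This is an illustration only: `Utf16Units` and `encode_utf16` are hypothetical names that do not appear in the paper, and the encoder assumes its input is a valid Unicode scalar value (it does not reject surrogate code points). The assertions on literal contents hold on implementations that already use UTF-16 and UTF-32; the current standard text does not yet guarantee them.

```cpp
#include <cstddef>

// Hypothetical result type: the UTF-16 code units of a single code point.
struct Utf16Units {
    char16_t unit[2];
    std::size_t count;
};

// Hypothetical encoder. Assumes cp is a valid scalar value (<= 0x10FFFF and
// not a surrogate); no validation is performed.
constexpr Utf16Units encode_utf16(char32_t cp) {
    return cp < 0x10000
        ? Utf16Units{{static_cast<char16_t>(cp), 0}, 1}
        : Utf16Units{{static_cast<char16_t>(0xD800 + ((cp - 0x10000) >> 10)),
                      static_cast<char16_t>(0xDC00 + ((cp - 0x10000) & 0x3FF))},
                     2};
}

constexpr Utf16Units expected = encode_utf16(0x1F600);
constexpr char16_t smiley16[] = u"\U0001F600";
constexpr char32_t smiley32[] = U"\U0001F600";

// Under the proposal, each element equals the corresponding UTF-16 code unit.
static_assert(expected.count == 2, "U+1F600 needs a surrogate pair");
static_assert(smiley16[0] == expected.unit[0], "high surrogate 0xD83D");
static_assert(smiley16[1] == expected.unit[1], "low surrogate 0xDE00");

// For UTF-32 the single code unit is the code point value itself.
static_assert(smiley32[0] == 0x1F600, "UTF-32 code unit equals code point");
```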
## Technical Specifications

- Change [lex.string]/10:

> A *string-literal* that begins with `u`, such as `u"asdf"`, is a
> <del>`char16_t` string literal</del><ins>*UTF-16 string literal*</ins>. A
> <del>`char16_t` string literal</del><ins>UTF-16 string literal</ins> has
> type “array of *n* `const char16_t`”, where *n* is the size of the string
> as defined below; it is initialized with the given characters. A single
> *c-char* may produce more than one `char16_t` character in the form of
> surrogate pairs.

- Change [lex.string]/11:

> A *string-literal* that begins with `U`, such as `U"asdf"`, is a
> <del>`char32_t` string literal</del><ins>*UTF-32 string literal*</ins>.
> A <del>`char32_t` string literal</del><ins>UTF-32 string literal</ins>
> has type “array of *n* `const char32_t`”, where *n* is the size of the
> string as defined below; it is initialized with the given characters.

- Insert a paragraph between [lex.string]/10 and /11:

> <ins>For a UTF-16 string literal, each successive element of the object
> representation has the value of the corresponding code unit of the UTF-16
> encoding of the string.</ins>

- Insert a paragraph between [lex.string]/11 and /12:

> <ins>For a *UTF-32 string literal*, each successive element of the object
> representation has the value of the corresponding code unit of the UTF-32
> encoding of the string.</ins>

- Change [lex.ccon]/4:

> A character literal that begins with the letter `u`, such as `u'x'`, is a
> character literal of type `char16_t`<ins>, known as a *UTF-16 character
> literal*</ins>. The value of a <del>`char16_t`</del><ins>UTF-16</ins>
> character literal containing a single *c-char* is equal to its ISO 10646
> code point value, provided that the code point value is representable
> with a single 16-bit code unit (that is, provided it is in the basic
> multi-lingual plane). If the value is not representable with a single
> 16-bit code unit, the program is ill-formed. A
> <del>`char16_t`</del><ins>UTF-16</ins> character literal containing
> multiple *c-char*s is ill-formed.

- Change [lex.ccon]/5:

> A character literal that begins with the letter `U`, such as `U'y'`, is a
> character literal of type `char32_t`. The value of a
> <del>`char32_t`</del><ins>UTF-32</ins> character literal containing a
> single *c-char* is equal to its ISO 10646 code point value. A
> <del>`char32_t`</del><ins>UTF-32</ins> character literal containing
> multiple *c-char*s is ill-formed.

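The character literal rules in [lex.ccon]/4 and /5 can be illustrated with a short sketch. It is not part of the proposed wording, and `in_basic_multilingual_plane` is a hypothetical helper that simply restates the "representable with a single 16-bit code unit" condition.

```cpp
// Hypothetical helper: true if the code point lies in the basic multilingual
// plane, i.e. its value fits in a single 16-bit code unit.
constexpr bool in_basic_multilingual_plane(char32_t cp) {
    return cp <= 0xFFFF;
}

// A u'...' literal with a single BMP character has that code point as value.
static_assert(in_basic_multilingual_plane(0x00E9), "U+00E9 is in the BMP");
static_assert(u'\u00E9' == 0x00E9, "value equals the ISO 10646 code point");

// A U'...' literal can hold any code point, including those outside the BMP.
static_assert(!in_basic_multilingual_plane(0x1F600), "U+1F600 is not in the BMP");
static_assert(U'\U0001F600' == 0x1F600, "value equals the ISO 10646 code point");

// Both of the following are ill-formed under the quoted wording, so they are
// left commented out:
//   constexpr char16_t a = u'\U0001F600';  // not representable in 16 bits
//   constexpr char32_t b = U'ab';          // multiple c-chars
```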
## Interaction with other papers

Currently, the standard lacks a normative reference to UTF-16 and UTF-32;
however, it also lacks such a reference for UTF-8. This paper assumes that
this problem will be fixed for all three encodings in another paper,
potentially
[D1025R0](https://github.com/sg16-unicode/sg16/blob/master/papers/D1025R0.md)
(*Update The Reference To The Unicode Standard*).

This paper was also written so as not to conflict with
[P0482R2](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0482r2.html)
(*char8_t: A type for UTF-8 characters and strings (Revision 2)*).