Skip to content

Commit

Permalink
Add draft for UTF-16/UTF-32 literals paper (D1041R0)
Browse files Browse the repository at this point in the history
Addresses sg16-unicode#6
  • Loading branch information
R. Martinho Fernandes committed May 7, 2018
1 parent 19eb956 commit dde9936
Showing 1 changed file with 142 additions and 0 deletions.
142 changes: 142 additions & 0 deletions papers/d1041r0.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,142 @@
# Make char16_t/char32_t string literals be UTF-16/32

Document Number: P1041R0
Date: 2018-04-24
Audience: Evolution Working Group
Reply-to: cpp@rmf.io

## Introduction

C++11 introduced character types suitable for code units of the UTF-16 and
UTF-32 encoding forms, namely `char16_t` and `char32_t`. Along with this, it
also introduced new string literals whose types are arrays of those two
character types, prefixed with `u` and `U`, respectively. And last but not
least, it also introduced *UTF-8 string literals*, prefixed with `u8`, with
types arrays of `const char`. Of these three new string literal types, only one
has a guarantee about the values that the elements of the array have; in other
words, only one has a guaranteed encoding form, the *UTF-8 string literals*.

The standard text hints that the `char16_t` and `char32_t` string literals are
intended to be encoded as, respectively, UTF-16 and UTF-32, but unlike it does
for *UTF-8 string literals*, it never explicitly makes such a requirement.

## Motivation

In defining `char16_t` string literals ([lex.string]/10), the standard makes a
mention of "surrogate pairs":

> A string-literal that begins with `u`, such as `u"asdf"`, is a `char16_t`
> string literal. A `char16_t` string literal has type “array of *n* `const
> char16_t`”, where *n* is the size of the string as defined below; it is
> initialized with the given characters. A single *c-char* may produce more
> than one `char16_t` character in the form of surrogate pairs.
Further down, when defining the size of `char16_t` string literals
([lex.string]/15), there is another mention of "surrogate pairs":

> The size of a `char16_t` string literal is the total number of escape
> sequences, *universal-character-names*, and other characters, plus one for
> each character requiring a surrogate pair, plus one for the terminating
> `u'\0'`. [*Note:* The size of a char16_­t string literal is the number of
> code units, not the number of characters. — *end note*]
For `char32_t` string literals, the definition of their size (\[lex.string]/15)
essentially limits the encoding form used to one that doesn't have more than
one code unit per character:

> The size of a `char32_t` or wide string literal is the total number of escape
> sequences, *universal-character-names*, and other characters, plus one for
> the terminating `U'\0'` or `L'\0'`.
Additionally, the standard constrains the range of *universal-character-names*
to the range that is supported by all of the UTF encoding forms discussed here:

> Within `char32_t` and `char16_t` string literals, any
> *universal-character-names* shall be within the range `0x0` to `0x10FFFF`.
All of these requirements, while never explicitly naming the UTF-16 or UTF-32
encoding forms, strongly imply that these are the encoding forms intended.
Furthermore, it would be questionable for an implementation to pick any other
encoding forms for these string literals: there is no well-known encoding form
that uses a concept named "surrogate pair" other than UTF-16, and there is no
well-known encoding form that encodes each character as a single 32-bit code
unit other than UTF-32.

In practice, all implementations use UTF-16 and UTF-32 for these string
literals. C++ should standardize this practice and make these requirements
explicit instead of just hinting at them.

## Proposal

This proposal renames "`char16_t` string literals" and "`char32_t` string
literals" to "UTF-16 string literals" and "UTF-32 string literals", to match
the existing "UTF-8 string literals", and explicitly requires the object
representations of those literals to be the values that correspond to the
UTF-16 and UTF-32 (respectively) encodings of the given characters.

## Technical Specifications

- Add to [lex.string]/10:

> A *string-literal* that begins with `u`, such as `u"asdf"`, is a
> <del>`char16_t` string literal</del><ins>*UTF-16 string literal*</ins>. A
> <del>`char16_t` string literal</del><ins>UTF-16 string literal</ins> has
> type “array of *n* `const char16_t`”, where *n* is the size of the string
> as defined below; it is initialized with the given characters. A single
> *c-char* may produce more than one `char16_t` character in the form of
> surrogate pairs.
- Change [lex.string]/11:

> A *string-literal* that begins with `U`, such as `U"asdf"`, is a
> <del>`char32_t` string literal</del><ins>*UTF-32 string literal*</ins>.
> A <del>`char32_t` string literal</del><ins>UTF-32 string literal</ins>
> has type “array of *n* `const char32_t`”, where *n* is the size of the
> string as defined below; it is initialized with the given characters.
- Insert a paragraph between [lex.string]/10 and /11:

> <ins>For a UTF-16 string literal, each successive element of the object
> representation has the value of the corresponding code unit of the UTF-16
> encoding of the string.</ins>
- Insert a paragraph between [lex.string]/11 and /12:

> <ins>For a *UTF-32 string literal*, each successive element of the object
> representation has the value of the corresponding code unit of the UTF-32
> encoding of the string.</ins>
- Change [lex.ccon]/4:

> A character literal that begins with the letter `u`, such as `u'x'`, is a
> character literal of type `char16_t`<ins>, known as a *UTF-8 character
> literal*</ins>. The value of a <del>`char16_t`</del><ins>UTF-16</ins>
> character literal containing a single *c-char* is equal to its ISO 10646
> code point value, provided that the code point value is representable
> with a single 16-bit code unit (that is, provided it is in the basic
> multi-lingual plane). If the value is not representable with a single
> 16-bit code unit, the program is ill-formed. A
> <del>`char16_t`</del><ins>UTF-16</ins> character literal containing
> multiple *c-char*s is ill-formed.
- Change [lex.ccon]/5:

> A character literal that begins with the letter `U`, such as `U'y'`, is a
> character literal of type `char32_t`. The value of a
> <del>`char32_­t`</del><ins>UTF-32</ins> character literal containing a
> single *c-char* is equal to its ISO 10646 code point value. A
> <del>`char32_­t`</del><ins>UTF-32</ins> character literal containing
> multiple *c-char*s is ill-formed.
## Interaction with other papers

Currently, the standard lacks a normative reference to UTF-16, and UTF-32;
however, it also lacks one such reference for UTF-8. This paper assumes that
this problem will be fixed for all three encodings in another paper,
potentially
[D1025R0](https://github.com/sg16-unicode/sg16/blob/master/papers/D1025R0.md)
(*Update The Reference To The Unicode Standard*).

This paper was also written so as to not conflict with
[P0482R2](http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0482r2.html)
(*char8_t: A type for UTF-8 characters and strings (Revision 2)*).

0 comments on commit dde9936

Please sign in to comment.