Skip to content

Commit

Permalink
Add paper to reword stuff related to ISO 10646
Browse files Browse the repository at this point in the history
Addresses sg16-unicode#8
  • Loading branch information
rmartinho committed Jun 28, 2018
1 parent a322cc2 commit c25df77
Showing 1 changed file with 124 additions and 0 deletions.
124 changes: 124 additions & 0 deletions papers/d0000_reword_for_10646.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,124 @@
# Address wording issues related to ISO 10646

Document Number: DnnnnR0
Date: 2018-06-28
Audience: SG16, CWG
Author: R. Martinho Fernandes
Reply-to: cpp@rmf.io

## Motivation

Review of some editorial fixes following the recent update of the normative
reference to ISO 10646 has unearthed a series of wording issues around the
subject. This paper intends to fix those issues by rewording relevant
paragraphs.

## Proposal

This paper addresses all of the following issues:

1. The current wording in [lex.charset] does not specify what the behaviour is
for a universal-character-name without a corresponding short identifier in
ISO 10646.

For example, `\U99004141` and `\U00110000`. Neither of these designates
a code point in ISO 10646, but the standard is silent about this, which
makes the behaviour undefined by omission.

This paper addresses this by making such uses ill-formed, maintaining
consistency with the current treatment of surrogate values (`\U0000D800`
is already ill-formed).

2. The current wording in [lex.charset] uses "hexadecimal value", which is
confusing because a value is just a number, and hexadecimal is just way to
represent numbers; "value" alone should suffice.

This paper addresses this by removing the need for this term.

3. There is some interest in using the U+ notation (as in U+0041 or U+1F34A) to
refer to Unicode code points across the entire standard.

This paper changes all the relevant wording to use U+ notation.

4. The current text includes explanations of terms from ISO 10646 (like
"surrogate code point" or "control character") in normative text, which is
undesirable.

This paper moves such explanations to non-normative text.

## Technical Specifications

In this description, text that should be deleted is marked red and striked out;
text that should be added is marked green and underlined. Apply these changes
on top of the editorial fix provided in [PR #2201].

Edit 5.3 [lex.charset], paragraph 2 as follows.

> <sup>2</sup> The *universal-character-name* construct provides a way to name
> other characters.
>
> &nbsp;&nbsp;&nbsp;&nbsp;*hex-quad:*
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*hexadecimal-digit hexadecimal-digit hexadecimal-digit hexadecimal-digit*
>
> &nbsp;&nbsp;&nbsp;&nbsp;*universal-character-name:*
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*\u hex-quad*
> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;*\U hex-quad hex-quad*
>
> The character designated by the *universal-character-name* `\U00NNNNNN` is
> that character whose <del>character</del><ins>code point</ins> short
> identifier in ISO/IEC 10646 is <del>`NNNNNN`</del><ins>U+NNNNNN</ins>; the
> character designated by the *universal-character-name* `\uNNNN` is that
> character whose <del>character</del><ins>code point</ins> short identifier in
> ISO/IEC 10646 is <del>`NNNN`</del><ins>U+NNNN</ins>. If <del>the hexadecimal
> value for a *universal-character-name* corresponds to a surrogate code point
> (in the range 0xD800–0xDFFF, inclusive)</del><ins>If a
> *universal-character-name* does not correspond to any character in ISO/IEC
> 10646 [*Note*&mdash;ISO/IEC 10646 code points are within the range
> 0x0-0x10FFFF, inclusive.&mdash;*end note*] or if a
> *universal-character-name* corresponds to a surrogate code point
> [*Note*&mdash;A surrogate code point is a value in the range 0xD800-0xDFFF,
> inclusive.&mdash;*end note*]</ins>, the program is ill-formed.
> Additionally, if <del>the hexadecimal value for</del> a
> *universal-character-name* outside the *c-char-sequence*, *s-char-sequence*,
> or *r-char-sequence* of a character or string literal corresponds to a
> control character <del>(</del><ins>[*Note*&mdash;A control character is a
> character </ins>in either of the ranges 0x00–0x1F or 0x7F–0x9F, both
> inclusive<del>)</del><ins>&mdash;*end note*]</ins> or to a character in the
> basic source character set, the program is ill-formed.
Edit 5.13.3 [lex.ccon], paragraph 3 as follows.

> <sup>3</sup> A character literal that begins with `u8`, such as `u8'w'`, is a
> character literal of type `char`, known as a *UTF-8 character literal*. The
> value of a UTF-8 character literal is equal to its ISO 10646 code point
> value, provided that the code point value is representable with a single
> UTF-8 code unit <del>(that is, provided it is in the C0 Controls and Basic
> Latin Unicode block)</del><ins>[*Note*&mdash;this is true if and only if a
> code point is in the range 0x0-0x7F, inclusive&mdash;*end note*]</ins>. If
> the value is not representable with a single UTF-8 code unit, the program is
> ill-formed. A UTF-8 character literal containing multiple *c-chars* is
> ill-formed.
Edit 5.13.3 [lex.string], paragraph 10 as follows.

> <sup>10</sup> A *string-literal* that begins with `u`, such as `u"asdf"`, is
> a `char16_t` string literal. A `char16_t` string literal has type “array of
> *n* `const char16_t`”, where *n* is the size of the string as defined below;
> it is initialized with the given characters. A single *c-char* may produce
> more than one `char16_t` character in the form of surrogate pairs
> <ins>[*Note*&mdash; a surrogate pair is a representation representation for a
> single character as a sequence of two 16-bit code units&mdash;*end
> note*]</ins>.
Edit 19.8 [cpp.predefined], item (2.4) as follows.

> <sup>(2.4)</sup> &mdash;`__STDC_ISO_10646__`
> An integer literal of the form `yyyymmL` (for example,
> `199712L`). If this symbol is defined, then every character in the Unicode
> required set, when stored in an object of type `wchar_t`, has the same value
> as the <del>short identifier</del><ins>code point</ins> of that character.
> The Unicode required set consists of all the characters that are defined by
> ISO/IEC 10646, along with all amendments and technical corrigenda as of the
> specified year and month.
[PR #2201]: https://github.com/cplusplus/draft/pull/2201

0 comments on commit c25df77

Please sign in to comment.