Allow non-ASCII identifiers #2457

Merged on Oct 29, 2018 (26 commits; this view shows changes from 2 commits)

Commits
ec728b3
Initial draft of unicode-idents RFC
pyfisch Jun 3, 2018
4c1bda9
Include expected Usage Notes and minor changes
pyfisch Jun 4, 2018
619f5b4
Improve descriptions and fix typos
pyfisch Jun 4, 2018
142d0bc
ACII -> ASCII
pyfisch Jun 4, 2018
6b2a94a
Typos, renames and a minor reference change
pyfisch Jun 7, 2018
3e19d26
Update Reference-level explanation
pyfisch Jun 8, 2018
a4830a1
Consider identifiers for confusable detection
pyfisch Jun 9, 2018
12d0623
Note difference between Python and Rust
pyfisch Jun 10, 2018
79bbc8e
Remove mention of scope from guide explanation
pyfisch Jun 10, 2018
41f0723
Rename confusable_non_ascii_idents to confusable_idents
pyfisch Jun 10, 2018
3c96d81
Conformance statement
pyfisch Jun 12, 2018
940dab5
Remove stray "is"
pyfisch Jun 12, 2018
da43d09
Add that non-ASCII idents observe UAX31-R3
pyfisch Jun 15, 2018
0e0ca66
Add details for fs, extern, lints
pyfisch Jun 16, 2018
935c917
Add two questions about debuggers and name mangling
pyfisch Jul 10, 2018
8d548d4
Add exotic codepoint detection and mixed script lints
pyfisch Aug 15, 2018
9356fc1
+ Reusability
Manishearth Oct 15, 2018
40d53f5
Global mixed script confusables lint
Manishearth Oct 15, 2018
7732810
notable code points for less_used_codepoints
Manishearth Oct 15, 2018
e3f3692
Mention user-supplied strings
Manishearth Oct 16, 2018
d389a9c
Add unresolved Q regarding const pat confusion (rust-lang/rust#7526).
pnkfelix Oct 19, 2018
70297a9
Remove old mixed scripts lints
Manishearth Oct 19, 2018
9bf90df
Add new mixed_script_confusables lint
Manishearth Oct 19, 2018
a6da03a
Add unresolved questions for RTL and terminal width
Manishearth Oct 20, 2018
c4dff64
Allow bare underscore identifiers
Manishearth Oct 20, 2018
0c78631
RFC 2457
Centril Oct 29, 2018
162 changes: 162 additions & 0 deletions text/0000-unicode-idents.md
@@ -0,0 +1,162 @@
- Feature Name: unicode_idents
- Start Date: 2018-06-03
- RFC PR: (leave this empty)
- Rust Issue: (leave this empty)

# Summary
[summary]: #summary

Allow non-ASCII letters (such as accented characters, Cyrillic, Greek, Kanji, etc.) in Rust identifiers.

# Motivation
[motivation]: #motivation

Rust is written by many people who are not fluent in the English language. Using identifiers in one's native language eases writing and reading code for these developers.

Some languages use U+200C ZERO WIDTH NON-JOINER and U+200D ZERO WIDTH JOINER, which are not in XID_Continue. Should they be allowed too? See section 2.3 of UAX #31.

To have proper support they should be allowed, as they affect the rendering of text, which may give it a different meaning (I'm not sure, but it seems likely to me). However, this can cause issues: in C++ and Swift, for example, identifiers that differ only in the number of zero-width joiners or spaces are distinct identifiers, and that leads to a readability nightmare.

That is why UAX #31 suggests allowing them only in certain positions, to avoid having multiple distinct identifiers look identical.

Member

So AIUI these are used for:

  • Forcing explicit viramas in Indic scripts in consonant clusters. There is a semantic difference but it's exceedingly minor and mostly comes into play for Sanskrit; AIUI it's used in certain kinds of compound words and is very easily omitted

  • Certain vowel presentational forms in Bengali and Oriya. Also minor.

  • Forcing letters to take different (word-medial, etc) forms in the Perso-Arabic script, used for:

    • Arabic affixes, when shown in isolation or when used with non-arabic words. Imagine you had to write something like "Rust's" where "Rust" is in the Latin script but the 's is in Arabic, you need the Arabic suffix to not use a word-initial text form. An example of this is the ب prefix preposition

    • Arabic abbreviations

AFAICT these are all somewhat optional (though preferred). The abbreviations one might be the most used.

(that said, there are much larger problems with using an RTL script in rust)

ZWNJ has important use in Persian at least https://en.m.wikipedia.org/wiki/Zero-width_non-joiner


The rationale from [PEP 3131] nicely explains it:

> ~~Python~~ *Rust* code is written by many people in the world who are not familiar with the English language, or even well-acquainted with the Latin writing system. Such developers often desire to define classes and functions with names in their native languages, rather than having to come up with an (often incorrect) English translation of the concept they want to name. By using identifiers in their native language, code clarity and maintainability of the code among speakers of that language improves.
>
> For some languages, common transliteration systems exist (in particular, for the Latin-based writing systems). For other languages, users have larger difficulties to use Latin to write their native words.

Additionally, some math-oriented projects may want to use identifiers closely resembling mathematical writing.

NFKC removes the mathematical distinction of font style, as in ⟨ℍ⟩ vs. ⟨ℋ⟩.

Contributor Author

@dscorbett yeah, the idea is that while you can use stylized characters you should not go completely overboard and use all of them in the same formula. This also applies to the characters distinguished in natural language orthographies.


# Guide-level explanation
[guide-level-explanation]: #guide-level-explanation

Identifiers include variable names, function and trait names and module names. They start with a letter or an underscore and may be followed by more letters, digits and some connecting punctuation.

Examples of valid identifiers are:

* English language words: `color`, `image_width`, `line2`, `Photo`, `_unused`, ...
* ASCII words in foreign languages: `die_eisenbahn`, `el_tren`, `artikel_1_grundgesetz`

Since the primary beneficiaries are developers who prefer not using English identifiers, these examples wouldn’t be in foreign languages for all readers.

Contributor Author

Yeah, I tried to follow the spirit of the section to explain as if you were "teaching it to another Rust programmer", and since the explanation is in English I found it appropriate to refer to these languages as foreign. (Do you know a better word?)

Normally I would say: "Du kannst den Variablen auch deutsche Namen geben, Umlaute funktionieren auch." 😉

How about “other”? Compare the last example, which says “other scripts” instead of “foreign scripts”.

Contributor

I think that “non-English languages” works. But that distinction from the previous point isn’t really relevant in the first place. Identifiers made of ASCII letters that do not form a “real” word in any language are also valid. Maybe replace both points with “ASCII letters and digits”?

* words containing accented characters: `garçon`, `hühnervögel`
* identifiers in other scripts: `Москва`, `東京`, ...

Examples of invalid identifiers are:

* Keywords: `impl`, `fn`, `_` (underscore), ...
* Identifiers starting with numbers or containing "non letters": `42_the_answer`, `third√of7`, `◆◆◆`, ...
@JelteF JelteF Jun 7, 2018

I would actually really like it if the √ symbol were allowed as well (e.g. by whitelisting it). In an academic context this can bring formulas in code even closer to the formulas from papers and thus make them easier to read (I've actively missed this symbol when doing the same with Python 3). As an example I take the top-right formula from this picture:

let √ = f64::sqrt;
let σ² = σ.powi(2);
let exp = f64::exp;
let result = 1/√(2*π*σ²) * exp(-(1/2) * ((x - x_abs)))

instead of

let sqrt = f64::sqrt;
let σ² = σ.powi(2);
let exp = f64::exp;
let result = 1/sqrt(2*π*σ²) * exp(-(1/2) * ((x - x_abs)))

and also compare it to the most easy to read one I could think of without even using greek symbols:

let sqrt = f64::sqrt;
let sigma2 = sigma.powi(2);
let exp = f64::exp;
let result = 1/sqrt(2*pi*sigma2) * exp(-(1/2) * ((x - x_abs)/sigma))

Contributor Author

I would actually really like it if the √ symbol would be allowed as well

I am sympathetic to this idea but would really want to avoid diverging from UAX31 for a single character (or a few). I would prefer if someone created a "custom operators" RFC that allowed defining additional operators. This would enable the √ and many more. The intuition for an identifier should still be "a letter followed by zero or more letters or digits".

* Emojis: 🙂, 🦀, 💩, ...
Member

Bear in mind, lots of random things are RGI emoji in unicode, there's no single "category" for this.

Indeed, the emoji ⟨ℹ️⟩ would be a valid identifier.


Similar Unicode identifiers are normalized: `a1` and `a₁` (a<subscript 1>) refer to the same variable. This also applies to accented characters which can be represented in different ways.
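
As a rough illustration of what "refer to the same variable" means here, the following sketch compares identifiers after normalization. It is only a toy: `toy_normalize` and `same_identifier` are hypothetical names, and the mapping covers just the subscript digits U+2080 to U+2089 rather than full NFKC, which a real implementation would take from a Unicode library.

```rust
/// Toy stand-in for NFKC (assumption: only subscript digits are handled).
/// Maps U+2080..=U+2089 (SUBSCRIPT ZERO..NINE) to the ASCII digits 0..9
/// and leaves every other character unchanged.
fn toy_normalize(ident: &str) -> String {
    ident
        .chars()
        .map(|c| match c {
            '\u{2080}'..='\u{2089}' => {
                // The offset from SUBSCRIPT ZERO is the digit value.
                char::from_digit(c as u32 - 0x2080, 10).unwrap()
            }
            other => other,
        })
        .collect()
}

/// Two spellings name the same identifier if their normalized forms match.
fn same_identifier(a: &str, b: &str) -> bool {
    toy_normalize(a) == toy_normalize(b)
}

fn main() {
    // `a1` and `a₁` collapse to the same normalized spelling.
    assert!(same_identifier("a1", "a\u{2081}"));
    // Different digits stay different.
    assert!(!same_identifier("a1", "a2"));
}
```

Accented characters with several encodings (precomposed versus base letter plus combining mark) would be unified by the same comparison once a full normalizer is used in place of the toy mapping.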

To disallow any Unicode identifiers in a project (for example, to ease collaboration or for security reasons) and limit the accepted identifiers to ASCII, add this lint to the `lib.rs` or `main.rs` file of your project:

```rust
#![forbid(unicode_idents)]
```

Contributor

To split hairs: this should be non_ascii_idents.

Rust source files are always Unicode. The U+0061 to U+007A range is part of Unicode.

Member

Came here to say this 😄

Some Unicode characters look confusingly similar to each other or even identical, like the Latin **A** and the Cyrillic **А**. The compiler may warn you about easy to confuse names in the same scope. If needed (but not recommended), this warning can be silenced with a `#[allow(confusable_unicode_idents)]` annotation on the enclosing function or module.

## Usage notes

All code written in the Rust Language Organization (*rustc*, tools, std, common crates) will continue to only use ASCII identifiers and the English language.

For open source crates it is recommended to write them in English and use ASCII-only. An exception should be made if the application domain (e.g. math) benefits from Unicode and the target audience (e.g. for a crate interfacing with Russian passports) is comfortable with the used language and characters. Additionally crates should provide an ASCII-only API.

Private projects can use any script and language the developer(s) desire. It is still a good idea (as with any language feature) not to overuse it.
Contributor

I would perhaps weaken the language in these two paragraphs, with phrases like “suggested” and “should consider”. English is indeed the de-facto international language and ASCII-only idents are indeed more friendly to an international audience (if only for typing), but it is not a Rust RFC’s place to judge which other concerns are or are not acceptable reasons to do otherwise, or how much use is overuse.


# Reference-level explanation
[reference-level-explanation]: #reference-level-explanation

Identifiers in Rust are based on the [Unicode® Standard Annex #31 Unicode Identifier and Pattern Syntax][TR31]. Rust compilers shall use at least Revision 27 of the standard.

The revision should be specified exactly. Otherwise, the same identifier in the same Rust version could be valid in one compiler but not in another.

Contributor Author

I don't want to lock Rust compilers to a specific Unicode revision. An alternative would be to state the supported Unicode revision for each Rust edition.

New versions of Rust should use new versions of Unicode, of course, but any given version of Rust should have an unambiguous definition of identifiers. Your alternative is good.

Here is why the current text is wrong. Let’s say, for example, that revision 28 adds ⟨⍙⟩ to XID_Start, just in time for Rust 1.40.0. Would ⟨⍙abc⟩ be a valid identifier in Rust 1.40.0? It would be valid in compilers that use revision 28, but not in compilers that still use revision 27. Both compilers would be correct and yet disagree about something as basic as identifier validity.

Contributor

FWIW we’ve already been updating Unicode for the standard library (e.g. char::to_lowercase), and have https://doc.rust-lang.org/std/char/constant.UNICODE_VERSION.html to indicate which version is in use.

Contributor

UAX 31 should be referenced for information/context, but I think that its revision number can be removed entirely. This RFC defines identifier syntax based on the XID_Start and XID_Continue properties, which are specified exactly for a given version of Unicode (e.g. 10.0.0).

To maintain Rust’s stability promise, we need to ensure that:

  1. If a given string is a valid identifier in a given version of Rust, it needs to stay a valid identifier in later versions of Rust.
  2. If two strings are valid identifiers in a given version of Rust, whether they compare equal after normalization needs to be unchanged in later versions of Rust.

My reading of UAX 31 and 15 is that we can do that and still update to newer versions of Unicode in newer versions of Rust backward-compatibly without involving Rust editions. We should have some way to communicate the version mapping to users, and I think that the existing std::char::UNICODE_VERSION (with docs archive like https://doc.rust-lang.org/1.15.0/std/char/constant.UNICODE_VERSION.html) is already satisfactory for that.

This translates 1. above into the requirement UAX31-R1b Stable Identifier:

To meet this requirement, an implementation shall guarantee that identifiers are stable across versions of the Unicode Standard: that is, once a string qualifies as an identifier, it does so in all future versions.

Per https://www.unicode.org/reports/tr31/#Backward_Compatibility I believe that XID_Start XID_Continue* | "_" XID_Continue+ meets that requirement. (Since XID_Start / XID_Continue include Other_ID_Start / Other_ID_Continue respectively. I’d appreciate if someone could double-check and confirm my understanding.)

Additionally, code points that are not assigned in a given Unicode version cannot be in XID_Start or XID_Continue in that Unicode version. And https://www.unicode.org/reports/tr15/#Versioning defines:

It is crucial that Normalization Forms remain stable over time. That is, if a string that does not have any unassigned characters is normalized under one version of Unicode, it must remain normalized under all future versions of Unicode.

Updating to a newer Unicode does not change the result of NFKC(X) or NFKC(Y), so NFKC(X) = NFKC(Y) is also unchanged and requirement 2. above is also met.


@dscorbett

Let’s say, for example, that revision 28 adds ⟨⍙⟩ to XID_Start, just in time for Rust 1.40.0. Would ⟨⍙abc⟩ be a valid identifier in Rust 1.40.0?

It would be a Unicode version rather than a revision of UAX 31 that adds it, but yes, if Rust 1.40.0 is the one that upgrades to that version of Unicode (regardless of when that version of Unicode was released), then ⍙abc would be a valid ident in Rust 1.40.0 but not 1.39.0. This is similar to using a language feature that is new in 1.40.0.

Right. It was a rhetorical question to show why specifying the Unicode version exactly is necessary.

Member

Bear in mind, the unicode annexes also do get updated every few versions, often to accommodate for new kinds of code points but sometimes to just fix things.


The lexer defines identifiers as:

> **<sup>Lexer:</sup>**
> IDENTIFIER_OR_KEYWORD:
> &nbsp;&nbsp; XID_Start&nbsp;XID_Continue<sup>\*</sup>
Member

How does this compare to Go's identifier grammar: https://golang.org/ref/spec#Identifiers? It doesn't use XID_Start and XID_Continue, but instead the "Letter" and "Number, decimal digit" character classes.

Contributor Author

@pyfisch pyfisch Jun 3, 2018

XID_Start includes "Letter" but also includes "Number, Letter" (for example Roman numerals like ) and a few characters for compatibility and to ensure that XID_Start is closed under NFKC.

XID_Continue includes all of XID_Start and "Number, decimal digit", marks to build combining characters from parts (e + ^ → ê), 10 "Punctuation, Connector" and again a few for compatibility and NFKC.

There is golang/go#194 where combining characters are discussed as they are needed to write some Asian languages. See also: https://golang.org/doc/faq#unicode_identifiers

> &nbsp;&nbsp; | `_` XID_Continue<sup>+</sup>

The guide-level explanation says that identifiers may contain “more letters, digits and some connecting punctuation”, but XID_Continue does not include connecting punctuation.

Contributor Author

XID_Continue includes characters from the Pc class called "Punctuation, Connector".

So it does. I must have been looking at XID_Start.

>
> IDENTIFIER :
> IDENTIFIER_OR_KEYWORD <sub>*Except a [strict] or [reserved] keyword*</sub>

`XID_Start` and `XID_Continue` are used as defined in the aforementioned standard. The definition of identifiers is forward compatible with each successive release of Unicode as only appropriate new characters are added to the classes but none are removed.
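
The grammar above can be sketched as a small recognizer. This is a minimal sketch under a loud assumption: the standard library does not expose `XID_Start` or `XID_Continue`, so the hypothetical helpers `approx_xid_start` and `approx_xid_continue` fall back on `char::is_alphabetic` and `char::is_alphanumeric`, which only approximate those properties and accept some characters the real classes exclude.

```rust
/// Rough stand-in for XID_Start (assumption: Alphabetic is close enough
/// for illustration; the real property differs at the edges).
fn approx_xid_start(c: char) -> bool {
    c.is_alphabetic()
}

/// Rough stand-in for XID_Continue: letters, numbers and the underscore.
fn approx_xid_continue(c: char) -> bool {
    c.is_alphanumeric() || c == '_'
}

/// IDENTIFIER_OR_KEYWORD:
///     XID_Start XID_Continue*
///   | `_` XID_Continue+
fn is_identifier_or_keyword(s: &str) -> bool {
    let mut chars = s.chars();
    match chars.next() {
        // Second alternative: an underscore needs at least one more character.
        Some('_') => {
            let rest = chars.as_str();
            !rest.is_empty() && rest.chars().all(approx_xid_continue)
        }
        // First alternative: a start character, then any number of continues.
        Some(c) if approx_xid_start(c) => chars.all(approx_xid_continue),
        // Empty input or any other first character is rejected.
        _ => false,
    }
}

fn main() {
    for s in ["color", "_unused", "_", "42_the_answer", "Москва", "◆◆◆"] {
        println!("{}: {}", s, is_identifier_or_keyword(s));
    }
}
```

Keyword filtering (the `IDENTIFIER` rule) would be a separate lookup applied after this check.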

Two identifiers X, Y are considered to be equal if there [NFKC forms][TR15] are equal: NFKC(X) = NFKC(Y).

NFKC folds together some characters distinguished in natural language orthographies, such as Tifinagh ⟨ⵡ⟩ and ⟨ⵯ⟩.

Contributor

s/there/their/ ?

Contributor

I’d like this to go further and specify:

  • Parsers for Rust syntax normalize idents to NFKC
  • APIs such as proc_macro::Ident::new normalize to NFKC
  • As a consequence, identifiers are considered equal if their NFKC forms are equal (modulo hygiene concerns, which are out of scope for this RFC), and APIs such as proc_macro::Ident::to_string return a normalized string.


A `unicode_idents` lint is added to the compiler. This lint is `allow` by default. The lint checks if any identifier in the current context contains a codepoint with a value equal to or greater than 0x80 (outside ASCII range). Not only locally defined identifiers are checked but also those imported from other crates and modules into the current context.
Member

I don't think the lint should be allow by default. It should at least be warn by default. Starting to use non ascii idents should be a conscious choice, not an accidental one.

As for checking imported idents as well, I think this should be a separate lint. You might want to be able to import something from a foreign language crate but not want to have foreign language idents in your own code.

@shingtaklam1324 shingtaklam1324 Jun 4, 2018

Reposting my comment from #2455 (comment), I think that there should be different lints, with different levels, as some languages should be allow/warn by default, and some others should be deny by default.
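
For reference, the per-identifier test that the proposed `unicode_idents` lint describes (any code point with a value of 0x80 or above) might look like the sketch below. The function name `first_non_ascii` is hypothetical and this is not the compiler's implementation; it only illustrates the check on a single identifier string.

```rust
/// Returns the first code point outside the ASCII range (>= 0x80), if any.
fn first_non_ascii(ident: &str) -> Option<char> {
    ident.chars().find(|c| (*c as u32) >= 0x80)
}

fn main() {
    for ident in ["image_width", "garçon", "Москва"] {
        match first_non_ascii(ident) {
            // A lint would report the identifier at this point.
            Some(c) => println!("{}: non-ASCII code point U+{:04X}", ident, c as u32),
            None => println!("{}: ASCII only", ident),
        }
    }
}
```

Whether imported identifiers go through the same check, or through a separate lint as suggested in the comments, only changes which identifiers are fed to the function.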


## Confusable detection

Rust compilers should detect confusingly similar Unicode identifiers and warn the user about it.

Note: This is *not* mandatory for all Rust compilers, as it requires considerable implementation effort and is not related to the core function of the compiler. Rather, it is a tool to detect accidental misspellings and intentional homograph attacks.

A new `confusable_unicode_idents` lint is added to the compiler. The default setting is `warn`.

Note: The confusable detection is set to `warn` instead of `deny` to enable forward compatibility. The list of confusable characters will be extended in the future and programs that were once valid would fail to compile.

The confusable detection algorithm is based on [Unicode® Technical Standard #39 Unicode Security Mechanisms Section 4 Confusable Detection][TR39Confusable]. For every distinct identifier X in the current scope execute the function `skeleton(X)`. If there exist two distinct identifiers X and Yin the same crate where `skeleton(X) = skeleton(Y)` report it.

skeleton does not handle default-ignorable code points, but many are in XID_Continue. The lint should delete them before running skeleton.

Contributor Author

I see why this would be a good idea but if we decide to do it I would like to consult with the Unicode people first to find out why they did not do it in this way.

I suspect they didn’t do it that way because UTS #39’s confusability data’s “primary goal is to include characters that would be Status=Allowed”, and default-ignorable code points are Status=Restricted.

Typo: Yin -> Y in (missing space)

Member

I kinda want to get someone from the compiler team to approve that this is something that's definitely doable and won't adversely affect performance. This seems tricky to get right.

cc @eddyb @nikomatsakis

Member

I think we already have some logic like this but we only use it in the lexer when we see a character we don't know. Assuming running skeleton exactly once per distinct identifier isn't too expensive, then everything else is the hard problem of caching/memoization aka "hashmaps" (or "hashsets", after taste). So I expect it to work out.

But that's only if the check is crate-wide with no scoping taking into account.

Member

Yeah the current proposal involves scoping which I'm concerned about.

Member

"If there exist two distinct identifiers X and Y in the same crate where" seems to suggest scoping isn't involved?

Member

Elsewhere: "The compiler may warn you about easy to confuse names in the same scope."

There's also the line before this one, which mentions scope. @pyfisch, what did you intend here?

Contributor Author

It's a mistake that "scope" is mentioned in the first sentence.

The skeleton for each identifier should be computed and all identifiers in a crate should be compared against each other. I assume this is easiest with a hashmap as suggested by @eddyb.

Having to build just one hashmap for the whole crate should be faster to execute and easier to code than building one for each scope. (Correct me if I am wrong.)
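
The crate-wide approach with a single hashmap could be sketched as follows. This is an illustration under stated assumptions, not the proposed implementation: `toy_skeleton` is a hypothetical stand-in that folds only four Cyrillic letters onto their Latin look-alikes, whereas the real `skeleton` function is driven by the UTS #39 confusables data, and `find_confusables` is likewise a made-up name.

```rust
use std::collections::HashMap;

/// Toy stand-in for UTS #39 `skeleton` (assumption: a four-entry table
/// instead of the real confusables data).
fn toy_skeleton(ident: &str) -> String {
    ident
        .chars()
        .map(|c| match c {
            '\u{0410}' => 'A', // CYRILLIC CAPITAL LETTER A
            '\u{0430}' => 'a', // CYRILLIC SMALL LETTER A
            '\u{0435}' => 'e', // CYRILLIC SMALL LETTER IE
            '\u{043E}' => 'o', // CYRILLIC SMALL LETTER O
            other => other,
        })
        .collect()
}

/// Computes the skeleton once per identifier and reports every identifier
/// whose skeleton was already produced by a different identifier.
fn find_confusables<'a>(idents: &[&'a str]) -> Vec<(&'a str, &'a str)> {
    // One map for the whole crate: skeleton -> first identifier seen with it.
    let mut seen: HashMap<String, &'a str> = HashMap::new();
    let mut clashes = Vec::new();
    for &ident in idents {
        let key = toy_skeleton(ident);
        let previous: Option<&'a str> = seen.get(&key).copied();
        match previous {
            // Same skeleton, different spelling: a confusable pair.
            Some(first) if first != ident => clashes.push((first, ident)),
            // The very same identifier again: nothing to report.
            Some(_) => {}
            None => {
                seen.insert(key, ident);
            }
        }
    }
    clashes
}

fn main() {
    // Latin "A" versus Cyrillic "А" (U+0410).
    let clashes = find_confusables(&["A", "\u{0410}", "b"]);
    assert_eq!(clashes, vec![("A", "\u{0410}")]);
}
```

Each identifier costs one skeleton computation and one map lookup, which matches the expectation voiced above that the check is cheap as long as no scoping is involved.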

I think this should read:

Report a warning if there exist two distinct keywords or identifiers X and Y in, or importable into, the current crate for which skeleton(X) = skeleton(Y).

We'd forbid many identifiers as homographs of identifiers importable from std this way, so we might add some #![allow_homograph(..)] feature for overriding this, where .. is a crate name, but not the current crate.

@burdges burdges Jun 4, 2018

I think the "importable into" bit partially resolves the issues for which I suggested explicitly declaring scripts.


# Drawbacks
[drawbacks]: #drawbacks

* "ASCII is enough for anyone." As source code should be written in English and in English only (source: various people), no characters outside the ASCII range are needed to express identifiers. Therefore support for Unicode identifiers introduces unnecessary complexity to the compiler.
@CAFxX CAFxX Jun 6, 2018

"source code should be written in English and in English only" is not an argument, it's an opinion - one that is easily shown to be partial by pointing out how ballistic people can go when e.g. variable names don't convey the correct meaning of the variable they name; now think if, because you're not a good English speaker, they convey no meaning to you at all...

My point being that obviously English speakers see no problem whatsoever with using English variable names.

@kimhyunkang exposes even more reasons why this stance is terrible for diversity/inclusion in #2457 (comment)

Also keep in mind that especially for younger programmers (whose native language is not English) this doesn't hold. People try out programming while they are in school and have courses on English but aren't nearly fluent yet. It's much easier to write code in your native language at that point in life, so you can focus on learning how to code instead of also needing to learn English at the same time.

Source: I'm Dutch and although all code I write now is English I have definitely written Dutch code in my younger years when I wasn't fluent enough in English yet and was learning how to code. Luckily I wasn't held back by non ASCII identifiers, because

Contributor Author

@pyfisch pyfisch Jun 7, 2018

"source code should be written in English and in English only" is not an argument, it's an opinion

That's true. But I still hear this argument often. The underlying argument is that one needs to write in English so other people can read it. And even if you write code just for yourself you should write in English to learn it.

If you think this statement should be rephrased please suggest something else.

btw I had a similar experience as @JelteF

@CAFxX CAFxX Jun 11, 2018

The underlying argument is that one needs to write in English so other people can read it. And even if you write code just for yourself you should write in English to learn it.

Then the underlying argument is proved false by the existence and size of huge Chinese communities where English is definitely not the primary way of expressing yourself, nor it is required to look up information.

Don't get me wrong: this obviously holds true if you're in an English-speaking community, country or organization. What I'm arguing is that the converse is not true.

And even if you write code just for yourself you should write in English to learn it.

I think this corollary is preposterous because it boils down to: "since somebody arbitrarily decided for convenience that everybody else should write code in English, you have to learn English even if you are the only person who will ever read the code you write". To see how tone-deaf this argument is, replace "code" with any of "letter", "notes", "book" or "song" and then tell me if it makes sense...

* "Foreign characters are hard to type." Usually computer keyboards provide access to the US-ASCII printable characters and the local language characters. Characters from other scripts are difficult to type, require entering numeric codes or are not available at all. These characters either need to be copy-pasted or entered with an alternative input method.
* "Foreign characters are hard to read." If one is not familiar with the characters used it can be hard to tell them apart (e.g. φ and ψ) and one may not be able to refer to the identifiers in an appropriate way (e.g. "loop" and "trident" instead of phi and psi).
* "My favorite terminal/text editor/web browser has incomplete Unicode support." Even in 2018 some characters are not widely supported in all places where source code is usually displayed.
* Homoglyph attacks are possible. Without confusable detection identifiers can be distinct for the compiler but visually the same. Even with confusable detection there are still similar looking characters that may be confused by the casual reader.
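
To make the homoglyph concern concrete, here is a minimal sketch using only the Rust standard library. The helper `code_points` and the example strings are illustrative assumptions, not part of the RFC. It shows two strings that look almost identical in many fonts but differ in one code point.

```rust
/// Hypothetical helper: lists the code points of `ident` as `U+XXXX` strings.
fn code_points(ident: &str) -> Vec<String> {
    ident.chars().map(|c| format!("U+{:04X}", c as u32)).collect()
}

fn main() {
    // All Latin letters.
    let latin = "scope";
    // Second letter is CYRILLIC SMALL LETTER ES (U+0441), which resembles Latin "c".
    let mixed = "s\u{0441}ope";

    // The compiler sees two different strings even if a reader may not.
    assert_ne!(latin, mixed);

    println!("{:?}", code_points(latin));
    println!("{:?}", code_points(mixed));
}
```

Without confusable detection nothing flags the second spelling; the difference only shows up when the code points are inspected.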

# Rationale and alternatives
[alternatives]: #alternatives

As stated in [Motivation](#motivation), allowing Unicode identifiers outside the ASCII range improves Rust's accessibility for developers not working in English. Especially in teaching and when the application domain vocabulary is not in English it can be beneficial to use names from the native language. To facilitate this it is necessary to allow a wide range of Unicode characters in identifiers. The proposed implementation based on Unicode TR31 is already used by other programming languages (e.g. Python 3) and is implemented behind the `non_ascii_idents` feature gate in *rustc*, but the existing implementation lacks the proposed NFKC normalization.

Possible variants:

1. Require all identifiers to be in NFKC or NFC form.
2. Two identifiers are only equal if their codepoints are equal.
3. Perform NFC mapping instead of NFKC mapping for identifiers.
4. Only a number of common scripts could be supported.

Other possible variants are the restriction levels of UTS #39.

Contributor Author

This is listed as variant 5.

5. A [restriction level][TR39Restriction] is specified, allowing only a subset of scripts and limiting script-mixing within an identifier.
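
As a rough illustration of variant 2, the sketch below compares identifiers strictly by code point using only the standard library. The function name `same_identifier_by_code_points` is a hypothetical name chosen for this example. Under this rule the precomposed and decomposed spellings of "café" are different identifiers, whereas NFC or NFKC normalization would map both to the same form. Normalization itself is not shown because the standard library does not provide it.

```rust
/// Variant 2 (sketch): two identifiers are equal only if their
/// code point sequences are equal. No normalization is applied.
fn same_identifier_by_code_points(a: &str, b: &str) -> bool {
    a.chars().eq(b.chars())
}

fn main() {
    // "é" as the single code point U+00E9.
    let precomposed = "caf\u{00E9}";
    // "e" followed by U+0301 COMBINING ACUTE ACCENT.
    let decomposed = "cafe\u{0301}";

    // Rendered identically, but distinct under variant 2.
    assert!(!same_identifier_by_code_points(precomposed, decomposed));
    assert_eq!(precomposed.chars().count(), 4);
    assert_eq!(decomposed.chars().count(), 5);
}
```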

An alternative design would use [Immutable Identifiers][TR31Alternative] as done in [C++]. In this case a list of Unicode codepoints is reserved for syntax (ASCII operators, braces, whitespace) and all other codepoints (including currently unassigned codepoints) are allowed in identifiers. The advantages are that the compiler does not need to know the Unicode character classes XID_Start and XID_Continue for each character and that the set of allowed identifiers never changes. The disadvantage is that every character not explicitly excluded when the list is created can be used in identifiers. This allows developers to create identifiers that can't be recognized as such. It also impedes other uses of Unicode in Rust syntax, like custom operators, if the characters were not initially reserved.
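
The immutable-identifier approach can be sketched as follows. This is an assumption-laden toy: the reserved set here (ASCII punctuation except `_`, ASCII whitespace, ASCII control characters) and the function names are hypothetical and are not the list used by C++ or proposed anywhere in this RFC. It illustrates both the simplicity (no Unicode tables are needed) and the drawback (invisible or non-ASCII whitespace characters are accepted).

```rust
/// Hypothetical reserved set: characters kept for syntax.
fn is_reserved_for_syntax(c: char) -> bool {
    (c.is_ascii_punctuation() && c != '_') || c.is_ascii_whitespace() || c.is_ascii_control()
}

/// Sketch of an immutable-identifier check: anything that is not
/// reserved is allowed, except that an identifier may not be empty
/// or start with an ASCII digit.
fn is_immutable_identifier(s: &str) -> bool {
    let starts_with_digit = s.chars().next().map_or(false, |c| c.is_ascii_digit());
    !s.is_empty() && !starts_with_digit && !s.chars().any(is_reserved_for_syntax)
}

fn main() {
    assert!(is_immutable_identifier("foo_bar"));
    assert!(is_immutable_identifier("größe"));
    assert!(!is_immutable_identifier("a+b"));
    assert!(!is_immutable_identifier("1abc"));
    // Drawback: ZERO WIDTH SPACE (U+200B) was never reserved, so it is accepted.
    assert!(is_immutable_identifier("a\u{200B}b"));
}
```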

It is always a possibility to do nothing and limit identifiers to ASCII.

It has been suggested that Unicode identifiers should be opt-in instead of opt-out. The proposal chooses opt-out to benefit the international Rust community. New Rust users should not need to search for the configuration option they may not even know exists. Additionally it simplifies tutorials in other languages as they can omit an annotation in every code snippet.

## Confusable detection

The current design was chosen because the algorithm and list of similar characters are already provided by the Unicode Consortium. A different algorithm and list of characters could be created. I am not aware of any other programming language implementing confusable detection. The confusable detection was primarily included because homoglyph attacks are a huge concern for some members of the community.

As an alternative to confusable detection, `forbid(unicode_idents)` is sufficient to protect projects written in English from homoglyph attacks. Projects using different languages are probably either written by students, by a small group or inside a regional company. These projects are not threatened as much as large open source projects by homoglyph attacks but still benefit from the easier debugging of typos.

# Prior art
[prior-art]: #prior-art

"[Python PEP 3131][PEP 3131]: Supporting Non-ASCII Identifiers" is the Python equivalent to this proposal. The proposed identifier grammar **XID_Start&nbsp;XID_Continue<sup>\*</sup>** is identical to the one used in Python 3.

[JavaScript] supports Unicode identifiers based on the same Default Identifier Syntax but does not apply normalization.

The [CPP reference][C++] describes the allowed Unicode identifiers; C++ is based on the immutable identifier principle.

[Java] also supports Unicode identifiers. Characters must belong to a number of Unicode character classes similar to the XID_Start and XID_Continue classes used in Python. Unlike in Python no normalization is performed.

The [Go language][Go] allows identifiers in the form **Letter (Letter | Number)\*** where **Letter** is a Unicode letter and **Number** is a Unicode decimal number. This is more restricted than the proposed design, mainly because it does not allow combining characters needed to write some languages such as Hindi.

# Unresolved questions
[unresolved]: #unresolved-questions

* Which context is adequate for confusable detection: file, current scope, crate?
* Are Unicode characters allowed in `no_mangle` and `extern fn`s?
* How do Unicode names interact with the file system?
* Are crates with Unicode names allowed and can they be published to crates.io?
* Are `unicode_idents` and `confusable_unicode_idents` good names?
* Should [ZWNJ and ZWJ be allowed in identifiers][TR31Layout]?
* Should *rustc* accept files in a different encoding than *UTF-8*?
Contributor

This one is easy: no. Why would it? UTF-8 by definition supports all of Unicode. Also, source files encodings seem off-topic for this RFC.


[PEP 3131]: https://www.python.org/dev/peps/pep-3131/
[TR15]: https://www.unicode.org/reports/tr15/
[TR31]: http://www.unicode.org/reports/tr31/
[TR31Alternative]: http://unicode.org/reports/tr31/#Alternative_Identifier_Syntax
[TR31Layout]: https://www.unicode.org/reports/tr31/#Layout_and_Format_Control_Characters
[TR39Confusable]: https://www.unicode.org/reports/tr39/#Confusable_Detection
[TR39Restriction]: https://www.unicode.org/reports/tr39/#Restriction_Level_Detection
[C++]: https://en.cppreference.com/w/cpp/language/identifiers
[Julia Unicode PR]: https://github.com/JuliaLang/julia/pull/19464
[Java]: https://docs.oracle.com/javase/specs/jls/se10/html/jls-3.html#jls-3.8
[JavaScript]: http://www.ecma-international.org/ecma-262/6.0/#sec-names-and-keywords
[Go]: https://golang.org/ref/spec#Identifiers