Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new lint: Mixed locale ident #7376

Closed
wants to merge 16 commits into from

Conversation

popzxc
Copy link
Contributor

@popzxc popzxc commented Jun 19, 2021

This PR adds a new lint to check that the identifier name has multiple locales.

I think that it's not the thing that must happen normally, as it both makes hand-writing the code much harder, and can lead to confusing errors (rustc's built-in lint mixed_script_confusables can be implicitly shadowed, which makes it not really reliable).

stderr example:

error: multiple locales used in identifier `Blоck`: Cyrillic, Latin
  --> $DIR/mixed_locale_idents.rs:5:12
   |
LL | pub struct Blоck;
   |            ^^^^^
   |
   = note: `-D clippy::mixed-locale-idents` implied by `-D warnings`

error: multiple locales used in identifier `black_чёрный_黒い_काला`: Devanagari, Cyrillic, Hiragana, Han, Latin
  --> $DIR/mixed_locale_idents.rs:8:9
   |
LL |     let black_чёрный_黒い_काला = "good luck hand-writing it";
   |         ^^^^^^^^^^^^^^^^^^^^^^

error: aborting due to 2 previous errors

Please write a short comment explaining your change (or "none" for internal only changes)

changelog: [`mixed_locale_idents`]

@rust-highfive
Copy link

r? @phansch

(rust-highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties label Jun 19, 2021
@xFrednet
Copy link
Member

What would the lint say about a value name like nutzer_zähler (German for user_counter)? Does the included library figure out that both words can originate from the same alphabet?

@popzxc
Copy link
Contributor Author

popzxc commented Jun 19, 2021

What would the lint say about a value name like nutzer_zähler (German for user_counter)?

Good question and great test case! Added it, and yes, it does not spawn the lint (I'm not a specialist in unicode by any means, but AFAIK diaeresis doesn't move ä into any different locale).

Copy link
Member

@flip1995 flip1995 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ping @Manishearth since you maintain the unicode crates in question here. Do you think adding this dep to Clippy (with the same configuration as in rustc) is fine?

clippy_lints/Cargo.toml Outdated Show resolved Hide resolved
Co-authored-by: Philipp Krones <hello@philkrones.com>
@Manishearth
Copy link
Member

Manishearth commented Jun 21, 2021

Strongly against this: I designed the lints that are in rustc for this and they were pretty carefully designed, over months of discussion. This particular thing was not considered to be a case worth addressing because idents like try_看 actually do make sense to have; and a core principle behind the design of those lints was to avoid malicious and confusing situations without actually harming legitimate use.

I don't understand why "handwriting the code is hard" is a concern at all; nor do I understand what you mean by the builtin getting implicitly shadowed.

@Manishearth
Copy link
Member

Manishearth commented Jun 21, 2021

Oh, I see what you mean by "shadowing". Yes, this was also considered when designing the rustc lint, and also determined to not be worth it: if someone has chosen to use Cyrillic in their code it's not worth it to try and nitpick "good usage" from "bad usage" because those can be pretty linked.

There are some other potential designs here: e.g. warning about mixed locales when it's only mixed-script confusables and only when not separated by underscores. But the current proposed design will catch too many legitimate use cases. Probably more than illegitimate ones, even, this kind of situation is super rare to trigger by accident.

It does sound like a reasonable style guideline to not mix scripts across underscores in a way that one of the scripts is purely using confusables.

@popzxc
Copy link
Contributor Author

popzxc commented Jun 21, 2021

That's why it's in clippy and not in the compiler, isn't it? It's completely OK to 'allow' lints you find not suiting your style.

I guess not every project will be happy with the approach in rustc (at least I am personally have plans to actively dogfood this lint in projects I participate in).

Maybe just put this lint into 'pedantic' category?

@Manishearth
Copy link
Member

That's why it's in clippy and not in the compiler, isn't it? It's completely OK to 'allow' lints you find not suiting your style.

Yes, but adding it to clippy still makes it a value judgement; and I do not consider this a good value judgement in this case. I'm not making this comment as a personal comment of style, I'm making this comment as a clippy maintainer who thinks that we need to be certain of the value judgements we are making when we add lints and as the person who did the research and design of the non ascii idents RFC. The lint as currently posed would likely warn on a lot of good code, more than the bad code.

Also 99% of the people who would enable this by default probably would also be okay with #[warn(non_ascii_idents)] IMO. Otherwise I would think this makes some sense as a restriction lint.

There's a way to do this better that I already proposed, which would work as a pedantic lint, though it's trickier to implement.

@Manishearth
Copy link
Member

Another thing that I would be fine with adding would be a restriction lint that lets you allowlist the set of scripts allowed in your codebase based on a clippy config.

@popzxc
Copy link
Contributor Author

popzxc commented Jun 21, 2021

Well, actually it makes sense.

Gonna implement.

@popzxc
Copy link
Contributor Author

popzxc commented Jun 22, 2021

I think it's done.
You can see updated test examples here.

Does it look better now?

@flip1995
Copy link
Member

r? @Manishearth (reassigning since @phansch is taking a break)

@rust-highfive rust-highfive assigned Manishearth and unassigned phansch Jun 22, 2021
Copy link
Member

@Manishearth Manishearth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should probably be a pedantic lint

still kinda feel like a restriction lint with a list of scripts is the more extensible way to go.

#[derive(Debug)]
enum Case {
/// E.g. `SomeStruct`, delimiter is uppercase letter.
Camel,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay so a problem with this is that not all writing systems have uppercase. More thought needs to be put into how that will work.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But perhaps using only mixed script confusables without underscores in a character is enough for us for now. Idk.

@popzxc
Copy link
Contributor Author

popzxc commented Jun 22, 2021

still kinda feel like a restriction lint with a list of scripts is the more extensible way to go.

I was going to implement it after this one gets merged, as script-detection dependency is introduced here.

I don't think that having two lints in one PR is going to make things easier, assuming that it's already kinda sensitive 😅

@popzxc
Copy link
Contributor Author

popzxc commented Jun 22, 2021

However, this going farther and farther from my original intent, and become more and more complicated.
I guess it makes no sense to build a hardly-usable Frankenstein with blurry use-cases (initially I wanted to provide simple lint that is pretty restrictive, but now it going towards an alternative version of mixed_script_confusables, which is definitely not something I was aiming for).

I'll close this PR and will implement a restriction lint for locales instead.

@popzxc popzxc closed this Jun 22, 2021
@Manishearth
Copy link
Member

Thank you!

@popzxc popzxc deleted the mixed-locale-ident branch June 25, 2021 09:37
bors added a commit that referenced this pull request Jun 30, 2021
New lint: `disallowed_script_idents`

This PR implements a new lint to restrict locales that can be used in the code,
as proposed in #7376.

Current concerns / unresolved questions:

- ~~Mixed usage of `script` (as a Unicode term) and `locale` (as something that is easier to understand for the broad audience). I'm not sure whether these terms are fully interchangeable and whether in the current form it is more confusing than helpful.~~ `script` is now used everywhere.
- ~~Having to mostly copy-paste `AllowedScript`. Probably it's not a big problem, as the list of scripts is standardized and is unlikely to change, and even if we'd stick to the `unicode_script::Script`, we'll still have to implement custom deserialization, and I don't think that it will be shorter in terms of the amount of LoC.~~ `unicode::Script` is used together with a filtering deserialize function.
- Should we stick to the list of "recommended scripts" from [UAX #31](http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts) in the configuration?

*Please write a short comment explaining your change (or "none" for internal only changes)*

changelog: ``[`disallowed_script_idents`]``

r? `@Manishearth`
bors added a commit that referenced this pull request Jun 30, 2021
New lint: `disallowed_script_idents`

This PR implements a new lint to restrict locales that can be used in the code,
as proposed in #7376.

Current concerns / unresolved questions:

- ~~Mixed usage of `script` (as a Unicode term) and `locale` (as something that is easier to understand for the broad audience). I'm not sure whether these terms are fully interchangeable and whether in the current form it is more confusing than helpful.~~ `script` is now used everywhere.
- ~~Having to mostly copy-paste `AllowedScript`. Probably it's not a big problem, as the list of scripts is standardized and is unlikely to change, and even if we'd stick to the `unicode_script::Script`, we'll still have to implement custom deserialization, and I don't think that it will be shorter in terms of the amount of LoC.~~ `unicode::Script` is used together with a filtering deserialize function.
- Should we stick to the list of "recommended scripts" from [UAX #31](http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts) in the configuration?

*Please write a short comment explaining your change (or "none" for internal only changes)*

changelog: ``[`disallowed_script_idents`]``

r? `@Manishearth`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-review Status: Awaiting review from the assignee but also interested parties
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants